Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo
Matt Maurano et al., Nat. Genetics 2015

CATO (Contextual Analysis of TF Occupancy) scores

CATO scores pre-computed for dbSNP

  1. We recommend considering the CATO score for SNPs outside a DHS to be 0.

  2. While CATO scores are not themselves cell-type specific, the cell types with a DHS are listed in the Cell_types column. For studies focusing on a predefined subset of relevant cell types, SNPs without a DHS in the appropriate cell type should be treated as having score = 0. This can also be done by intersecting the CATO scores with the DHS tracks available below using `bedops -e`.
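The per-cell-type zeroing described in item 2 can be sketched with plain awk. This is only an illustration under stated assumptions: the column layout (chrom, start, end, rsID, score, comma-separated Cell_types), the toy rows, and the cell type names are invented here and are not the layout of the distributed file, so the column numbers would need adjusting to match the actual download.

```shell
# Toy CATO table (hypothetical layout): chrom, start, end, rsID, score, Cell_types
printf 'chr1\t100\t101\trs1\t0.42\tCD4,HepG2\n'  > cato.toy.bed
printf 'chr1\t500\t501\trs2\t0.17\tHepG2\n'     >> cato.toy.bed
printf 'chr2\t900\t901\trs3\t0.65\tCD4\n'       >> cato.toy.bed

# Cell type of interest (hypothetical name)
cell="CD4"

# Set the score (column 5) to 0 unless the Cell_types list (column 6)
# contains an exact match for the requested cell type.
awk -F '\t' -v cell="$cell" 'BEGIN {OFS="\t"} {
    n = split($6, types, ",")
    found = 0
    for (i = 1; i <= n; i++) if (types[i] == cell) found = 1
    if (!found) $5 = 0
    print
}' cato.toy.bed > cato.CD4.bed

cat cato.CD4.bed
```

Splitting the list and comparing whole entries, rather than using a substring match, avoids treating a cell type as present when its name is merely a prefix of another.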

R linear models

RData object containing two lists of models which can be used with the predict() function in R:

Master list of DHSs

These master lists were used for MCV (standing for "multi-cell verified" and indicating cell-type selectivity) calculations (in conjunction with `bedmap --count`).
The lists are compressed in starch format; to extract them, install the BEDOPS package and run `unstarch aisamples.dhs.hotspots.starch`, which writes BED to standard output.
The name (4th) column of the BED output contains the sample name. To make individual .bed files per cell type, try:
unstarch aisamples.dhs.hotspots.starch | awk -F "\t" 'BEGIN {OFS="\t"} {print > ($4 ".bed")}'
The 5th column contains the filtered DNase-seq tag count in the listed sample, and the 6th column is the count normalized to 1M tags per sample. These master lists include a selected set of malignant or immortalized lines that were utilized in certain analyses.
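For readers without BEDOPS at hand, the operation that `bedmap --count` performs (counting how many master-list elements overlap each query region) can be sketched in plain awk. This is a naive stand-in, not the command used for the published calculations; the toy coordinates, sample names, and file names are invented for illustration, and the loop is quadratic, so it only suits small inputs.

```shell
# Toy master list (invented values): chrom, start, end, sample, tags, tags per 1M
printf 'chr1\t100\t200\tsampleA\t50\t2.5\n'  > master.toy.bed
printf 'chr1\t150\t250\tsampleB\t80\t3.1\n' >> master.toy.bed
printf 'chr1\t400\t500\tsampleA\t20\t1.0\n' >> master.toy.bed
printf 'chr2\t100\t200\tsampleC\t35\t1.8\n' >> master.toy.bed

# Toy query regions (invented): chrom, start, end
printf 'chr1\t120\t180\n'  > query.toy.bed
printf 'chr1\t300\t350\n' >> query.toy.bed
printf 'chr2\t199\t260\n' >> query.toy.bed

# First pass stores the master list; second pass counts, for each query,
# the elements on the same chromosome whose half-open interval overlaps it.
awk -F '\t' 'BEGIN {OFS="\t"}
NR == FNR { c[NR] = $1; s[NR] = $2 + 0; e[NR] = $3 + 0; n = NR; next }
{
    count = 0
    for (i = 1; i <= n; i++)
        if (c[i] == $1 && s[i] < $3 + 0 && e[i] > $2 + 0) count++
    print $1, $2, $3, count
}' master.toy.bed query.toy.bed > query.counts.bed

cat query.counts.bed
```

On real data, BEDOPS is the better choice, since it works on sorted input in a single pass.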

TF Clusters

We have organized PWMs from major databases (TRANSFAC, JASPAR, UniPROBE, Taipale) into clusters of similar sequence specificities by clustering TOMTOM similarity scores:

Mapped tags

Per-sample BAM files are available containing the filtered tags used in the analysis.


Partially filtered genotypes [ VCF (3GB) | .vcf.tbi | .vcfidx ]

Note that these genotypes represent an intermediate analysis file. Please see the description in the Online Methods for further details, and consider restricting any analysis to the SNPs in Supplementary Data Set 1.

Allelic imbalance per cell type

Format: one file per cell type, in bed3+boolean format (the boolean indicates whether the site was imbalanced in that cell type); snps.multicell.bed contains bed3 data describing all SNPs tested in at least one cell type [ tgz ]

These are the data from Fig 3b-e
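The per-cell-type files can be combined into a per-SNP summary, sketched below with awk. Several things here are assumptions rather than facts about the archive: the boolean is taken to be encoded as 1/0 in the 4th column, each file is taken to list only the SNPs tested in that cell type, and the file names and rows are invented toy data.

```shell
# Toy per-cell-type files (invented): chrom, start, end, imbalanced flag (assumed 1/0)
printf 'chr1\t100\t101\t1\nchr1\t500\t501\t0\n' > cellA.toy.bed
printf 'chr1\t100\t101\t1\nchr2\t900\t901\t1\n' > cellB.toy.bed
printf 'chr1\t500\t501\t0\nchr2\t900\t901\t0\n' > cellC.toy.bed

# For every SNP, count the cell types in which it appears (tested) and
# the cell types in which it is flagged as imbalanced.
awk -F '\t' 'BEGIN {OFS="\t"} {
    key = $1 OFS $2 OFS $3
    tested[key]++
    if ($4 == 1) imbalanced[key]++
}
END {
    for (key in tested) print key, tested[key], imbalanced[key] + 0
}' cellA.toy.bed cellB.toy.bed cellC.toy.bed | sort -k1,1 -k2,2n > imbalance.summary.bed

cat imbalance.summary.bed
```

The output columns are chrom, start, end, number of cell types tested, and number of cell types imbalanced; if the real files encode the boolean differently (for example TRUE/FALSE), the comparison on column 4 would need to change accordingly.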

Errata and comments on the published manuscript