CAGEd-oPOSSUM Help

Contents

Overview
Species and Assemblies
General Description
Basic Algorithm
Pre-computed vs. on-the-fly analysis
Statistical Analysis
Z-score
Fisher score
Understanding the Results
Detailed Description of Input Options and Results
Download Software
FAQ
Citing CAGEd-oPOSSUM



Overview

CAGEd-oPOSSUM is a web-based tool which may be used to determine the over-representation of transcription factor binding sites (TFBS) within CAGE peak regions. The input consists of a foreground (target) set of CAGE peaks, a suitable background set of CAGE peaks (or compositionally matched random genomic background) and a set of TFBS profiles. Optional filters may be applied to both the target and background CAGE peaks and TFBS search parameters specified. CAGEd-oPOSSUM then compares the frequency of binding sites for each transcription factor (TF) in the target versus the background and the degree of over-representation is measured statistically.

Species and Assemblies

FANTOM5 provides expression data for human and mouse. All coordinates used in CAGEd-oPOSSUM are based on the human GRCh37 (hg19) or mouse NCBI37 (mm9) assemblies.

General Description

The target and background CAGE peaks may either be selected from the FANTOM5 CAGE peak data or supplied as your own "custom" CAGE peaks from some other source. If FANTOM5 CAGE peak data is used, it may be specified by either selecting one or more samples from the FANTOM5 sample ontology tree along with a minimum level of expression of those CAGE peaks within the selected samples or by providing a list of specific FANTOM5 CAGE peak IDs. If custom CAGE peaks are provided, the CAGE peak coordinates are specified in BED format. The background may also be generated as a random set of genomic sequences which are %GC composition and length matched to the target set using the HOMER software (http://homer.salk.edu/homer).

Various optional filters may be provided. For FANTOM5 CAGE peaks, these filters include (1) limiting CAGE peaks to only those classified as TSSs by the FANTOM5 TSS classifier (http://fantom.gsc.riken.jp/5/datafiles/phase1.3/extra/TSS_classifier/TSSpredictionREADME.pdf), (2) limiting CAGE peaks to only those associated with genes in a provided list of gene identifiers, or (3) limiting the CAGE peak TFBS search regions to only those portions which overlap a provided set of filtering regions. If you use your own CAGE peaks, then only the option to filter by a set of regions is available. If more than one filter is provided, then only CAGE peaks and/or portions of the CAGE peak regions which pass ALL the filters are retained.

The TFBS search parameter options include which transcription factor binding site profiles to use, what scoring threshold to apply to these binding sites, how much flanking region to apply around the CAGE peaks in which to search for TFBS and how the results will be displayed. TFBS profiles may be specified by selecting from the set of JASPAR CORE vertebrate profiles or by providing your own TFBS profile position frequency matrices (PFMs).

For each transcription factor (TF), the system uses two different statistical measures of significance of TFBS over-representation. The Z-score compares the number of binding site "hits" in the target set against the number of hits in the background set, whereas the Fisher score compares the proportion of target sequences (CAGE peak regions) containing at least one TFBS with the proportion of background sequences containing at least one TFBS. The relative rankings of these two scoring methods can thus be used to determine which TFs are considered over-represented in the target set.

For a detailed explanation of the various input options and output results formats, please see the Detailed Description of Input Options and Results page.

Basic Algorithm

Once you have selected target and background CAGE peaks (or randomly generated %GC composition and length matched regions), any applicable target or background filters and TFBS search parameters, the analysis is launched and the following steps take place.

Any applicable filters you chose are applied. Filters may have been provided for either or both the target and background CAGE peaks. The one exception is that, in the case of a randomly generated %GC composition and length matched background, no filters are applicable. If you chose to filter the CAGE peaks by TSS status, then any CAGE peaks not classified as TSSs by the FANTOM5 TSS classifier are removed from the set. If you chose to limit CAGE peaks to those associated with specific genes, then any CAGE peaks which are not associated with any of the genes provided are filtered out. NOTE: these two filters only apply if you chose FANTOM5 CAGE peaks in the first step. Filtering by genomic regions applies to both FANTOM5 CAGE peaks and your own "custom" CAGE peaks. This filter is applied later in the process as described in the next paragraph.

CAGEd-oPOSSUM then applies the upstream / downstream flanking regions (chosen in the TFBS search parameters step) to each of the CAGE peaks to create initial CAGE peak regions. CAGE peak regions which overlap are merged together into larger regions. If you chose to filter the CAGE peaks by a set of genomic regions (this option is available for both FANTOM5 CAGE peaks and your own "custom" CAGE peaks), then these filtering regions are applied and only the portions of the CAGE peak regions which overlap the filtering regions are used in the analysis, i.e. the intersection of the merged CAGE peak regions and the filtering regions is used as the final set of regions to search for TFBS.
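
The flank, merge and intersect steps described above can be sketched in a few lines of Python. This is only an illustration and not the CAGEd-oPOSSUM implementation: the function names are hypothetical, regions are assumed to lie on a single chromosome as half-open (start, end) pairs as in BED, strand is ignored when applying flanks, and regions that merely touch are assumed not to overlap.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # half-open [start, end) on one chromosome (assumption)


def add_flanks(peaks: List[Region], upstream: int, downstream: int) -> List[Region]:
    """Extend each peak by the chosen flank sizes (strand ignored for simplicity)."""
    return [(max(0, start - upstream), end + downstream) for start, end in peaks]


def merge_overlapping(regions: List[Region]) -> List[Region]:
    """Merge regions that overlap into larger, non-overlapping regions."""
    merged: List[Region] = []
    for start, end in sorted(regions):
        if merged and start < merged[-1][1]:
            # Overlaps the previous region: extend it if necessary.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(regions: List[Region], filters: List[Region]) -> List[Region]:
    """Keep only the portions of `regions` that overlap a filtering region."""
    kept: List[Region] = []
    for r_start, r_end in regions:
        for f_start, f_end in filters:
            start, end = max(r_start, f_start), min(r_end, f_end)
            if start < end:
                kept.append((start, end))
    return sorted(kept)


# Made-up peak and filter coordinates, for illustration only.
peaks = [(1000, 1020), (1100, 1130), (5000, 5010)]
flanked = add_flanks(peaks, upstream=100, downstream=100)
merged = merge_overlapping(flanked)
filters = merge_overlapping([(950, 1050), (5050, 6000)])
search_regions = intersect(merged, filters)
```

Here the first two flanked peaks overlap and are merged into one region, and only the parts of the merged regions that fall inside the filtering regions remain in `search_regions`.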

CAGEd-oPOSSUM then uses the selected TFBS profile matrices to scan these regions for putative binding sites which score above the selected score threshold. By comparing the frequency of TFBS in the target set of CAGE peak regions to the frequency of TFBS in the background set, a measure of the degree of over-representation of each TF's binding sites is obtained. The results display the TFs ranked by their degree of over-representation. Two different statistical tests of over-representation are applied to obtain these rankings: the Z-score and the one-tailed Fisher exact probability. These are described in more detail in the Statistical Analysis section below.

Pre-computed vs. on-the-fly analysis

A pre-computation was performed in which flanking regions of 2000 bp (the maximum allowed in the analysis) were applied both up- and downstream of each FANTOM5 CAGE peak. Any overlapping regions which resulted were merged to form a set of maximally spanning non-overlapping regions. The sequences corresponding to these regions were retrieved and scanned with all JASPAR CORE vertebrate TFBS profiles which have an information content (specificity) of at least 8 bits (the minimum allowed in the analysis). The predicted binding sites which had a relative motif score of at least 80% (the minimum allowed in the analysis) were retained and all maximally spanning regions and predicted binding sites were stored in the CAGEd-oPOSSUM database.
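
To make the two thresholds mentioned above more concrete, the following Python sketch shows one common way of computing the information content of a profile and the relative score of a candidate site. It is an assumption-laden illustration rather than the code used by CAGEd-oPOSSUM: the function names, the uniform background, the pseudocount and the forward-strand-only scan are all choices made here for brevity, and sequences are assumed to contain only A, C, G and T.

```python
import math

BASES = "ACGT"


def information_content(pfm):
    """Total information content in bits, assuming a uniform background."""
    total = 0.0
    for i in range(len(pfm["A"])):
        column_sum = sum(pfm[b][i] for b in BASES)
        total += 2.0  # maximum of 2 bits per column for four bases
        for b in BASES:
            p = pfm[b][i] / column_sum
            if p > 0:
                total += p * math.log2(p)
    return total


def to_pwm(pfm, pseudocount=1.0):
    """Convert counts to log-odds scores (pseudocount is an assumed value)."""
    pwm = []
    for i in range(len(pfm["A"])):
        column_sum = sum(pfm[b][i] for b in BASES) + pseudocount
        pwm.append({
            b: math.log2(((pfm[b][i] + pseudocount / 4) / column_sum) / 0.25)
            for b in BASES
        })
    return pwm


def scan(sequence, pwm, threshold=0.80):
    """Return (position, relative score) for forward-strand hits >= threshold."""
    min_score = sum(min(col.values()) for col in pwm)
    max_score = sum(max(col.values()) for col in pwm)
    hits = []
    for start in range(len(sequence) - len(pwm) + 1):
        window = sequence[start:start + len(pwm)]
        score = sum(col[base] for col, base in zip(pwm, window))
        relative = (score - min_score) / (max_score - min_score)
        if relative >= threshold:
            hits.append((start, relative))
    return hits


# A toy 4 bp profile that only ever observed the site "ACGT".
toy_pfm = {"A": [10, 0, 0, 0], "C": [0, 10, 0, 0],
           "G": [0, 0, 10, 0], "T": [0, 0, 0, 10]}
toy_ic = information_content(toy_pfm)
toy_hits = scan("TTACGTTT", to_pwm(toy_pfm), threshold=0.80)
```

The relative score rescales the raw score so that the worst possible site for a profile scores 0 and the best possible site scores 1, which is what allows a single percentage threshold such as 80% to be applied across profiles of different widths.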

For analyses in which you select FANTOM5 CAGE peaks and also choose JASPAR TFBS profiles, the pre-computed TFBS are retrieved directly from the database. For analyses in which you use your own custom CAGE peaks or custom TFBS profiles or in which you use a randomly generated compositionally matched background, CAGEd-oPOSSUM performs the computation described above on-the-fly. Note that the target and background TFBS are computed independently, so it is quite possible that the target set is retrieved from the pre-computed database and the background is computed on-the-fly (or vice versa). Analyses that are able to retrieve pre-computed TFBS stored in the database will generally complete more quickly.

Statistical Analysis

Z-score

The Z-score compares the frequency with which binding sites occur in the target set with the frequency with which they occur in the background. To allow comparison between transcription factors with differing binding profile widths, the calculation is "normalized" by comparing the frequency of the nucleotides which comprise the binding sites rather than just comparing the frequencies of the binding sites themselves.

Mathematically, the Z-score uses the normal approximation to the binomial distribution to compare the rate of occurrence of TFBS nucleotides in the target set of CAGE peak regions to the expected rate estimated from the background.

For a given TFBS, let the random variable X denote the number of predicted binding site nucleotides in the target set of CAGE peak regions. Let B be the number of predicted binding site nucleotides in the background set of CAGE peak regions.

Using a binomial model with n events, where n is the total number of nucleotides examined in the target CAGE peak regions and N is the total number of nucleotides examined in the background CAGE peak regions, the expected value of X is μ = B x C, where C = n / N (i.e. C is the ratio of sample sizes). Taking P = B / N as the probability of success, the standard deviation is given by:

σ = sqrt(n x P x (1 - P))

Now, let x be the observed number of binding site nucleotides in the target CAGE peak regions. By applying the Central Limit Theorem and using the normal approximation to the binomial distribution with a continuity correction, the z-score is calculated as:

z = (x - μ - 0.5) / σ
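
As a worked illustration, the formulas above translate directly into the short Python function below. The function name and the example counts are made up for this sketch; only the arithmetic follows the definitions of x, n, B, N, C, P, μ and σ given in this section.

```python
import math


def z_score(x, n, B, N):
    """Z-score as defined above.

    x: binding site nucleotides observed in the target regions
    n: total nucleotides examined in the target regions
    B: binding site nucleotides in the background regions
    N: total nucleotides examined in the background regions
    """
    C = n / N                            # ratio of sample sizes
    mu = B * C                           # expected value of X
    P = B / N                            # probability of success
    sigma = math.sqrt(n * P * (1 - P))   # binomial standard deviation
    if sigma == 0:
        raise ValueError("sigma is zero: the background has no usable hits")
    return (x - mu - 0.5) / sigma        # with continuity correction


# Hypothetical counts: 1% of background nucleotides fall in predicted sites,
# versus 1.5% of target nucleotides.
example_z = z_score(x=1500, n=100_000, B=10_000, N=1_000_000)
```

With these example counts μ = 1000 and σ = sqrt(990), so the observed 1500 binding site nucleotides lie well above the expectation and the resulting Z-score is large and positive.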


Fisher score

The Fisher score is based on the one-tailed Fisher exact probability. In contrast to the Z-score, for a given TF, the one-tailed Fisher exact probability compares the proportion of target CAGE peak regions containing at least one predicted binding site to the proportion of the background set that contains at least one predicted binding site to determine the probability of a non-random association between the CAGE peak regions and the TF of interest. It is calculated using the hypergeometric probability distribution, which describes sampling without replacement from a finite population consisting of two types of elements. Therefore, the number of times a TFBS occurs in a specific CAGE peak region is disregarded, and instead, the TFBS is considered as either present or absent. Fisher exact probabilities were calculated using the R Statistics package (http://www.r-project.org/). Negative natural logarithms of the probabilities are used as the Fisher scores.
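
The calculation can be illustrated with the Python standard library alone. This is a sketch and not the R code used by CAGEd-oPOSSUM: the function name is hypothetical, and the layout of the 2 x 2 table (target and background regions treated as two disjoint groups drawn from one combined population) is an assumption made for the example.

```python
import math


def fisher_score(target_hits, target_total, background_hits, background_total):
    """Negative natural log of the one-tailed (upper) Fisher exact probability.

    target_hits / background_hits: regions with at least one predicted site
    target_total / background_total: total regions in each set
    """
    population = target_total + background_total
    successes = target_hits + background_hits
    # Number of ways to choose the target set from the combined population.
    denominator = math.comb(population, target_total)
    # Sum the hypergeometric terms for target_hits or more hits in the target.
    upper = min(target_total, successes)
    tail = sum(
        math.comb(successes, k) * math.comb(population - successes, target_total - k)
        for k in range(target_hits, upper + 1)
    )
    # Work with logarithms of the exact integers so that very small
    # probabilities do not underflow to zero.
    return -(math.log(tail) - math.log(denominator))


# Hypothetical counts: 40 of 100 target regions and 200 of 1000 background
# regions contain at least one predicted site.
example_score = fisher_score(40, 100, 200, 1000)
```

Because only presence or absence is counted, a region with ten predicted sites contributes exactly as much as a region with one. A larger score corresponds to a smaller probability, so TFs are ranked from the highest Fisher score downwards.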

Understanding the Results

In general, the scores are used to rank the over-representation of a TF's putative binding sites from the most strongly over-represented to the least, to aid you in selecting potential TFs of interest. The scores depend on a number of factors, one of which is the number of CAGE peak regions analyzed; scores from different analyses should therefore be compared only when the number and length of regions used are similar. Another factor is your selection of background. If your background does not have a nucleotide composition similar to that of the target set of CAGE peak regions, you risk biasing your analysis toward a subset of TF motifs. A simple way to detect whether a bias has occurred is to plot the Z-/Fisher score (y-axis) against the TF profile's GC content (x-axis). The system will automatically generate these plots for you. If you see a skew (an obvious slope to the distribution of scores in the plot) such that the TF matrices with a high (or low) GC content all have the highest ranking scores, then you may need to go back and choose a different background set.
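
If you download the results and want a rough numerical check to complement the plots, the sketch below computes the GC content of a profile and the Pearson correlation between GC content and score. Neither function is part of CAGEd-oPOSSUM; the names, the dictionary layout of the PFM and the use of a correlation coefficient as a stand-in for visually judging the slope are all assumptions of this example.

```python
import math


def gc_content(pfm):
    """Fraction of all counts in a PFM that fall on C or G."""
    gc = sum(pfm["C"]) + sum(pfm["G"])
    total = sum(sum(pfm[base]) for base in "ACGT")
    return gc / total


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up GC contents and Z-scores for three profiles, for illustration only.
profile_gc = [0.2, 0.4, 0.6]
profile_z = [1.0, 2.0, 3.0]
skew = pearson(profile_gc, profile_z)
```

A correlation close to +1 or -1 would correspond to the obvious slope described above, while a value near 0 suggests that the scores are not simply tracking GC content. This check does not replace inspecting the plots themselves.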

There is no specific threshold that can be recommended for any one data set; however within the results of an analysis you may be able to select a group of interesting TFs by the relative value of the scores. In general a good overview of your results can be obtained by the graphical plots of Z- or Fisher score versus the %GC composition of the TFBS profiles. In our experience, a clear segregation of scores is the most reliable indication of functional relevance.

Detailed Description of Input Options and Results

For a much more detailed description of the input options and results please see the Detailed Description of Input Options and Results page.

Download Software

To download the CAGEd-oPOSSUM software and data please see the Download page.

FAQ

For answers to frequently asked questions, please see the FAQ page.

Citing CAGEd-oPOSSUM

If you use CAGEd-oPOSSUM in your work, please cite:

Arenillas DJ, Forrest AR, Kawaji H, Lassmann T; FANTOM Consortium, Wasserman WW, Mathelier A. CAGEd-oPOSSUM: motif enrichment analysis from CAGE-derived TSSs. Bioinformatics. 2016 Jun 9. PMID: 27334471.

If as part of your analysis you also used the HOMER software, either to generate random backgrounds or to perform HOMER motif analysis, please also cite HOMER:
Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432