Chapter 2 Workflow Description

The GRAVI workflow performs multiple steps, most of which depend on those conducted previously. The workflow management software snakemake(Mölder et al. 2021) is used to run the complete workflow, giving the capacity to be run on local servers or HPC systems.

The snakemake DAG (directed acyclic graph) of the workflow is always included in the compiled document, however a simplified version is presented below.

Figure 2.1: Simplified DAG of the GRAVI workflow. Key steps are numbered and shown in black. Inputs are shown in blue, whilst key outputs are shown in coral.

The workflow produces a series of HTML reports as a larger webpage, inspired by the excellent workflowr(Blischak, Carbonetto, and Stephens 2019) package but instead relying directly on rmarkdown::render_site()(Allaire et al. 2022). The compiled html pages will be placed in the docs directory, with all source rmarkdown files being placed in the analysis directory. Key additional outputs (e.g. bed/csv files) are placed in the output directory. The standardised directory layout is shown in Section 4.1. The complete R environment from every compiled HTML page is also saved in output/envs, however these are marked as temporary files by the workflow and will be deleted when removing temp output (snakemake --delete-temp-output --cores 1)

2.1 Annotation Setup

Genomic Regions

Defining genomic regions is a key part of the setup for analysis. Under the GRAVI workflow, a series of non-overlapping genomic regions are defined which characterise the most likely role/aspect for each specific region. As these will be unique to each Gencode build, and also include any optional RNA-Seq data, this step is performed for every experiment. This step follows the default process as defined in the function defineRegions() from the Bioconductor package extraChIPs

Promoters: By default these are defined as -1500/+500bp from every distinct TSS. If promoters from two or more transcripts overlap, they are merged into a single promoter. These ranges can be easily changed to increase or decrease the size using the YAML as described in Section 5.3
Upstream Promoters: These are extended promoter regions up to 5kb upstream by default, which again can be changed to suit
Exons are defined as any exon not overlapping a promoter or upstream promoter
Introns are defined as any transcribed sequence not overlapping a promoter, upstream promoter or exon
Intergenic Regions: are divided into two subsets, neither of which are permitted to overlap any previously defined regions
- Within 10kb of a gene. Again this distance is customisable
- Beyond 10kb of a gene.

During the characterisation of all the above regions, mappings to associated genes and transcripts is retained and included in the subsequent GenomicRanges object.

Other Steps

Additional steps performed during preparation of the annotations are to ensure that all external features, treatment groups and defined genomic regions have colours assigned, which will then propagate through the workflow for consistency of visualisation. External features and/or HiC data is summarised if provided and the association between these datasets and defined genomic regions is also provided as part of the output.

Greylisted regions are also determined and prepared for exclusion through the workflow, along with the provided blacklisted regions.

2.2 Peak Calling

This section of the workflow uses macs2 callpeak(Zhang et al. 2008) with default parameters. Peaks are called on each individual sample and by merging samples within treatment groups for each provided ChIP target. One summary report will be produced each ChIP target specified in the target column of samples.tsv, which will assess all treatments within a ChIP target.

Quality Assessment

Basic QC statistics such as Library Size, Read Lengths, Total Detected Peaks and Fraction Of Reads In Peaks (FRIP) are provided in tabular and visual form. Plots showing sample-specific GC content and Cross Correlations are also provided to enable visual identification of any outlier samples which can be excluded manually, or handled in any other suitable manner.

One feature of the GRAVI workflow is to compare between replicates within a treatment group to determine any low-quality samples. Taking the median number of peaks within a sample group (\(np_i\)), and sample with \(>N\)-fold or \(<N\)-fold \(np_i\) peaks will be marked as an outlier. By default, this is set to 10-fold using the parameter outlier_threshold (Section 5.1) and can be disabled by setting this parameter to .inf, which will be passed to R as the value \(\infty\). Given that some ChIP targets may be cytoplasmic and yield zero peaks in some treatment groups, samples with zero peaks can also be allowed by setting allow_zero to be true.

Results

Individual Replicates: UpSet plots(Conway, Lex, and Gehlenborg 2017) of all samples are first produced to enable identification of any possible mislabelling. Venn Diagrams are then produced for all replicates within each treatment groups (if \(n \leq3\)), or UpSet plots are produced for \(n \geq4\) replicates to show consistency between replicates.

Treatment Peaks: During this step of the workflow a set of treatment-specific peaks will be produced for each ChIP target. The set of peaks obtained by merging samples will be compared to all individual samples, and only those peaks identified in at least \(100*p\%\) of the samples which passed QC will be included in the set of Treatment peaks. This parameter can be set in config.yml as the parameter min_prop_reps (Section 5.1) with the default of 0.5 indicating that peaks must be detected in 1 or more samples (if N = 2), two or more samples (if N = 3, 4) etc.

Union Peaks: After defining the treatment-specific peaks, these are combined to define the set of union peaks. Treatment peaks are merged across conditions in an inclusive manner, such that the range of each union peak encompasses the entire region covered by all overlapping treatment peaks. Union peaks are then used as the ‘universe’ of peaks against which treatment-specific behaviours are compared.

Peaks overlapping a blacklisted or greylisted region are excluded at all steps of the analysis.

The final HTML report produced for each target will contain:

A Venn Diagram showing the overlap in Union peaks and whether one or more Treatment peaks overlaps the Union peak. In the case of 4 or more treatments, an UpSet plot will be produced instead
Distance to TSS plots as actual and cumulative distributions using union peaks.
A pie chart describing the distribution of union peaks within the genomic regions defined previously

Additional plots will be produced in the presence of external data

The distribution of union peaks within external features will be shown if external features are provided
The distribution of union peaks within external features and genomic regions will be shown if external features are provided

Highly-ranked union and treatment-specific Treatment peaks are also shown in their genomic context alongside regulatory/external features and nearby gene models.

Key Outputs

Union peaks will be exported as output/macs2/<target>/<target>_union_peaks.bed
Treatment peaks will be exported as individual bed files for each treatment group (output/macs2/<Ctarget>/<target>_<treat>_treatment_peaks.bed), as well as combined in a single RDS object for faster parsing into an R environment as output/macs2/<target>/<target>_treatment_peaks.rds
BigWig files will be generated from the bedgraph files output by mcas2 callpeak using the --SPMR tag, for each individual sample and for each set of peaks called by merging samples from the same group
Additional BigWig files will be produced from the bedgraph files output by macs2 bdgcmp, representing fold-enrichment over the input sample for each set of grouped samples

2.3 Differential Signal

This step provides much of the uniqueness to the GRAVI workflow combining approaches from macs2(Zhang et al. 2008), qsmooth(Hicks et al. 2018), csaw(Aaron T. L. Lun and Smyth 2014), limma(Ritchie et al. 2015), edgeR(Robinson, McCarthy, and Smyth 2010) and ihw(Ignatiadis et al. 2016) whilst relying heavily on the infrastructure provided by extraChIPs. A sliding window approach as advocated by Lun & Smyth (2014) is the primary strategy used for detection of differetial signal.

All two-factor comparisons of interest can specified via the YAML file (Section 5.1) with more complicated layouts specified as below

contrasts:
  - ["control", "treat1"]
  - ["control", "treat2"]

Differential signal analysis will be performed for every ChIP target where both treatment groups are found from each specified contrast. Two possible approaches are available, with these being 1) sq-lt (Smooth-Quantile limma-trend) or 2) ls-ql (Library-Size Quasi-Likelihood).

Figure 2.2: Overview of steps in the differential signal workflow. Primary R packages used for each step are indicated in brackets. Integration with differential expression results (RNA-Seq) is an option step only performed if RNA-Seq data is provided.

Sliding Windows

By default, sliding windows will be defined based on the estimated fragment length so that the window size is slightly wider than the fragment length (\(w_\text{size}\)), and the step size is \(w_\text{step} = w_\text{size} / 3\). This usually leads to window sizes which are multiples of 30nt, e.g. 90, 120, 150, 180, 210, 240 etc.

Alignments are initially counted across autosomes and sex chromosomes, explicitly excluding scaffolds and mitochondrial alignments, as well as regions contained in black/greylists. Windows are discarded if the total number of alignments across all samples are \(<n\), where \(n\) represents the total number of samples in the current analysis. Windows which are more likely to contain noise than true signal are then discarded using extraChIPs::dualFilter() in combination with the Union peaks from Section 2.2. The set of Union peaks is used as a guide for setting the inclusion/exclusion thresholds which are based on 1) Overall signal intensity and 2) Enrichment over input. Thresholds for each measure are defined such that the proportion \(q\) of windows which overlap a Union peak are returned, with windows which pass both inclusion thresholds are returned. This parameter itself is set in config.yml as filter_q as described in Section 5.1. In general, values in the range \(0.4 < q < 0.6\) perform well as the sliding windows around the margins of the Union peaks will be discarded at this point. Higher values will lead to large numbers of windows being retained which overlap peak margins and are relatively uninformative.

Normalisation

Under the sq-lt method, logCPM values are first calculated using the total library size for each sample. These values are then normalised using Smooth Quantile Normalisation(Hicks et al. 2018) specifiying treatment groups as the factor of interest and allowing for normalisation within and between groups. This can be particularly useful in the scenario where a transcription factor (TF) is primarily cytoplasmic in one treatment group, before shifting to the nucleus in response to treatment. Plots showing qsmooth weights, pre/post logCPM, pre/post RLE (Gandolfo and Speed 2018) and pre/post PCA are produced in this section of the workflow.

Alternatively, under the ls-ql methodology, only the total library size is provided to the GLM fitting stage with no further normalisation strategies being employed. This is analogous to the default edgeR option provided using DiffBind (Stark and Brown 2011).

Hypothesis Testing

Given the normalised logCPM values being analysed under the sq-lt approach, an appropriate strategy for detection of differential signal is the limma-trend method (Law et al. 2014) as established for RNA-Seq analysis. Under the ls-ql approach, raw counts are analysed using the Quasi-Likelihood approach for Negative Binomial data (Aaron T. L. Lun, Chen, and Smyth 2016), again as developed for RNA-Seq data, and as offered when using DiffBind (Stark and Brown 2011).

Both strategies are performed on all retained windows using a range-based Null Hypothesis (McCarthy and Smyth 2009), such that a minimum change can be specified, below which changes in signal are not considered to be of interest. Under this statistical approach, the Null Hypothesis would be

\[ H_0: -\lambda \leq \mu \leq \lambda \]

with the alternate hypothesis being

\[ H_A: |\mu| > \lambda \]

By default, the value \(\lambda = \log_2(1.2)\) is used, which denotes a 20% change in signal as the point where sites become of interest. This value is set in config.yml as the parameter fc (Default: fc = 1.2), and setting fc = 1 would return the statistical approach to be a point-based Null Hypothesis \(H_0: \mu = 0\).

After performing this statistical test, overlapping windows are merged taking the individual window with the highest signal as the representative window for the merged region. As this value is independent of the statistical test(Aaron T. L. Lun and Smyth 2014), the resultant set of p-values are FDR-adjusted using the Benajmini-Hochberg approach(Benjamini and Hochberg 1995), giving a set of FDR-adjusted p-values for all merged regions.

Independent Hypothesis Weighting

The Independent Hypothesis Weighting approach(Ignatiadis et al. 2016) suggests the a set of p-values can be partitioned by any independent variable, with weights assigned to each partition, and these weighted p-values can then be adjusted using conventional strategies, such as the FDR. Under the GRAVI workflow, four possibilities for partitioning the p-values obtained after merging regions are provided, and these can be specified in the ihw parameter of config.yml. The options are

ihw: "targets" where union peaks from all other ChIP targets included in the larger GRAVI workflow are used to partition p-values. As union peaks are treatment-agnostic, these simply provide a scaffold defining the presence/absence of all other ChIP targets under any treatment condition, in combination across all ChIP targets.
ihw: "regions" where previously defined genomic regions are used to partition the p-values
ihw: "features" where any external features supplied are used for the partitioning
ihw: "none" where no partitioning is performed and the standard FDR-adjusted p-values are used to determine differential signal status

The default approach implemented by the IHW authors suggests that p-value partitions must be > 1000, and as such, all the above approaches collapse smaller groups until the smallest group contains at least 1000 p-values (i.e. merged regions). In the case of ChIP targets which do not bind in a promiscuous manner, the IHW step may make minimal difference to the results.

Assigning Genes To Windows

After merging of neighbouring sliding windows, genes are assigned to each merged region. This is performed using annotated genomic regions, any external features and HiC data, via the function extraChIPs::mapByFeatures().

Under this approach:

Regions which overlap a promoter are assigned to the genes associated directly with that promoter
Regions which overlap an enhancer are assigned to all genes within 100kb
Regions which overlap any HiC interactions are assigned to all genes connected by the interactions
Regions with no assignment from steps 1-3 are assigned to all directly overlapping genes, or the nearest gene within 100kb

If no HiC data is included, step 3 is not performed

Presentation of Key Results

The above steps essentially complete the detection of differential signal regions and assignment of these to regulatory target genes. A series of visualisations are then provided including:

MA & Volcano plots
Profile Heatmaps for sites with Increased/Decreased target signal
Results summarised by chromosome
Signal/logFC/differential signal partitioned by genomic region

If external features are provided, the plots from step 4 are replicated using external features to partition results. A combined summary of the results by genomic region and external features is also included.

A series of summary tables, along with the 200 most highly ranked regions are included. Results of differential signal windows and the genes assigned to them are also exported as a CSV for sharing with collaborators. Genomic plots for unique regions based on signal strength (logCPM), changed signal (logFC) and statistical support (FDR) are also provided as part of the default output.

Enrichment Testing

Using the gene-sets and pathways specified in params.yml four enrichment analyses are performed using goseq(Young et al. 2010). These are

Comparison of genes mapped to a site with target signal and genes which are not mapped to a region with target signal
Comparison of genes mapped to a differential signal site against genes mapped to a site with detected signal, but which is considered unchanged
Comparison of genes mapped to a site with increased signal against genes mapped to a site, but with no differential signal
Comparison of genes mapped to a site with decreased signal against genes mapped to a site, but with no differential signal

Gene-width is used as an offset for biased sampling in the analysis of bound target only, as longer genes are more likely to have a peak mapped to them. For all other analyses, no offset term is included and the standard Hypergeometric test is performed.

All results are presented as searchable, interactive HTML tables including which genes from each pathway are mapped to the relevant set of regions from which signal was detected. Where enough pathways are detected as significantly enriched, network plots are also produced deriving distances between nodes based on shared, differential-signal genes using the overlap coefficient.

RNA-Seq Data

If results from a differential expression (DE) analysis are included (which can also be microarray results), a series of additional analyses are performed. Firstly, the relationship between detected genes and the genomic regions where target signal has been detected are characterised. If external features are also provided, this is repeated incorporating these features.

P-values from DE analysis are partitioned by differential signal status. No further IHW is performed, but this can still provide a clear visual clue as to the relationship between target signal and differential gene expression.

Genomic regions for the 5 genes most highly ranked for differential expression, and with \(>1\) target-bound window assigned to them are plotted.

Given the unpredictability of an RNA-Seq dataset and the number of genes considered to be differentially expressed, bound and differential signal windows are used as gene sets to perform GSEA(Korotkevich, Sukhov, and Sergushichev 2019) using sites directly, and sites partitioned by genomic region and external feature (if provided). GSEA is performed 1) taking direction of differential expression into account, 2) ranking genes purely by significance (i.e. without direction).

Finally, standalone GSEA results from the RNA-seq dataset (incorporating direction of change and all requested MSigDB pathways) are compared to enrichment (i.e. goseq) results for differential signal regions. As these are essentially orthogonal viewpoints on gene regulation, P-values from each analysis are combined using Wilkinson’s maximump() method and the resultant set of p-values adjusted using the specified method and requested threshold for \(\alpha\) Any common pathways are given in an interactive HTML table along with network plots based on shared regulatory targets also found in the leading edge fro GSEA.

Key Outputs

Key output files produced by this step of the workflow are:

Differential signal results (by window): output/differential_binding/<target>/<target>_<control>_<treat>_differential_binding.rds
Differential signal results (by gene): output/differential_binding/<target>/<target>_<control>_<treat>_differential_binding.csv.gz
Regions with increased target signal: output/differential_binding/<target>/<target>_<control>_<treat>_up.bed
Regions with decreased target signal: output/differential_binding/<target>/<target>_<control>_<treat>_down.bed

2.4 Pairwise Comparisons

By default, pair-wise comparisons are performed between all differential signal results. If ChIP target TF1 has samples in Control and Treat1, whilst ChIP target TF2 has samples in Control, Treat1 and Treat2, with comparisons being requested for Treat1 vs. Control and Treat2 Vs Control, the following pair-wise comparisons will be automatically performed:

TF1 (Treat1 Vs Control) and TF2 (Treat1 Vs Control)
TF1 (Treat1 Vs Control) and TF2 (Treat2 Vs Control)
TF2 (Treat1 Vs Control) and TF2 (Treat2 Vs Control)

Given the automated nature of the workflow, this represents the simplest approach and any redundant comparisons are simply able to be ignored by the user.

Comparison of Peaks

The pair-wise comparisons module initially compares union peaks between the two targets as a Venn Diagram, with the sets of (up to) four treatment-specific peaks being compared using an UpSet plot.

Comparison of Differentially Bound Windows

Pair-wise comparison of two ChIP targets requires more nuance than simply looking for sites where both are changed. A universal set of windows is first obtained across all windows retained in both targets. These are then classified as either Up, Down, Unchanged or Undetected for each ChIP target. Using both targets, the universal windows are then classified based on both targets. This is particularly important given the hypothesised role of ChIP targets which can act as pioneer factors(Zaret and Carroll 2011), or those which bind in complex and sequester other factors(Hickey et al. 2021).

The classification of each window is initially based on significant FDR-adjusted p-value using the range-based H₀ in at least one comparison. In order to ensure more accurate assignment of windows in the secondary comparison, the FDR-adjusted p-values using a point-based H₀ are used, in conjunction with an estimated logFC beyond the range \(\pm\lambda\), i.e. \(|\widehat{\text{logFC}}| > \lambda\). This reduces the number of regions incorrectly classified as unchanged in one comparison due to the use of the range-based H₀, as is common for all situations where threshold are applied.

The combined behaviours of both ChIP targets is then compared directly, described by genomic region and external features (if provided). Distances between the windows with representative statistics (i.e. maximal signal) are determined where both targets are present. The combined changes in signal are then compared as a complete set for all windows where both targets are detected, as well as broken down by genomic region and external features. Distances between the windows where the maximal signal was observed are also presented

Combined Visualisations

Profile heatmaps for all behavioural groups which are detected (e.g. TF1 Up - TF2 Up, TF1 Up - TF2 - Unchanged etc) are all presented.

The genomic windows for which the two factors are present are visualised based on the combined strongest signal, and the combined largest change. Using only combinations of Up/Down/Unchanged, six windows for each are presented. As with all genomic visualisations, any coverage provided as an external track (e.g H3K27ac or ATAC-seq signal) will be added to all plots.

Enrichment Analysis

The same gene-sets from MSigDB(Liberzon et al. 2015) as used previously are then used for enrichment testing, using genes as mapped to windows during previous steps. Enrichment testing again uses goseq with gene length as the biased-sampling term. Enrichment is performed at multiple levels:

Genes Mapped To All Windows
1. Genes mapped in either comparison
2. Genes mapped in comparison 1 but not the second comparison
3. Genes mapped in comparison 2 but not the first comparison
4. Genes mapped in both comparisons
Genes Mapped to Differential Signal Windows
- All pair-wise combinations of Up/Down/Unchanged/Undetected are tested across both factors. If no enrichment is found, results are not presented.

Where enrichment for pathways are found, network plots are again generated

Integration With RNA-Seq Results

The set of genes mapped to each pair-wise combination of Up/Down/Unchanged/Undetected are then compared to the external DE results using GSEA incorporating direction of change, and overall significance. Differential expression is displayed as a multi-panel volcano plot, separating genes by detected signal patterns. Along with barcode plots for the 9 most highly-ranked groups, genes in the Leading Edge for all significant combined groups are also provided in the results table, along with network plots. Significantly enriched pathways are again determined by combining p-values from GSEA performed on RNA-Seq and enrichment testing performed sites returning signal, using Wilkinson’s method.

2.4.1 Key Outputs

Key output files produced by this step of the workflow include

Pairwise differential signal patterns across both comparisons (output/pairwise_comparisons/<target1>_<target2>/<target1>_<comparison1>-<target2>_<comparison2>-pairwise_comparison.csv.gz)
The above file as an RDS object for easier parsing into an R environment.

If RNA-Seq data is provided, the file output/pairwise_comparisons/<target1>_<target2>/<target1>_<comparison1>-<target2>_<comparison2>-de_genes.csv will contain mappings between DE genes and combined ChIP target differential signal patterns. All lists of enriched pathways will also be exported as csv files.

References

Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2022. Rmarkdown: Dynamic Documents for r. https://CRAN.R-project.org/package=rmarkdown.

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological) 57 (1): 289–300. http://www.jstor.org/stable/2346101.

Blischak, John D, Peter Carbonetto, and Matthew Stephens. 2019. “Creating and Sharing Reproducible Research Code the Workflowr Way [Version 1; Peer Review: 3 Approved].” F1000Research 8 (1749). https://doi.org/10.12688/f1000research.20843.1.

Conway, Jake R., Alexander Lex, and Nils Gehlenborg. 2017. “UpSetR: An r Package for the Visualization of Intersecting Sets and Their Properties.” Bioinformatics 33 (18): 2938–40. https://doi.org/10.1093/bioinformatics/btx364.

Gandolfo, L. C., and T. P. Speed. 2018. “RLE plots: Visualizing unwanted variation in high dimensional data.” PLoS One 13 (2): e0191629.

Hickey, T. E., L. A. Selth, K. M. Chia, G. Laven-Law, H. H. Milioli, D. Roden, S. Jindal, et al. 2021. “The androgen receptor is a tumor suppressor in estrogen receptor-positive breast cancer.” Nat Med 27 (2): 310–20.

Hicks, Stephanie C., Kwame Okrah, Joseph N. Paulson, John Quackenbush, Rafael A. Irizarry, and Hector Corrado Bravo. 2018. “Smooth Quantile Normalization.” Biostatistics 19 (2).

Ignatiadis, N., B. Klaus, J. B. Zaugg, and W. Huber. 2016. “Data-driven hypothesis weighting increases detection power in genome-scale multiple testing.” Nat Methods 13 (7): 577–80.

Korotkevich, Gennady, Vladimir Sukhov, and Alexey Sergushichev. 2019. “Fast Gene Set Enrichment Analysis.” bioRxiv. https://doi.org/10.1101/060012.

Law, C. W., Y. Chen, W. Shi, and G. K. Smyth. 2014. “voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biol 15 (2): R29.

Liberzon, A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, J. P. Mesirov, and P. Tamayo. 2015. “The Molecular Signatures Database (MSigDB) hallmark gene set collection.” Cell Syst 1 (6): 417–25.

Lun, Aaron T L, and Gordon K Smyth. 2014. “De Novo Detection of Differentially Bound Regions for ChIP-Seq Data Using Peaks and Windows: Controlling Error Rates Correctly.” Nucleic Acids Res. 42 (11): e95.

Lun, Aaron T. L., Yunshun Chen, and Gordon K. Smyth. 2016. “It’s DE-Licious: A Recipe for Differential Expression Analyses of RNA-Seq Experiments Using Quasi-Likelihood Methods in edgeR.” In Statistical Genomics: Methods and Protocols, edited by Ewy Mathé and Sean Davis, 391–416. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4939-3578-9_19.

McCarthy, Davis J., and Gordon K. Smyth. 2009. “Testing significance relative to a fold-change threshold is a TREAT.” Bioinformatics 25 (6): 765–71. https://doi.org/10.1093/bioinformatics/btp053.

Mölder, F., K. P. Jablonski, B. Letcher, M. B. Hall, C. H. Tomkins-Tinch, V. Sochat, J. Forster, et al. 2021. “Sustainable data analysis with Snakemake.” F1000Res 10: 33.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Stark, Rory, and Gordon Brown. 2011. DiffBind: Differential Binding Analysis of ChIP-Seq Peak Data. http://bioconductor.org/packages/release/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf.

Young, Matthew D, Matthew J Wakefield, Gordon K Smyth, and Alicia Oshlack. 2010. “Gene Ontology Analysis for RNA-Seq: Accounting for Selection Bias.” Genome Biology 11: R14.

Zaret, K. S., and J. S. Carroll. 2011. “Pioneer transcription factors: establishing competence for gene expression.” Genes Dev 25 (21): 2227–41.

Zhang, Y., T. Liu, C. A. Meyer, J. Eeckhoute, D. S. Johnson, B. E. Bernstein, C. Nusbaum, et al. 2008. “Model-based analysis of ChIP-Seq (MACS).” Genome Biol 9 (9): R137.