Chapter 4 Files and Directories

4.1 Directory Structure

The GRAVI workflow requires a fixed directory structure. If you use the template repository, as advised, most of this will already be in place. The required directory structure is:

project_home/
 ├── analysis
 ├── config
 ├── data
 ├── docs
 ├── output
 └── workflow
  • RMarkdown scripts will be added to and executed from the analysis directory
  • Key configuration files are provided in the config directory
  • Your data should be placed in the data directory as described below
  • The html output summarising all results will be produced in the docs directory
  • Additional output files will be placed in output, with all figures written to the respective folder for each html page, or to docs/assets
  • The workflow itself is run by all code supplied in the workflow directory
  • The complete R environment for each compiled RMarkdown document is also saved in output/envs. As these files can be large, they can be deleted using snakemake --delete-temp-output --cores 1 to conserve storage space.
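
As a quick sanity check before running the workflow, the layout above can be verified with a few lines of Python. This is a minimal sketch and not part of GRAVI itself; the function name missing_dirs is hypothetical.

```python
from pathlib import Path

# Top-level directories listed in the required layout
REQUIRED_DIRS = ["analysis", "config", "data", "docs", "output", "workflow"]


def missing_dirs(project_home):
    """Return the required directories that are absent from project_home."""
    root = Path(project_home)
    return [d for d in REQUIRED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Run from inside project_home
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All required directories are present")
```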

4.2 Alignments

The GRAVI workflow currently takes bam files as the primary input. Multiple workflows exist for quality control, adapter removal and de-duplication, so it is assumed that supplied reads have already been pre-processed with these steps and then aligned to the genome of interest.

Files should be placed in the data/bam directory as set in config.yml, although this can be changed if desired.

project_home/
 └── data
      └── bam 
           ├── target1_control_rep1.bam
           ├── target1_control_rep2.bam
           ├── target1_control_rep3.bam
           ├── target1_treat_rep1.bam
           ├── target1_treat_rep2.bam
           ├── target1_treat_rep3.bam
           ├── target2_control_rep1.bam
           ├── target2_control_rep2.bam
           ├── target2_control_rep3.bam
           ├── target2_treat_rep1.bam
           ├── target2_treat_rep2.bam
           ├── target2_treat_rep3.bam
           └── input1.bam 

4.3 Sample Descriptions

The file samples.tsv defines the set of files to which the workflow will be applied. Any files placed in the data/bam directory but not specified in this file will be ignored. The file must be tab-delimited (i.e. tsv) and can be generated using Excel, Notepad++, R, Visual Studio, or any other software you are comfortable with. A brief example would follow the layout:

sample                target   treat    replicate  input
target1_control_rep1  Target1  Control  1          input1
target1_control_rep2  Target1  Control  2          input1
target1_control_rep3  Target1  Control  3          input1
target1_treat_rep1    Target1  Treat1   1          input1
target1_treat_rep2    Target1  Treat1   2          input1
target1_treat_rep3    Target1  Treat1   3          input1
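
If you would rather generate this file from a script than edit it by hand, the example layout can be written with Python's csv module. This is a sketch only: the helper names write_samples and example_rows are hypothetical, and the output path is left to the caller.

```python
import csv

# Column order used in the example layout above
COLUMNS = ["sample", "target", "treat", "replicate", "input"]


def example_rows():
    """Build the six rows shown in the example layout."""
    rows = []
    for prefix, treat in [("control", "Control"), ("treat", "Treat1")]:
        for rep in (1, 2, 3):
            rows.append({
                "sample": f"target1_{prefix}_rep{rep}",
                "target": "Target1",
                "treat": treat,
                "replicate": rep,
                "input": "input1",
            })
    return rows


def write_samples(path, rows):
    """Write rows as a tab-delimited file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
```

Calling write_samples("samples.tsv", example_rows()) produces a file with genuine tab characters between columns, which is easy to get wrong when typing the table manually.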

4.3.1 Required columns

This file must contain all four of the columns sample, target, treat and input, in any order. If supplied, optional columns such as replicate or passage can be referenced in the workflow. As well as defining all required steps for the workflow, these columns are combined to generate labels for plots.

  • sample: This must be identical to the filename, but without the .bam extension.
  • treat: This is used to define all comparisons.
  • input: Each value must correspond to a file in data/bam, given without the .bam suffix. Each sample can have a separate input sample, or input samples can be shared across all or some of the samples.
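
The requirements above can be checked before launching the workflow. The following is a minimal sketch, not the validation performed by GRAVI itself: the function name check_samples is hypothetical, and both paths are passed in by the caller because the bam directory can be changed in config.yml.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"sample", "target", "treat", "input"}


def check_samples(samples_tsv, bam_dir):
    """Return a list of problems found in the sample sheet (empty if none)."""
    problems = []
    with open(samples_tsv, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing column(s): {', '.join(sorted(missing))}"]
        checked = set()  # avoid reporting a shared input more than once
        for row in reader:
            for column in ("sample", "input"):
                name = row[column]
                if name in checked:
                    continue
                checked.add(name)
                # Both columns hold the file name without the .bam extension
                bam = Path(bam_dir) / f"{name}.bam"
                if not bam.is_file():
                    problems.append(f"{column} '{name}': {bam} not found")
    return problems
```

An empty list means every sample and input named in the sheet has a matching bam file. Files present in the directory but absent from the sheet are not reported, since the workflow ignores them.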

4.3.2 Optional Columns

Any additional columns can be used to denote batches, or passages if running a nested/paired model. These column names will be automatically detected at the appropriate steps of the workflow and incorporated into figures and tables. Common column names include replicate and passage (for cell lines).

4.4 Additional Files

Additional, optional files can also be supplied, and it is customary to place these in data/external with paths (relative to project_home) added to config.yml. Full details are given in Section 5.1. File names can be any informative name chosen by the user.

project_home/
 └── data
      ├── bam 
      └── external
           ├── gencode_annotation.gtf.gz
           ├── blacklist.bed.gz
           ├── rnaseq_topTable.tsv
           ├── external_features.gtf.gz
           ├── hic_interactions.bedpe
           ├── additional_coverage_control.bw
           └── additional_coverage_treat.bw

4.4.1 RNA-Seq

Files containing differential expression results from a relevant RNA-Seq experiment should follow the layout produced by topTable() from the limma package (Ritchie et al. 2015). Gene IDs should match those in the Gencode GTF (Ensembl IDs) and should be contained in a column called gene_id. Additional expected columns are logFC and FDR, or similar names which can reasonably be found by regex matching within the workflow.
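
To illustrate what regex matching of column names might look like, the sketch below searches a header for the three expected columns. The patterns are assumptions chosen for illustration and the function name match_columns is hypothetical; the expressions used inside the workflow may differ.

```python
import re

# Hypothetical patterns; the workflow's own expressions may differ
PATTERNS = {
    "logFC": re.compile(r"log2?.?fc|log.?fold", re.IGNORECASE),
    "FDR": re.compile(r"fdr|adj.?p|padj|q.?val", re.IGNORECASE),
}


def match_columns(header):
    """Map each expected column to the first matching name in header, or None."""
    found = {"gene_id": "gene_id" if "gene_id" in header else None}
    for key, pattern in PATTERNS.items():
        found[key] = next((col for col in header if pattern.search(col)), None)
    return found


# Column names as produced by limma's topTable(), plus gene_id
header = ["gene_id", "logFC", "AveExpr", "t", "P.Value", "adj.P.Val", "B"]
print(match_columns(header))
```

Any entry returned as None indicates a column that could not be identified, in which case renaming it to gene_id, logFC or FDR is the safest option.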

4.4.2 External Features

These must be provided as a GTF, which can be prepared by any method. The feature types should be defined in a field named feature. Non-overlapping features are optimal but not essential, and this is left to the user's discretion. For example, if providing features such as enhancers and super-enhancers (Whyte et al. 2013), it may be more sensible to provide these as mutually exclusive groups.
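
If you want to check whether a set of features is mutually exclusive before writing the GTF, a simple sweep over sorted intervals is enough. This is an illustrative sketch: the function name find_overlaps and the tuple layout (chrom, start, end, feature) are assumptions, with coordinates treated as 1-based and inclusive, as in GTF.

```python
def find_overlaps(intervals):
    """Return (earlier, later) pairs where later overlaps an earlier interval.

    intervals: iterable of (chrom, start, end, feature) tuples using
    1-based inclusive coordinates.
    """
    overlaps = []
    widest = None  # interval reaching furthest right on the current chromosome
    for current in sorted(intervals):
        chrom, start, end, _ = current
        if widest is not None and widest[0] == chrom and start <= widest[2]:
            overlaps.append((widest, current))
        if widest is None or widest[0] != chrom or end > widest[2]:
            widest = current
    return overlaps


features = [
    ("chr1", 100, 200, "enhancer"),
    ("chr1", 150, 300, "super_enhancer"),
    ("chr1", 301, 400, "enhancer"),
    ("chr2", 100, 200, "enhancer"),
]
for earlier, later in find_overlaps(features):
    print(earlier, "overlaps", later)
```

Each interval is reported at most once, paired with the earlier interval that extends furthest to the right, so the output flags where overlaps occur rather than listing every overlapping pair.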

4.4.3 HiC Interactions

Significant interactions can be sourced using any methodology; however, these must be provided in bedpe format.
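
For reference, the bedpe format as defined by bedtools stores each interaction on one tab-delimited line, with the first six fields giving chrom1, start1, end1, chrom2, start2 and end2. The sketch below parses those six fields; the function name parse_bedpe_line is hypothetical, and whether the workflow uses any of the later optional fields is not covered here.

```python
def parse_bedpe_line(line):
    """Parse the two anchors from one bedpe line.

    Returns ((chrom1, start1, end1), (chrom2, start2, end2)).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 tab-delimited fields, got {len(fields)}")
    chrom1, start1, end1, chrom2, start2, end2 = fields[:6]
    return (chrom1, int(start1), int(end1)), (chrom2, int(start2), int(end2))


anchor1, anchor2 = parse_bedpe_line("chr1\t1000\t2000\tchr1\t50000\t51000\n")
print(anchor1, anchor2)
```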

4.4.4 External Coverage

Additional coverage files should be provided in bigwig format.

References

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Whyte, W. A., D. A. Orlando, D. Hnisz, B. J. Abraham, C. Y. Lin, M. H. Kagey, P. B. Rahl, T. I. Lee, and R. A. Young. 2013. “Master transcription factors and mediator establish super-enhancers at key cell identity genes.” Cell 153 (2): 307–19.