Chapter 4 Files and Directories
4.1 Directory Structure
The GRAVI workflow requires a set directory structure. If using the template repository, as advised, this will be mostly taken care of. The required directory structure is
project_home/
├── analysis
├── config
├── data
├── docs
├── output
└── workflow
- Rmarkdown scripts will be added to and executed from the analysis directory
- Key configuration files are provided in the config directory
- Your data should be placed in the data directory as described below
- The html output summarising all results will be produced in the docs directory
- Additional output files will be placed in output, with all figures written within the respective folder for each html page, or within docs/assets
- The workflow itself is run by all code supplied in the workflow directory
- The complete R Environment for each compiled RMarkdown document is also saved in output/envs. Given their larger sizes, these can be deleted using snakemake --delete-temp-output --cores 1 to conserve storage space.
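If the template repository was not used, the layout above can be checked before running the workflow. The following is a minimal sketch in Python, not part of GRAVI itself; the function name missing_dirs is hypothetical, and the script assumes it is run from inside project_home.

```python
from pathlib import Path

# Top-level directories listed in the required layout
REQUIRED_DIRS = ["analysis", "config", "data", "docs", "output", "workflow"]


def missing_dirs(project_home):
    """Return the required directories that are absent from project_home."""
    root = Path(project_home)
    return [d for d in REQUIRED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Assumption: the current working directory is project_home
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All required directories are present")
```

An empty result means all six directories exist; it says nothing about their contents.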
4.2 Alignments
The GRAVI workflow currently takes bam files as the primary input.
Multiple workflows exist for quality control, adapter removal and de-duplication and it is assumed that supplied reads will have been pre-processed with the above steps, then aligned to the genome of interest.
Files should be placed in the data/bam directory as set in config.yml, although this can be changed if desired.
project_home/
└── data
    └── bam
        ├── target1_control_rep1.bam
        ├── target1_control_rep2.bam
        ├── target1_control_rep3.bam
        ├── target1_treat_rep1.bam
        ├── target1_treat_rep2.bam
        ├── target1_treat_rep3.bam
        ├── target2_control_rep1.bam
        ├── target2_control_rep2.bam
        ├── target2_control_rep3.bam
        ├── target2_treat_rep1.bam
        ├── target2_treat_rep2.bam
        ├── target2_treat_rep3.bam
        └── input1.bam
4.3 Sample Descriptions
The file samples.tsv defines the set of files which the workflow will be applied to.
Any files placed in the data/bam directory but not specified in this file will be ignored.
The desired layout should be a tab-delimited file (i.e. tsv).
These can be generated using Excel, Notepad++, R, Visual Studio, or any other software you are comfortable with.
A brief example would follow the layout:
| sample | target | treat | replicate | input |
|---|---|---|---|---|
| target1_control_rep1 | Target1 | Control | 1 | input1 |
| target1_control_rep2 | Target1 | Control | 2 | input1 |
| target1_control_rep3 | Target1 | Control | 3 | input1 |
| target1_treat_rep1 | Target1 | Treat1 | 1 | input1 |
| target1_treat_rep2 | Target1 | Treat1 | 2 | input1 |
| target1_treat_rep3 | Target1 | Treat1 | 3 | input1 |
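Because the file is plain tab-delimited text, it can also be written programmatically. The sketch below uses Python's standard csv module to reproduce the six rows shown in the table above. The function name write_samples and the output path are hypothetical, so save the file wherever your configuration expects it.

```python
import csv

# Column order used in the example table
COLUMNS = ["sample", "target", "treat", "replicate", "input"]

# One row per bam file: three replicates for each of two treatment groups
rows = [
    {
        "sample": f"target1_{label}_rep{rep}",
        "target": "Target1",
        "treat": treat,
        "replicate": rep,
        "input": "input1",
    }
    for label, treat in [("control", "Control"), ("treat", "Treat1")]
    for rep in (1, 2, 3)
]


def write_samples(rows, path):
    """Write rows to path as a tab-delimited file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
```

Calling write_samples(rows, "samples.tsv") produces a header line followed by six tab-separated rows.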
4.3.1 Required columns
This file must contain all four of the columns sample, target, treat, input, in any order.
If supplied, optional columns such as replicate or passage can be referenced in the workflow.
As well as defining all required steps for the workflow, labels for plots will be generated from combinations of these columns.
- sample: This must be identical to the filename, but without the .bam extension.
- treat: This is used to define all comparisons.
- input: Each value must correspond to a file in data/bam, but without the .bam suffix. Each sample can have a separate input sample, or input samples can be shared across all or some of the samples.
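The column and filename rules above can be checked before starting a run. The following is a minimal sketch, not the workflow's own validation; the function name check_samples is hypothetical, and both paths are supplied by the caller.

```python
import csv
from pathlib import Path

# Columns the sample sheet must contain, in any order
REQUIRED_COLUMNS = {"sample", "target", "treat", "input"}


def check_samples(samples_tsv, bam_dir):
    """Return a list of problems found in samples_tsv (empty if none)."""
    with open(samples_tsv, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing column(s): {', '.join(sorted(missing))}"]
        rows = list(reader)

    bam_dir = Path(bam_dir)
    problems = []
    seen = set()  # shared input samples are only checked once
    for row in rows:
        for column in ("sample", "input"):
            name = row[column]
            if name in seen:
                continue
            seen.add(name)
            if not (bam_dir / f"{name}.bam").is_file():
                problems.append(f"{column} '{name}': no {name}.bam in {bam_dir}")
    return problems
```

For example, check_samples("samples.tsv", "data/bam") returns an empty list when every sample and input value has a matching bam file.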
4.3.2 Optional Columns
Any additional columns can be used to denote batches, or passages if running a nested/paired model.
These column names will be automatically detected at the appropriate steps of the workflow and incorporated into figures and tables.
Common column names include replicate or passage (for cell lines).
4.4 Additional Files
Additional, optional files can also be supplied, and it is customary to place these in data/external with paths (relative to project_home) added to config.yml.
Full details are demonstrated in Section 5.1.
Names can be any informative name chosen by the user.
project_home/
└── data
    ├── bam
    └── external
        ├── gencode_annotation.gtf.gz
        ├── blacklist.bed.gz
        ├── rnaseq_topTable.tsv
        ├── external_features.gtf.gz
        ├── hic_interactions.bedpe
        ├── additional_coverage_control.bw
        └── additional_coverage_treat.bw
4.4.1 RNA-Seq
Files containing differential expression results from a relevant RNA-Seq experiment should follow the layout produced by topTable() from the limma package (Ritchie et al. 2015).
Gene IDs should match those in the Gencode GTF (Ensembl IDs) and should be contained in a column called gene_id.
Additional expected columns are logFC and FDR, or similarly named columns that can reasonably be found by regex matching within the workflow.
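To see in advance whether a results file is likely to be recognised, its header can be tested against a few patterns. The sketch below only illustrates the idea: the patterns in PATTERNS and the function names are assumptions and are not the expressions used inside the workflow.

```python
import csv
import re

# Hypothetical patterns; the workflow's own matching rules may differ
PATTERNS = {
    "gene_id": re.compile(r"^gene_id$"),
    "logFC": re.compile(r"^log2?fc$", re.IGNORECASE),
    "FDR": re.compile(r"^(fdr|adj\.?p\.?val|padj)", re.IGNORECASE),
}


def match_columns(header):
    """Map each expected field to the first matching column name, or None."""
    return {
        field: next((name for name in header if pattern.search(name)), None)
        for field, pattern in PATTERNS.items()
    }


def read_header(path):
    """Return the column names from the first line of a tab-delimited file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"))
```

A value of None in the result of match_columns(read_header(path)) indicates a column that may need renaming before the file is supplied.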
4.4.2 External Features
These must be provided as a GTF which can be prepared by any method.
The feature types should be defined in a field named feature.
Non-overlapping features are optimal but not essential, and this is left to the user's discretion.
For example, if providing features such as enhancers and super-enhancers (Whyte et al. 2013), it may be more sensible to provide these as mutually exclusive groups.