Chapter 4 Files and Directories
4.1 Directory Structure
The GRAVI workflow requires a fixed directory structure. If using the template repository, as advised, most of this will already be in place. The required directory structure is:
project_home/
├── analysis
├── config
├── data
├── docs
├── output
└── workflow
- Rmarkdown scripts will be added to and executed from the `analysis` directory
- Key configuration files are provided in the `config` directory
- Your data should be placed in the `data` directory as described below
- The `html` output summarising all results will be produced in the `docs` directory
- Additional output files will be placed in `output`, with all figures written within the respective folder for each html page, or within `docs/assets`
- The workflow itself is run by all code supplied in the `workflow` directory
- The complete `R Environment` for each compiled RMarkdown document is also saved in `output/envs`. Given their larger sizes, these can be deleted using `snakemake --delete-temp-output --cores 1` to conserve storage space.
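
Before running the workflow, it can be useful to confirm that the layout listed above is in place. The following is a minimal sketch in Python, not part of GRAVI itself; the function name `missing_directories` is hypothetical, and only the six directory names come from the layout above.

```python
import tempfile
from pathlib import Path

# Directory names taken from the layout listed above
REQUIRED_DIRS = ("analysis", "config", "data", "docs", "output", "workflow")


def missing_directories(project_home):
    """Return the required directories that are absent from project_home."""
    root = Path(project_home)
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]


# Demonstration in a throwaway directory, so no real project is touched
with tempfile.TemporaryDirectory() as tmp:
    for name in ("analysis", "config", "data"):
        (Path(tmp) / name).mkdir()
    print(missing_directories(tmp))  # ['docs', 'output', 'workflow']
```

Calling `missing_directories(".")` from within `project_home` would return an empty list when all six directories exist.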
4.2 Alignments
The GRAVI workflow currently takes `bam` files as the primary input.
Multiple workflows exist for quality control, adapter removal and de-duplication, and it is assumed that supplied reads will have been pre-processed with these steps, then aligned to the genome of interest.
Files should be placed in the `data/bam` directory as set in `config.yml`, although this can be changed if desired.
project_home/
└── data
    └── bam
        ├── target1_control_rep1.bam
        ├── target1_control_rep2.bam
        ├── target1_control_rep3.bam
        ├── target1_treat_rep1.bam
        ├── target1_treat_rep2.bam
        ├── target1_treat_rep3.bam
        ├── target2_control_rep1.bam
        ├── target2_control_rep2.bam
        ├── target2_control_rep3.bam
        ├── target2_treat_rep1.bam
        ├── target2_treat_rep2.bam
        ├── target2_treat_rep3.bam
        └── input1.bam
4.3 Sample Descriptions
The file `samples.tsv` defines the set of files which the workflow will be applied to.
Any files placed in the `data/bam` directory but not specified in this file will be ignored.
The file should be tab-delimited (i.e. tsv).
It can be generated using Excel, Notepad++, R, Visual Studio, or any other software you are comfortable with.
A brief example would follow the layout:
sample | target | treat | replicate | input |
---|---|---|---|---|
target1_control_rep1 | Target1 | Control | 1 | input1 |
target1_control_rep2 | Target1 | Control | 2 | input1 |
target1_control_rep3 | Target1 | Control | 3 | input1 |
target1_treat_rep1 | Target1 | Treat1 | 1 | input1 |
target1_treat_rep2 | Target1 | Treat1 | 2 | input1 |
target1_treat_rep3 | Target1 | Treat1 | 3 | input1 |
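
For those who prefer to generate this file programmatically, the sketch below writes the six example rows above as tab-delimited text using Python's standard `csv` module. The helper names `example_rows` and `write_samples` are hypothetical and only serve this illustration.

```python
import csv
import io

COLUMNS = ["sample", "target", "treat", "replicate", "input"]


def example_rows():
    """Build the six example rows shown in the table above."""
    rows = []
    for label, treat in (("control", "Control"), ("treat", "Treat1")):
        for rep in (1, 2, 3):
            rows.append({
                "sample": f"target1_{label}_rep{rep}",
                "target": "Target1",
                "treat": treat,
                "replicate": rep,
                "input": "input1",
            })
    return rows


def write_samples(rows, handle):
    """Write rows to an open text handle as tab-delimited text with a header."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer here; pass a real file handle to save to disk
buffer = io.StringIO()
write_samples(example_rows(), buffer)
print(buffer.getvalue())
```

To save the result, open the destination with `open(path, "w", newline="")` and pass that handle to `write_samples` instead of the buffer.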
4.3.1 Required columns
This file must contain all four of the columns `sample`, `target`, `treat` and `input`, in any order.
If supplied, optional columns such as `replicate`, `passage` etc. can be referenced in the workflow.
As well as defining all required steps for the workflow, labels for plots will be generated from combinations of these columns.
- `sample`: This must be identical to the filename, but without the `.bam` extension.
- `treat`: This is used to define all comparisons.
- `input`: Each value must correspond to a file in `data/bam`, but without the `.bam` suffix. Each sample can have a separate input sample, or input samples can be shared across all or some of the samples.
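
These requirements can be checked before launching the workflow. The following is a minimal sketch, assuming the default `data/bam` location; the function `check_samples` is hypothetical and is not supplied by GRAVI. It confirms that the four required columns are present and that every `sample` and `input` value has a matching `.bam` file.

```python
import csv
import tempfile
from pathlib import Path

REQUIRED_COLUMNS = {"sample", "target", "treat", "input"}


def check_samples(samples_tsv, bam_dir):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    checked = set()
    with open(samples_tsv, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing required column(s): {', '.join(sorted(missing))}"]
        for row in reader:
            for column in ("sample", "input"):
                name = row[column]
                if (column, name) in checked:
                    continue  # shared inputs only need to be checked once
                checked.add((column, name))
                bam = Path(bam_dir) / f"{name}.bam"
                if not bam.is_file():
                    problems.append(f"{column} '{name}': {bam.name} not found")
    return problems


# Demonstration with a temporary project in which input1.bam is absent
with tempfile.TemporaryDirectory() as tmp:
    bam_dir = Path(tmp) / "data" / "bam"
    bam_dir.mkdir(parents=True)
    (bam_dir / "target1_control_rep1.bam").touch()
    samples = Path(tmp) / "samples.tsv"
    samples.write_text(
        "sample\ttarget\ttreat\treplicate\tinput\n"
        "target1_control_rep1\tTarget1\tControl\t1\tinput1\n"
    )
    print(check_samples(samples, bam_dir))
```

Optional columns such as `replicate` are simply ignored by this check, in keeping with their optional status.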
4.3.2 Optional Columns
Any additional columns can be used to denote batches, or passages if running a nested/paired model.
These column names will be automatically detected at the appropriate steps of the workflow and incorporated into figures and tables.
Common column names may be `replicate` or `passage` (for cell lines).
4.4 Additional Files
Additional, optional files can also be supplied, and it is customary to place these in `data/external`, with paths (relative to `project_home`) added to `config.yml`.
Full details are demonstrated in Section 5.1.
File names can be any informative name chosen by the user.
project_home/
└── data
    ├── bam
    └── external
        ├── gencode_annotation.gtf.gz
        ├── blacklist.bed.gz
        ├── rnaseq_topTable.tsv
        ├── external_features.gtf.gz
        ├── hic_interactions.bedpe
        ├── additional_coverage_control.bw
        └── additional_coverage_treat.bw
4.4.1 RNA-Seq
Files containing differential expression results from a relevant RNA-Seq experiment should follow the layout produced by `topTable()` from the `limma` package (Ritchie et al. 2015).
Gene IDs should match those in the Gencode GTF (Ensembl IDs) and should be contained in a column called `gene_id`.
Additional expected columns are `logFC` and `FDR`, or similar names which could reasonably be found by `regex` matching within the workflow.
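
To illustrate what locating columns by regex matching can look like, the sketch below maps the expected fields onto a header row. The patterns are assumptions chosen for this example: the expressions actually used inside the workflow are not given in this chapter and may accept a different set of names. The function `match_columns` is likewise hypothetical.

```python
import re

# Hypothetical patterns: the expressions GRAVI actually uses are not shown here
PATTERNS = {
    "gene_id": re.compile(r"^gene_id$"),
    "logFC": re.compile(r"^log_?fc$", re.IGNORECASE),
    "FDR": re.compile(r"^(fdr|adj\.?p\.?val(ue)?)$", re.IGNORECASE),
}


def match_columns(header):
    """Map each expected field to the first matching column name, or None."""
    found = {}
    for field, pattern in PATTERNS.items():
        found[field] = next((col for col in header if pattern.search(col)), None)
    return found


# Header resembling limma's topTable() output with a gene_id column added
header = ["gene_id", "gene_name", "logFC", "AveExpr", "t", "P.Value", "adj.P.Val"]
print(match_columns(header))
# {'gene_id': 'gene_id', 'logFC': 'logFC', 'FDR': 'adj.P.Val'}
```

Any field mapped to `None` indicates a column that would need renaming before the file is supplied. Naming columns exactly `gene_id`, `logFC` and `FDR` avoids relying on pattern matching at all.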
4.4.2 External Features
These must be provided as a GTF, which can be prepared by any method.
The feature types should be defined in a field named `feature`.
Non-overlapping features are optimal but not essential, and this is left to the user's discretion.
For example, if providing features such as enhancers and super-enhancers (Whyte et al. 2013), it may be more sensible to provide these as mutually exclusive groups.
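
Overlaps between supplied features can be identified before the file is passed to the workflow. The following is a minimal sketch under one loudly stated assumption: that the "field named `feature`" is an attribute in the ninth GTF column, written as `feature "enhancer";`. The helper names and the example records are hypothetical.

```python
import re
from itertools import combinations

# Assumes the feature type is stored as an attribute in GTF column nine
FEATURE_RE = re.compile(r'(?:^|;\s*)feature "([^"]+)"')


def parse_gtf_line(line):
    """Return (chrom, start, end, feature) from one tab-separated GTF line."""
    fields = line.rstrip("\n").split("\t")
    match = FEATURE_RE.search(fields[8])
    if match is None:
        raise ValueError("no 'feature' attribute found")
    return fields[0], int(fields[3]), int(fields[4]), match.group(1)


def overlapping_pairs(records):
    """Return pairs of records on the same chromosome whose ranges overlap.

    GTF coordinates are 1-based and inclusive, so ranges that share a single
    position are counted as overlapping.
    """
    return [
        (a, b)
        for a, b in combinations(records, 2)
        if a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]
    ]


# Made-up records for illustration only
example = [
    'chr1\tuser\tregion\t1000\t2000\t.\t.\t.\tfeature "enhancer";',
    'chr1\tuser\tregion\t1500\t5000\t.\t.\t.\tfeature "super_enhancer";',
    'chr1\tuser\tregion\t6000\t7000\t.\t.\t.\tfeature "enhancer";',
]
records = [parse_gtf_line(line) for line in example]
for first, second in overlapping_pairs(records):
    print(first, "overlaps", second)
```

The pairwise comparison is quadratic in the number of records, which is adequate for a quick check on a modest file; a sorted sweep or a dedicated genomic ranges library would suit larger annotation sets better.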