Chapter 5 YAML Configuration Files
All YAML files which ruin the workflow are located in the project_home/config
directory.
The standard YAML structure is used in all files with the primary objective being passing workflow parameters to the various steps of the workflow.
There are four files which control various aspects: config.yml
, colours.yml
, params.yml
and rmarkdown.yml
5.1 The Main Configuration: config.yml
This file sets many of the primary parameters and is the file which will need editing for any new dataset.
Many settings should remain unchanged as changing default locations of files may lead to unexpected instability in the workflow, whilst other setting should be changed, such as those which determine which comparisons to perform.
This is the only file parsed directly by snakemake
and subsequent rules, whilst all others are used to pass parameters to R
environments.
An example layout of config.yml
might be:
samples:
file: "config/samples.tsv"
paths:
bam: "data/bam"
genome:
build: "GRCh37"
external:
blacklist: "data/external/blacklist.bed.gz"
gtf: "data/external/gencode.v33lift37.annotation.gtf.gz"
rnaseq: "data/external/some_results.tsv"
features: "data/external/custom_features.gtf.gz"
hic: "data/external/hic_interactions.bedpe"
coverage:
H3K27Ac:
control: "data/external/H3K27Ac_control.bw"
treat: "data/external/H3K27Ac_treat.bw"
comparisons:
method: 'sq-lt'
fc: 1.2
fdr: 0.05
paired: false
filter_q: 0.6
contrasts:
- ["control", "treat"]
ihw: "regions"
peaks:
macs2:
gsize: "hs"
fdr: 0.05
keep_duplicates: "all"
qc:
outlier_threshold: 10
allow_zero: true
min_prop_reps: 0.5
Settings Which Don’t Need To Be Modified
In general, the paths to key files don’t need to be changed and default configurations are well tested. Whilst varying these has been intermittently attempted successfully, unexpected instability may occur and as such, is discouraged.
samples:
(Default: “config/samples.tsv”)
By default, the file which contains all sample-level information is defined as config/samples.tsv
.
This can be changed to reflect any requirements after inspecting QC reports.
paths:
bam:
(Default: “data/bam”) Alignments should be placed indata/bam
as advised in section 4.2, although this can be changed to any other path as desired.
Settings Which Should Be Modified
genome:
Specify the genome build used for alignments and for gene annotations.
external:
Provide paths to all additional data files here. Only files provided will be included in the workflow.
blacklist
Should be obtained from heregtf
: Should ideally be obtained from Gencode- (Optional)
rnaseq
should be the results as output bylimma::topTable()
or similar. Gene IDs should match those provided in the Gencode GTF (e.g. Ensembl IDs) and these should be in a column namesgene_id
. Columns such aslogFC
andFDR
will be searched for during the workflow using regular expressions to find the best match. Can only be incsv
ortsv
format. Excel-specific (xls
,xlsx
) formats are not supported. - (Optional)
features
can be determined by any method, with common choices being relevant histone marks, or promoters, enhancers and super-enhancers determined by H3K27ac marks. Features should be non-overlapping with the fieldfeature
defining the different feature types. Must be provided in GTF format. - (Optional)
hic
HiC interactions must be provided as abedpe
file - (Optional)
coverage
Tracks provided in this argument will be added to all genomic plots showing binding peaks or differential binding. Multiple files provided within each YAML list element will be overlaid as a single track. There is no upper limit to the number of tracks, however more tracks generally detract from an informative figure.
comparisons:
These settings determine how the differential binding analysis is performed.
method
(Default: “sq-lt”) Defines the strategy used for differential binding. The currently implemented options are “sq-lt” and “ls-ql” as described previouslyfc
(Default: 1.2) The setting of 1.2 indicates a 20% change in binding as the threshold below which we are not interested, or below which we consider binding changes to be inconsequential. This parameters is passed tolimma::treat()
(McCarthy and Smyth 2009) in all differential binding analyses.fdr
(Default: 0.05) Windows with significance below this threshold are considered to provide supporting evidence of differential binding.paired
(Default:false
) If set totrue
the values in the optional column (e.g. replicate, passage etc.) are used to perform a paired analysis as described in thelimma
manualfilter_q
(Default: 0.6) Passed toextraChIPs::dualFilter()
. When filtering (i.e. discarding) genomic sliding windows which are unlikely to contain true binding signal, determine thresholds which will retain this proportion of windows which overlap a peak identified bymacs2 callpeak
.contrasts
: Define all contrasts for differential binding. Any ChIP target containing both treatment groups will be included for differential binding. Values must match those in thetreat
column ofsamples.tsv
. Each differential binding analysis will be performed using the limma-trend method in the context of A vs B, such that complex models are not supported. Use new YAML list elements to define additional contrastsihw
(Default: “regions”) Options used to stratify p-values for Independent Hypothesis Weighting(Ignatiadis et al. 2016) of differential binding results. Can take values"regions"
,"features"
,"targets"
or"none"
"regions"
P-values will be stratified by annotated genomic regions as determined in the initial steps of the workflow"features"
P-values will be stratified by provided external features"targets"
P-values will be stratified by the presence of a union peak using all other ChIP targets"none"
No Independent Hypothesis Weighting will be performed on the results from differential binding
peaks:
macs2
settings are passed to macs2 callpeak
. Only the arguments gsize
, fdr
and keep_duplicates
are accepted.
Please see the macs2
manual for more detailed explanations.
qc
parameters are used for determining if samples are of a high enough quality, and how to determine union/treatment peaks for each target and treatment group.
outlier_threshold
(Default: 10) As previously described, samples with peak numbers beyond this value relative to the median peak numbers in each treatment group are marked for removal. To ignore this parameter, set this to.inf
which effectively sets the threshold to \(\infty\)allow_zero
(Default:true
) Allow samples with no identified peaks. If a ChIP target is expected to be cytoplasmic in one condition, this can allow samples with no peaks to be retained.min_prop_reps
(Default: 0.5) When forming treatment-specific peaks within each target and treatment group, peaks must be represented in at least this proportion of samples. This defaults to 0.5 which would equate to 2 of 3 samples, 2 of 4 samples etc. This value may need to be altered pending the results of a complete run after Quality Assessment has performed.
5.2 Colour Schemes: colours.yml
[#colours-yml]
Defines all plotting colour schemes for consistency throughout the entire workflow.
Colours can be any standard colours able to be interpreted by R
, such as 'blue'
or '#0000FF'
.
Recommended YAML list elements are qc
, regions
, direction
and treat
.
A gradient through al colours provided in the heatmaps
element will be generated during generation of heatmaps.
Any unspecified colours will be automatically assigned and will propagate through the workflow.
As is standard across most programming languages, names are case-sensitive.
An example file is given below:
qc:
pass: "#0571B0"
fail: "#CA0020"
regions:
promoter: '#FF3300'
upstream_promoter: '#E1EE05'
exon: '#7EDD57'
intron: '#006600'
proximal_intergenic: '#000066'
distal_intergenic: '#551A8B'
treat:
Input: "#33333380"
E2: "#3333CC"
E2DHT: "#C52240"
features:
enhancer: "#FFFF00"
no_feature: "#E5E5E5"
direction:
up: "#CA0020"
down: "#0571B0"
unchanged: "#7F7F7F"
undetected: "#E5E5E5"
heatmaps: ['white', 'red']
5.3 Additional Parameters: params.yml
Default settings for defining initial annotations, mapping of peaks to genes and enrichment testing. In general, these will not need to be changed, but can be if required. In particular, the value enh2gene can be set to zero if using data such as H3K27ac HiChIP to determine long-range interactions.
For enrichment testing, the preferred p-value adjustment method can be specified using any value that match p.adjust.methods
with an appropriate threshold also set in this section.
Default settings are relatively inclusive and more stringent setting may be preferred by some users.
Parameters for msigdb
(Liberzon et al. 2015) are passed to msigdbr
(Dolgalev 2022) and fields should match this layout.
Any categories passed to gs_cat
will lead to all subcategories being used from that category.
Specific sub-categories of larger databases can be passed using gs_subcat
element.
Values which determine the minimum and maximum size of any network plots are also defined here. Node-pair distances above the given distance will be removed from the plot and these edges excluded.
gene_regions:
promoters:
upstream: 1500
downstream: 500
upstream: 5000
intergenic: 10000
## The values used when mapping peaks to genes.
## Passed to `extraChIPs::mapByFeature()`
## If including H3K27ac HiChIP for long range-interactions, it is advised to
## set `enh2gene` as zero, given that long range interactions in this case
## will more accurately map long-range enhancer interactions than using all
## genes within a given distance
mapping:
gr2gene: 100000
prom2gene: 0
enh2gene: 100000
gi2gene: 0
enrichment:
adj: "fdr"
alpha: 0.05
## Only gene-sets between these two values will be retained before testing
min_size: 5
max_size: 1000
## Only gene-sets above this size will be shown in the results
min_sig: 3
## The categories to use from MSigDB. These are passed to msigdbr in the
## columns of the same name in an 'OR' approach
species: "Homo sapiens"
msigdb:
gs_cat: "H"
gs_subcat:
- "CP:KEGG"
- "CP:REACTOME"
- "CP:WIKIPATHWAYS"
- "TFT:GTRD"
## Used for network plots
networks:
min_size: 4
max_size: 80
max_distance: 0.9
layout: 'fr'
5.4 HTML Settings: rmarkdown.yml
The main workflow will produce a compiled set of HTML pages using rmarkdown::render_site()
(Allaire et al. 2022).
The two available fields to supply here are:
knitr_opts
which are passed toknitr::opts_chunk$set()
(Xie 2022) at the beginning of every compiled Rmarkdown document, andrmarkdown_site
which determines the layout and style of the final HTML report. All left elements of thenavbar
are determined automatically during the workflow and will be ignored if supplied here, whilst all other parameters are passed via the file_site.yaml
which will be generated during the workflow.
Figure height and width parameters must be passed to knitr::opts_set()
in inches, and the workflow defaults to svg
output for most figures.
Some figures which can become excessively large under this setting, such as those with many thousands of overlapping points, are set to always produce png
output.
Multiple output formats can also be set using values such as ['png', 'pdf']
if preferred.
knitr_opts:
echo: TRUE
message: FALSE
warning: FALSE
dev: ["png", "pdf"]
fig.align: "center"
fig.width: 10
fig.height: 8
rmarkdown_site:
name: "GRAVI: Gene Regulatory Analysis"
output_dir: "../docs"
navbar:
title: "GRAVI"
right:
- icon: fa-github
href: "https://github.com/smped/GRAVI"
output:
html_document:
toc: yes
toc_float: yes
code_folding: hide
self_contained: false
theme: sandstone
highlight: textmate
css: gravi.css
includes:
after_body: footer.html