Chapter 5 YAML Configuration Files

All YAML files which ruin the workflow are located in the project_home/config directory. The standard YAML structure is used in all files with the primary objective being passing workflow parameters to the various steps of the workflow. There are four files which control various aspects: config.yml, colours.yml, params.yml and rmarkdown.yml

5.1 The Main Configuration: config.yml

This file sets many of the primary parameters and is the file which will need editing for any new dataset. Many settings should remain unchanged as changing default locations of files may lead to unexpected instability in the workflow, whilst other setting should be changed, such as those which determine which comparisons to perform. This is the only file parsed directly by snakemake and subsequent rules, whilst all others are used to pass parameters to R environments.

An example layout of config.yml might be:

samples:
  file: "config/samples.tsv"

paths:
  bam: "data/bam"

genome:
  build: "GRCh37"

external:
  blacklist: "data/external/blacklist.bed.gz"
  gtf: "data/external/gencode.v33lift37.annotation.gtf.gz"
  rnaseq: "data/external/some_results.tsv"
  features: "data/external/custom_features.gtf.gz"
  hic: "data/external/hic_interactions.bedpe"
  coverage:
    H3K27Ac:
      control: "data/external/H3K27Ac_control.bw"
      treat: "data/external/H3K27Ac_treat.bw"

comparisons:
  method: 'sq-lt'
  fc: 1.2
  fdr: 0.05
  paired: false
  filter_q: 0.6
  contrasts:
    - ["control", "treat"]
  ihw: "regions"

peaks:
  macs2:
    gsize: "hs"
    fdr: 0.05
    keep_duplicates: "all"
  qc:
    outlier_threshold: 10
    allow_zero: true
    min_prop_reps: 0.5

Settings Which Don’t Need To Be Modified

In general, the paths to key files don’t need to be changed and default configurations are well tested. Whilst varying these has been intermittently attempted successfully, unexpected instability may occur and as such, is discouraged.

samples: (Default: “config/samples.tsv”)

By default, the file which contains all sample-level information is defined as config/samples.tsv. This can be changed to reflect any requirements after inspecting QC reports.

paths:

  • bam: (Default: “data/bam”) Alignments should be placed in data/bam as advised in section 4.2, although this can be changed to any other path as desired.

Settings Which Should Be Modified

genome:

Specify the genome build used for alignments and for gene annotations.

external:

Provide paths to all additional data files here. Only files provided will be included in the workflow.

  • blacklist Should be obtained from here
  • gtf: Should ideally be obtained from Gencode
  • (Optional) rnaseq should be the results as output by limma::topTable() or similar. Gene IDs should match those provided in the Gencode GTF (e.g. Ensembl IDs) and these should be in a column names gene_id. Columns such as logFC and FDR will be searched for during the workflow using regular expressions to find the best match. Can only be in csv or tsv format. Excel-specific (xls, xlsx) formats are not supported.
  • (Optional) features can be determined by any method, with common choices being relevant histone marks, or promoters, enhancers and super-enhancers determined by H3K27ac marks. Features should be non-overlapping with the field feature defining the different feature types. Must be provided in GTF format.
  • (Optional) hic HiC interactions must be provided as a bedpe file
  • (Optional) coverage Tracks provided in this argument will be added to all genomic plots showing binding peaks or differential binding. Multiple files provided within each YAML list element will be overlaid as a single track. There is no upper limit to the number of tracks, however more tracks generally detract from an informative figure.

comparisons:

These settings determine how the differential binding analysis is performed.

  • method (Default: “sq-lt”) Defines the strategy used for differential binding. The currently implemented options are “sq-lt” and “ls-ql” as described previously
  • fc (Default: 1.2) The setting of 1.2 indicates a 20% change in binding as the threshold below which we are not interested, or below which we consider binding changes to be inconsequential. This parameters is passed to limma::treat() (McCarthy and Smyth 2009) in all differential binding analyses.
  • fdr (Default: 0.05) Windows with significance below this threshold are considered to provide supporting evidence of differential binding.
  • paired (Default: false) If set to true the values in the optional column (e.g. replicate, passage etc.) are used to perform a paired analysis as described in the limma manual
  • filter_q (Default: 0.6) Passed to extraChIPs::dualFilter(). When filtering (i.e. discarding) genomic sliding windows which are unlikely to contain true binding signal, determine thresholds which will retain this proportion of windows which overlap a peak identified by macs2 callpeak.
  • contrasts: Define all contrasts for differential binding. Any ChIP target containing both treatment groups will be included for differential binding. Values must match those in the treat column of samples.tsv. Each differential binding analysis will be performed using the limma-trend method in the context of A vs B, such that complex models are not supported. Use new YAML list elements to define additional contrasts
  • ihw (Default: “regions”) Options used to stratify p-values for Independent Hypothesis Weighting(Ignatiadis et al. 2016) of differential binding results. Can take values "regions", "features", "targets" or "none"
    • "regions" P-values will be stratified by annotated genomic regions as determined in the initial steps of the workflow
    • "features" P-values will be stratified by provided external features
    • "targets" P-values will be stratified by the presence of a union peak using all other ChIP targets
    • "none" No Independent Hypothesis Weighting will be performed on the results from differential binding

peaks:

macs2 settings are passed to macs2 callpeak. Only the arguments gsize, fdr and keep_duplicates are accepted. Please see the macs2 manual for more detailed explanations.

qc parameters are used for determining if samples are of a high enough quality, and how to determine union/treatment peaks for each target and treatment group.

  • outlier_threshold (Default: 10) As previously described, samples with peak numbers beyond this value relative to the median peak numbers in each treatment group are marked for removal. To ignore this parameter, set this to .inf which effectively sets the threshold to \(\infty\)
  • allow_zero (Default: true) Allow samples with no identified peaks. If a ChIP target is expected to be cytoplasmic in one condition, this can allow samples with no peaks to be retained.
  • min_prop_reps (Default: 0.5) When forming treatment-specific peaks within each target and treatment group, peaks must be represented in at least this proportion of samples. This defaults to 0.5 which would equate to 2 of 3 samples, 2 of 4 samples etc. This value may need to be altered pending the results of a complete run after Quality Assessment has performed.

5.2 Colour Schemes: colours.yml [#colours-yml]

Defines all plotting colour schemes for consistency throughout the entire workflow. Colours can be any standard colours able to be interpreted by R, such as 'blue' or '#0000FF'. Recommended YAML list elements are qc, regions, direction and treat. A gradient through al colours provided in the heatmaps element will be generated during generation of heatmaps. Any unspecified colours will be automatically assigned and will propagate through the workflow. As is standard across most programming languages, names are case-sensitive. An example file is given below:

qc:
  pass: "#0571B0"
  fail: "#CA0020"
regions:
  promoter: '#FF3300'
  upstream_promoter: '#E1EE05'
  exon: '#7EDD57'
  intron: '#006600'
  proximal_intergenic: '#000066'
  distal_intergenic: '#551A8B'
treat:
  Input: "#33333380"
  E2: "#3333CC"
  E2DHT: "#C52240"
features:
  enhancer: "#FFFF00"
  no_feature: "#E5E5E5"
direction:
  up: "#CA0020"
  down: "#0571B0"
  unchanged: "#7F7F7F"
  undetected: "#E5E5E5"
heatmaps: ['white', 'red']

5.3 Additional Parameters: params.yml

Default settings for defining initial annotations, mapping of peaks to genes and enrichment testing. In general, these will not need to be changed, but can be if required. In particular, the value enh2gene can be set to zero if using data such as H3K27ac HiChIP to determine long-range interactions.

For enrichment testing, the preferred p-value adjustment method can be specified using any value that match p.adjust.methods with an appropriate threshold also set in this section. Default settings are relatively inclusive and more stringent setting may be preferred by some users.

Parameters for msigdb(Liberzon et al. 2015) are passed to msigdbr(Dolgalev 2022) and fields should match this layout. Any categories passed to gs_cat will lead to all subcategories being used from that category. Specific sub-categories of larger databases can be passed using gs_subcat element.

Values which determine the minimum and maximum size of any network plots are also defined here. Node-pair distances above the given distance will be removed from the plot and these edges excluded.

gene_regions:
  promoters:
    upstream: 1500
    downstream: 500
  upstream: 5000
  intergenic: 10000

## The values used when mapping peaks to genes.
## Passed to `extraChIPs::mapByFeature()`
## If including H3K27ac HiChIP for long range-interactions, it is advised to
## set `enh2gene` as zero, given that long range interactions in this case
## will more accurately map long-range enhancer interactions than using all
## genes within a given distance
mapping:
  gr2gene: 100000
  prom2gene: 0
  enh2gene: 100000
  gi2gene: 0

enrichment:
  adj: "fdr"
  alpha: 0.05
  ## Only gene-sets between these two values will be retained before testing
  min_size: 5
  max_size: 1000
  ## Only gene-sets above this size will be shown in the results
  min_sig: 3
  ## The categories to use from MSigDB. These are passed to msigdbr in the
  ## columns of the same name in an 'OR' approach
  species: "Homo sapiens"
  msigdb:
    gs_cat: "H"
    gs_subcat:
      - "CP:KEGG"
      - "CP:REACTOME"
      - "CP:WIKIPATHWAYS"
      - "TFT:GTRD"

## Used for network plots
networks:
  min_size: 4
  max_size: 80
  max_distance: 0.9
  layout: 'fr'

5.4 HTML Settings: rmarkdown.yml

The main workflow will produce a compiled set of HTML pages using rmarkdown::render_site()(Allaire et al. 2022). The two available fields to supply here are:

  1. knitr_opts which are passed to knitr::opts_chunk$set()(Xie 2022) at the beginning of every compiled Rmarkdown document, and
  2. rmarkdown_site which determines the layout and style of the final HTML report. All left elements of the navbar are determined automatically during the workflow and will be ignored if supplied here, whilst all other parameters are passed via the file _site.yaml which will be generated during the workflow.

Figure height and width parameters must be passed to knitr::opts_set() in inches, and the workflow defaults to svg output for most figures. Some figures which can become excessively large under this setting, such as those with many thousands of overlapping points, are set to always produce png output. Multiple output formats can also be set using values such as ['png', 'pdf'] if preferred.

knitr_opts:
  echo: TRUE
  message: FALSE
  warning: FALSE
  dev: ["png", "pdf"]
  fig.align: "center"
  fig.width: 10
  fig.height: 8

rmarkdown_site:
  name: "GRAVI: Gene Regulatory Analysis"
  output_dir: "../docs"
  navbar:
    title: "GRAVI"
    right:
      - icon: fa-github
        href: "https://github.com/smped/GRAVI"
  output:
    html_document:
      toc: yes
      toc_float: yes
      code_folding: hide
      self_contained: false
      theme: sandstone
      highlight: textmate
      css: gravi.css
      includes:
        after_body: footer.html

References

Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2022. Rmarkdown: Dynamic Documents for r. https://CRAN.R-project.org/package=rmarkdown.
Dolgalev, Igor. 2022. Msigdbr: MSigDB Gene Sets for Multiple Organisms in a Tidy Data Format. https://CRAN.R-project.org/package=msigdbr.
Ignatiadis, N., B. Klaus, J. B. Zaugg, and W. Huber. 2016. Data-driven hypothesis weighting increases detection power in genome-scale multiple testing.” Nat Methods 13 (7): 577–80.
Liberzon, A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, J. P. Mesirov, and P. Tamayo. 2015. The Molecular Signatures Database (MSigDB) hallmark gene set collection.” Cell Syst 1 (6): 417–25.
McCarthy, Davis J., and Gordon K. Smyth. 2009. Testing significance relative to a fold-change threshold is a TREAT.” Bioinformatics 25 (6): 765–71. https://doi.org/10.1093/bioinformatics/btp053.
Xie, Yihui. 2022. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.