library(DiagrammeR)
library(tidyverse)
library(glue)
library(yaml)
library(here)
library(reactable)
library(pander)
config <- read_yaml(here::here("config/config.yml"))
samples <- read_tsv(here::here(config$samples))

Introduction

This is a simple snakemake workflow for preparing single-end ChIP-Seq libraries. The key steps involved are:

  1. Download files from the Sequence Read Archive
  2. Remove adapter sequences and low quality reads using AdapterRemoval (Schubert, Lindgreen, and Orlando 2016)
  3. Use bowtie2 to align to the genome (Langmead and Salzberg 2012)
  4. De-duplicate reads is markDuplicates
  5. Run macs2 callpeak (Zhang et al. 2008) on
    1. All individual samples
    2. All samples within a common target and treatment group

Multiple QC reports are created throughout the workflow using FastQC, MultiQC (Ewels et al. 2016) and ngsReports (Ward, To, and Pederson 2020)

A full description of the workflow is available at https://github.com/smped/prepareChIPs, with the rulegraph summarised below.

here::here("workflow", "rules", "rulegraph.dot") %>%
  readLines() %>%
  str_replace_all("(.+graph.+)(\\];)", "\\1, rankdir = LR\\2") %>% 
  str_replace_all("fontsize=10", "fontsize=12") %>% 
  str_replace_all("_", "\n") %>% 
  str_replace_all("snakemake\ndag", "snakemakg_dag") %>%
  (function(x) gsub("(color.+)(style)", "color = \\\"red\\\",\\2", x)) %>%
  (function(x) gsub("(compile|create|qc)(.+)(color.+)(style)", "\\1\\2color = \\\"forestgreen\\\",\\4", x)) %>% 
  (function(x) gsub("(get.+)(color.+)(style)", "\\1color = \\\"royalblue\\\",\\3", x)) %>%
  (function(x) gsub("(\\nfastq.+)(color.+)(style)", "\\1color = \\\"red\\\",\\3", x)) %>%
  grViz()

Snakemake rulegraph for the prepareChIPs workflow. The primary data processing workflow is showin in red, with QC and reporting steps shown in green. Any steps which produce additional, custom statistics are shown in blue.

Provided Samples

The complete set of samples provided is shown in the following table

samples %>% 
  rename_with(function(x) str_replace_all(x, "_", " "), everything()) %>% 
  rename_with(str_to_title, everything()) %>% 
  reactable(
    sortable = TRUE, filterable = TRUE, resizable = TRUE,
    showPageSizeOptions = TRUE
  )

Workflow Parameters

Downloaded files are to be aligned to a genome located in /hpcfs/users/a1018048/refs/gencode-release-33/GRCh37/dna/bt2/ with the prefix GRCh37.primary_assembly.genome

All other parameters were specified to be

  • adapterremoval:

    • extra: –maxns 1 –minlength 50 –minquality 30 –gzip –trimqualities
    • adapter1: –adapter1 AGATCGGAAGAGC
  • bowtie2:

  • fastqc: –nogroup –noextract

  • macs2:

    • callpeak: -g hs –keep-dup all -q 0.05
    • bdgcmp: -m FE
  • markduplicates: –REMOVE_DUPLICATES true

  • multiqc:

  • samtools:

    • sort:
    • view: -q2
    • index:

References

Ewels, Philip, Måns Magnusson, Sverker Lundin, and Max Käller. 2016. MultiQC: Summarize Analysis Results for Multiple Tools and Samples in a Single Report.” Bioinformatics 32 (19): 3047–48.
Langmead, Ben, and Steven L Salzberg. 2012. “Fast Gapped-Read Alignment with Bowtie 2.” Nat. Methods 9 (4): 357–59.
Schubert, Mikkel, Stinus Lindgreen, and Ludovic Orlando. 2016. AdapterRemoval V2: Rapid Adapter Trimming, Identification, and Read Merging.” BMC Res. Notes 9 (February): 88.
Ward, Christopher M, Thu-Hien To, and Stephen M Pederson. 2020. “ngsReports: A Bioconductor Package for Managing FastQC Reports and Other NGS Related Log Files.” Bioinformatics 36 (8): 2587–88.
Zhang, Yong, Tao Liu, Clifford A Meyer, Jérôme Eeckhoute, David S Johnson, Bradley E Bernstein, Chad Nusbaum, et al. 2008. “Model-Based Analysis of ChIP-Seq (MACS).” Genome Biol. 9 (9): R137.

R version 4.2.3 (2023-03-15)

Platform: x86_64-conda-linux-gnu (64-bit)

locale: LC_CTYPE=en_AU.UTF-8, LC_NUMERIC=C, LC_TIME=en_AU.UTF-8, LC_COLLATE=en_AU.UTF-8, LC_MONETARY=en_AU.UTF-8, LC_MESSAGES=en_AU.UTF-8, LC_PAPER=en_AU.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_AU.UTF-8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: pander(v.0.6.5), reactable(v.0.4.4), here(v.1.0.1), yaml(v.2.3.7), glue(v.1.6.2), lubridate(v.1.9.2), forcats(v.1.0.0), stringr(v.1.5.0), dplyr(v.1.1.2), purrr(v.1.0.1), readr(v.2.1.4), tidyr(v.1.3.0), tibble(v.3.2.1), ggplot2(v.3.4.2), tidyverse(v.2.0.0) and DiagrammeR(v.1.0.10)

loaded via a namespace (and not attached): tidyselect(v.1.2.0), xfun(v.0.39), bslib(v.0.5.0), colorspace(v.2.1-0), vctrs(v.0.6.3), generics(v.0.1.3), htmltools(v.0.5.5), utf8(v.1.2.3), rlang(v.1.1.1), jquerylib(v.0.1.4), pillar(v.1.9.0), reactR(v.0.4.4), withr(v.2.5.0), bit64(v.4.0.5), RColorBrewer(v.1.1-3), lifecycle(v.1.0.3), munsell(v.0.5.0), gtable(v.0.3.3), visNetwork(v.2.1.2), htmlwidgets(v.1.6.2), evaluate(v.0.21), knitr(v.1.43), tzdb(v.0.4.0), fastmap(v.1.1.1), crosstalk(v.1.2.0), parallel(v.4.2.3), fansi(v.1.0.4), Rcpp(v.1.0.10), scales(v.1.2.1), cachem(v.1.0.8), vroom(v.1.6.3), jsonlite(v.1.8.5), bit(v.4.0.5), hms(v.1.1.3), digest(v.0.6.31), stringi(v.1.7.12), grid(v.4.2.3), rprojroot(v.2.0.3), cli(v.3.6.1), tools(v.4.2.3), magrittr(v.2.0.3), sass(v.0.4.6), crayon(v.1.5.2), pkgconfig(v.2.0.3), ellipsis(v.0.3.2), timechange(v.0.2.0), rmarkdown(v.2.21), rstudioapi(v.0.14), R6(v.2.5.1) and compiler(v.4.2.3)