extraChIPs

Bioc 2023 Package Demonstration

Stevie Pederson (They/Them)

August 2, 2023

https://smped.github.io/Bioc2023_extraChIPs/

Introduction

Who Am I?

  • Post-doctoral Bioinformatician at Black Ochre Data Labs
    • Telethon Kids Institute, Adelaide, Australia
    • Traditional lands of the Kaurna nation
  • Improving health outcomes for Indigenous Australians
    • Type 2 Diabetes, Cardiovascular Disease, Chronic Kidney Disease
  • 1400 participants in multi-omics study
    • Genomics, Epigenomics, Transcriptomics, Proteomics, Lipidomics etc
  • Also the developer/maintainer of ngsReports

Why extraChIPs?

  • Dame Roma Mitchell Cancer Research Laboratories (UofA, 2020-2022)
    • Integrating multiple ChIP targets across multiple cell lines
    • Regulatory dynamics of AR activation with DHT (or SARMs)
  • Built to enable the GRAVI workflow:
    • Gene Regulatory Analysis using Variable Inputs (Poster Thursday)
    • Integration of multiple ChIP targets with RNA-Seq + HiC
    • Wanted to standardise ‘best practice’ across multiple datasets
    • Design evolved alongside GRAVI

Why extraChIPs?

  • Simplify common processes:
    • Flexible detection of Differential Signal (Fixed-Width or Sliding Windows)
    • Maintaining mcols throughout an analysis
    • Easier comparison across multiple targets or treatments
    • Enable common visualisations
  • Use existing object classes:
    • SummarizedExperiment, GRanges, GRangesList, List
  • Applicable well beyond ChIP-Seq (DNA-methylation, ATAC-Seq etc)

Key Functions

Core Infrastructure

  • reduceMC, intersectMC, setdiffMC, unionMC
  • chopMC, distinctMC
  • as_tibble, colToRanges


Working With Peaks/GRanges

  • importPeaks, makeConsensus, plotOverlaps
  • bestOverlap, propOverlap
  • defineRegions, mapByFeature
  • mapGrlCols 1

Differential Signal Analysis

  • fitAssayDiff, addDiffStatus
  • dualFilter,
  • mergeByHMP, mergeByCol, mergeBySig


Visualisations

  • plotAssayDensities, plotAssayPCA, plotAssayRle
  • plotPie, plotSplitDonut
  • plotPairwise 1
  • plotProfileHeatmap, plotHFGC

Core Infrastructure

*MC Functions

  • Existing functions reduce, setdiff, intersect, union
    • All will drop information in the mcols element
  • reduceMC, setdiffMC, intersectMC and unionMC
    • Retain all information in mcols element
    • Returned as CompressedList or atomic vectors
    • Heavily used internally

reduce Vs. reduceMC

x
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |          id        gene
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]     chr1      1-10      * |      range1       gene1
  [2]     chr1      6-12      * |      range2       gene1
  -------
  seqinfo: 1 sequence from example genome

reduce will simply combine (i.e. reduce) overlapping ranges

reduce(x)
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      1-12      *
  -------
  seqinfo: 1 sequence from example genome

reduce Vs. reduceMC

x
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |          id        gene
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]     chr1      1-10      * |      range1       gene1
  [2]     chr1      6-12      * |      range2       gene1
  -------
  seqinfo: 1 sequence from example genome

reduceMC will also combine values in the mcols element

reduceMC(x)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |              id        gene
         <Rle> <IRanges>  <Rle> | <CharacterList> <character>
  [1]     chr1      1-12      * |   range1,range2       gene1
  -------
  seqinfo: 1 sequence from example genome

mcols are simplified by default (simplify = TRUE)

tibble Coercion

  • GRanges can be coerced to tibble objects
  • Ranges are coerced to a character column
tbl <- as_tibble(x)
tbl
# A tibble: 2 × 3
  range     id     gene 
  <chr>     <chr>  <chr>
1 chr1:1-10 range1 gene1
2 chr1:6-12 range2 gene1
  • rangeAsChar = FALSE will produce similar to as.data.frame
as_tibble(x, rangeAsChar = FALSE)
# A tibble: 2 × 7
  seqnames start   end width strand id     gene 
  <fct>    <int> <int> <int> <fct>  <chr>  <chr>
1 chr1         1    10    10 *      range1 gene1
2 chr1         6    12     7 *      range2 gene1

Coercing tibble objects back to GRanges

tbl
# A tibble: 2 × 3
  range     id     gene 
  <chr>     <chr>  <chr>
1 chr1:1-10 range1 gene1
2 chr1:6-12 range2 gene1
colToRanges(tbl, var = "range")
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |          id        gene
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]     chr1      1-10      * |      range1       gene1
  [2]     chr1      6-12      * |      range2       gene1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Seqinfo objects can also be provided

colToRanges(tbl, var = "range", seqinfo = seqinfo(x))
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |          id        gene
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]     chr1      1-10      * |      range1       gene1
  [2]     chr1      6-12      * |      range2       gene1
  -------
  seqinfo: 1 sequence from example genome

colToRanges

  • colToRanges can also move GRanges objects in mcols to the ‘backbone’ ranges
x$centre <- GRanges(paste0("chr1:", c(5, 10)), seqinfo = seqinfo(x))
x
GRanges object with 2 ranges and 3 metadata columns:
      seqnames    ranges strand |          id        gene    centre
         <Rle> <IRanges>  <Rle> | <character> <character> <GRanges>
  [1]     chr1      1-10      * |      range1       gene1    chr1:5
  [2]     chr1      6-12      * |      range2       gene1   chr1:10
  -------
  seqinfo: 1 sequence from example genome


colToRanges(x, var = "centre")
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |          id        gene
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]     chr1         5      * |      range1       gene1
  [2]     chr1        10      * |      range2       gene1
  -------
  seqinfo: 1 sequence from example genome

Working With Peaks

Demonstration Data 1

## Define Samples
targets <- c("ER", "H3K27ac")
treat_levels <- c("E2", "E2DHT")
samples <- tibble(
  accession = paste0("SRR83151", 80:91),
  target = rep(targets, each = 6),
  treatment = treat_levels %>% 
    rep(each = 3) %>% 
    rep_len(12) %>% 
    factor(levels = treat_levels),
  input = "SRR8315192"
)
## Split for easier wrangling
samples <- split(samples, samples$target)
samples
$ER
# A tibble: 6 × 4
  accession  target treatment input     
  <chr>      <chr>  <fct>     <chr>     
1 SRR8315180 ER     E2        SRR8315192
2 SRR8315181 ER     E2        SRR8315192
3 SRR8315182 ER     E2        SRR8315192
4 SRR8315183 ER     E2DHT     SRR8315192
5 SRR8315184 ER     E2DHT     SRR8315192
6 SRR8315185 ER     E2DHT     SRR8315192

$H3K27ac
# A tibble: 6 × 4
  accession  target  treatment input     
  <chr>      <chr>   <fct>     <chr>     
1 SRR8315186 H3K27ac E2        SRR8315192
2 SRR8315187 H3K27ac E2        SRR8315192
3 SRR8315188 H3K27ac E2        SRR8315192
4 SRR8315189 H3K27ac E2DHT     SRR8315192
5 SRR8315190 H3K27ac E2DHT     SRR8315192
6 SRR8315191 H3K27ac E2DHT     SRR8315192

Data Directory Contents

-- annotations
   |__chr10_blacklist.bed
   |__chr10_greylist.bed
   |__gencode.v43lift37.chr10.annotation.gtf.gz
-- ER
   |__E2_cov_chr10.bw
   |__E2DHT_cov_chr10.bw
   |__SRR8315180_peaks.narrowPeak
   |__SRR8315180.bam
   |__SRR8315180.bam.bai
   |__SRR8315181_peaks.narrowPeak
   |__SRR8315181.bam
   |__SRR8315181.bam.bai
   |__SRR8315182_peaks.narrowPeak
   |__SRR8315182.bam
   |__SRR8315182.bam.bai
   |__SRR8315183_peaks.narrowPeak
   |__SRR8315183.bam
   |__SRR8315183.bam.bai
   |__SRR8315184_peaks.narrowPeak
   |__SRR8315184.bam
   |__SRR8315184.bam.bai
   |__SRR8315185_peaks.narrowPeak
   |__SRR8315185.bam
   |__SRR8315185.bam.bai
-- H3K27ac
   |__E2_cov_chr10.bw
   |__E2DHT_cov_chr10.bw
   |__H3K27ac_chr10.bed
   |__SRR8315186.bam
   |__SRR8315186.bam.bai
   |__SRR8315187.bam
   |__SRR8315187.bam.bai
   |__SRR8315188.bam
   |__SRR8315188.bam.bai
   |__SRR8315189.bam
   |__SRR8315189.bam.bai
   |__SRR8315190.bam
   |__SRR8315190.bam.bai
   |__SRR8315191.bam
   |__SRR8315191.bam.bai
   |__SRR8315192.bam
   |__SRR8315192.bam.bai

Additional Objects

## A consistent Seqinfo object
sq <- Seqinfo(
  seqnames = "chr10", seqlengths = 135534747, isCircular = FALSE,  genome = "hg19"
)

## Gene Annotations as a GTF, split by feature type
gtf <- file.path(
  "data", "annotations", "gencode.v43lift37.chr10.annotation.gtf.gz"
) %>% 
  import.gff() %>% 
  split(.$type)
seqinfo(gtf) <- sq

## Black & grey-lists (From ENCODE & Input Sample)
bg_list <- file.path("data", "annotations", c("chr10_blacklist.bed", "chr10_greylist.bed")) %>% 
  lapply(import.bed, seqinfo = sq) %>% 
  lapply(granges) %>% 
  GRangesList() %>% 
  unlist() %>% 
  sort()

## Setup colours for consistent plotting
treat_colours <- c(E2 = "steelblue", E2DHT = "red3")
sample_colours <- setNames(
  treat_colours[bind_rows(samples)$treatment], 
  bind_rows(samples)$accession
)

Working With Peaks

Key Steps:

  1. Load peaks identified in individual samples (e.g. macs2 callpeak)
  1. Find overlapping regions from \(> p\times n\) replicates \(\implies\) consensus peaks
    • Usually formed within treatment groups
    • Merged across treatment groups

Peak callers often produce narrowPeak/broadPeak files

Importing Peaks

importPeaks(
  x,
  type = c("narrow", "broad"),
  blacklist,
  seqinfo,
  pruning.mode = c("coarse", "error"),
  sort = TRUE,
  setNames = TRUE,
  glueNames = "{basename(x)}",
  centre = FALSE,
  nameRanges = TRUE,
  ...
)
  • Import broad/narrowPeak files \(\implies\) GRangesList
  • Blacklisted regions can be omitted during parsing
  • Only chromosomes/scaffolds within the seqinfo object will be returned
  • Peak Centres can be automatically added for narrowPeak files

Importing Peaks

fl <- file.path("data", "ER", paste0(samples$ER$accession, "_peaks.narrowPeak"))
peaks <- importPeaks(fl, blacklist = bg_list, seqinfo = sq, centre = TRUE)
names(peaks) <- samples$ER$accession
peaks
GRangesList object of length 6:
$SRR8315180
GRanges object with 123 ranges and 6 metadata columns:
                      seqnames            ranges strand |     score signalValue
                         <Rle>         <IRanges>  <Rle> | <numeric>   <numeric>
  SRR8315180_peak_698    chr10 43048224-43048487      * |       251    13.91781
  SRR8315180_peak_699    chr10 43521883-43522126      * |       223    12.88686
  SRR8315180_peak_700    chr10 43540200-43540384      * |        58     6.18569
  SRR8315180_peak_701    chr10 43606308-43606517      * |        92     7.73212
  SRR8315180_peak_702    chr10 44092187-44092576      * |       251    13.91781
                  ...      ...               ...    ... .       ...         ...
  SRR8315180_peak_819    chr10 95796039-95796377      * |       434    18.67842
  SRR8315180_peak_820    chr10 96089595-96089749      * |        25     4.32096
  SRR8315180_peak_821    chr10 99097150-99097384      * |       196     9.02984
  SRR8315180_peak_822    chr10 99168353-99168578      * |        48     5.67022
  SRR8315180_peak_823    chr10 99331430-99331686      * |       210    10.70668
                         pValue    qValue      peak    centre
                      <numeric> <numeric> <numeric> <numeric>
  SRR8315180_peak_698  29.15643  25.19601       117  43048341
  SRR8315180_peak_699  26.25515  22.33688       125  43522008
  SRR8315180_peak_700   9.37883   5.87296        98  43540298
  SRR8315180_peak_701  12.90188   9.27326       110  43606418
  SRR8315180_peak_702  29.15643  25.19601       214  44092401
                  ...       ...       ...       ...       ...
  SRR8315180_peak_819  47.62759  43.44466       186  95796225
  SRR8315180_peak_820   5.88899   2.58967         6  96089601
  SRR8315180_peak_821  23.51470  19.64532       109  99097259
  SRR8315180_peak_822   8.26998   4.82385       153  99168506
  SRR8315180_peak_823  24.91666  21.02332       191  99331621
  -------
  seqinfo: 1 sequence from hg19 genome

...
<5 more elements>

Comparing Replicates

plotOverlaps(peaks, set_col = sample_colours)
  • Calls ComplexUpset::upset() if n \(\geq\) 3

makeConsensus

makeConsensus(peaks[1:3], var = "centre", p = 2/3)
GRanges object with 164 ranges and 5 metadata columns:
        seqnames            ranges strand |                     centre
           <Rle>         <IRanges>  <Rle> |              <NumericList>
    [1]    chr10 43048195-43048529      * | 43048341,43048357,43048389
    [2]    chr10 43521739-43522260      * | 43522008,43522012,43522039
    [3]    chr10 43540042-43540390      * |          43540298,43540245
    [4]    chr10 43606238-43606573      * | 43606418,43606408,43606421
    [5]    chr10 43851214-43851989      * |          43851672,43851766
    ...      ...               ...    ... .                        ...
  [160]    chr10 99096784-99097428      * | 99097259,99097253,99097249
  [161]    chr10 99168353-99168649      * | 99168506,99168486,99168513
  [162]    chr10 99207868-99208156      * |          99207995,99208002
  [163]    chr10 99331363-99331730      * | 99331621,99331585,99331580
  [164]    chr10 99621632-99621961      * |          99621809,99621828
        SRR8315180 SRR8315181 SRR8315182         n
         <logical>  <logical>  <logical> <numeric>
    [1]       TRUE       TRUE       TRUE         3
    [2]       TRUE       TRUE       TRUE         3
    [3]       TRUE      FALSE       TRUE         2
    [4]       TRUE       TRUE       TRUE         3
    [5]      FALSE       TRUE       TRUE         2
    ...        ...        ...        ...       ...
  [160]       TRUE       TRUE       TRUE         3
  [161]       TRUE       TRUE       TRUE         3
  [162]      FALSE       TRUE       TRUE         2
  [163]       TRUE       TRUE       TRUE         3
  [164]      FALSE       TRUE       TRUE         2
  -------
  seqinfo: 1 sequence from hg19 genome


  • All/any/no mcols can be returned

makeConsensus

  • By default, the union of overlapping ranges will be returned
    • between black lines


makeConsensus

  • Can also return only ranges covered by \(>p\times n\) samples
    • between red lines
  • method = "coverage"

Consensus Peaks

To find peaks identified in 2/3 samples in either condition:

treatment_peaks <- list(
    E2 = makeConsensus(peaks[1:3], var = "centre", p = 2/3),
    E2DHT = makeConsensus(peaks[4:6], var = "centre", p = 2/3)
) %>% 
    lapply(select, centre) %>% 
    GRangesList()
treatment_peaks
GRangesList object of length 2:
$E2
GRanges object with 164 ranges and 1 metadata column:
        seqnames            ranges strand |                     centre
           <Rle>         <IRanges>  <Rle> |              <NumericList>
    [1]    chr10 43048195-43048529      * | 43048341,43048357,43048389
    [2]    chr10 43521739-43522260      * | 43522008,43522012,43522039
    [3]    chr10 43540042-43540390      * |          43540298,43540245
    [4]    chr10 43606238-43606573      * | 43606418,43606408,43606421
    [5]    chr10 43851214-43851989      * |          43851672,43851766
    ...      ...               ...    ... .                        ...
  [160]    chr10 99096784-99097428      * | 99097259,99097253,99097249
  [161]    chr10 99168353-99168649      * | 99168506,99168486,99168513
  [162]    chr10 99207868-99208156      * |          99207995,99208002
  [163]    chr10 99331363-99331730      * | 99331621,99331585,99331580
  [164]    chr10 99621632-99621961      * |          99621809,99621828
  -------
  seqinfo: 1 sequence from hg19 genome

$E2DHT
GRanges object with 160 ranges and 1 metadata column:
        seqnames            ranges strand |                     centre
           <Rle>         <IRanges>  <Rle> |              <NumericList>
    [1]    chr10 43048235-43048520      * | 43048383,43048373,43048388
    [2]    chr10 43521795-43522191      * | 43521992,43522008,43522026
    [3]    chr10 43540202-43540457      * | 43540318,43540364,43540361
    [4]    chr10 43606184-43606572      * |          43606426,43606411
    [5]    chr10 44092131-44092563      * | 44092370,44092321,44092385
    ...      ...               ...    ... .                        ...
  [156]    chr10 99168349-99168690      * |          99168488,99168465
  [157]    chr10 99207862-99208157      * | 99207990,99207997,99208044
  [158]    chr10 99331323-99331741      * | 99331567,99331605,99331594
  [159]    chr10 99340107-99340354      * |          99340208,99340289
  [160]    chr10 99621638-99621988      * |          99621822,99621733
  -------
  seqinfo: 1 sequence from hg19 genome

Consensus Peaks

plotOverlaps(treatment_peaks, set_col = treat_colours)
  • Produces a VennDiagram when n = 2

Consensus Peaks

union_peaks <- makeConsensus(treatment_peaks, var = "centre") %>% 
  mutate(
    centre =  round(vapply(centre, mean, numeric(1)), 0),
    centre = paste0(seqnames, ":", centre) %>% GRanges(seqinfo = sq)
  ) %>% 
  select(centre)
union_peaks
GRanges object with 188 ranges and 1 metadata column:
        seqnames            ranges strand |         centre
           <Rle>         <IRanges>  <Rle> |      <GRanges>
    [1]    chr10 43048195-43048529      * | chr10:43048372
    [2]    chr10 43521739-43522260      * | chr10:43522014
    [3]    chr10 43540042-43540457      * | chr10:43540317
    [4]    chr10 43606184-43606573      * | chr10:43606417
    [5]    chr10 43851214-43851989      * | chr10:43851719
    ...      ...               ...    ... .            ...
  [184]    chr10 99168349-99168690      * | chr10:99168492
  [185]    chr10 99207862-99208157      * | chr10:99208006
  [186]    chr10 99331323-99331741      * | chr10:99331592
  [187]    chr10 99340107-99340354      * | chr10:99340248
  [188]    chr10 99621632-99621988      * | chr10:99621798
  -------
  seqinfo: 1 sequence from hg19 genome
  • Peak centres are the mean from all replicates with an overlapping peak

Mapping Peaks

  • After finding consensus/union peaks:
    • What type of regulatory regions?
    • What genes are likely targets?
  • defineRegions() uses gene, transcript and exon-level information
    • GRangesList of non-overlapping regions (promoter, intron etc)
    • Map peaks to these regions using bestOverlap()
  • mapByFeature() assigns peaks/ranges to most likely genes
    • Incorporates promoters, enhancers and long-range interactions

defineRegions

defineRegions(
  genes, transcripts, exons,
  promoter = c(2500, 500),
  upstream = 5000,
  intron = TRUE,
  proximal = 10000,
  simplify = FALSE,
  cols = c("gene_id", "gene_name", "transcript_id", "transcript_name"),
  ...
)

Unique Genomic Regions are defined hierarchically using a supplied gtf

  1. Promoters
  2. Upstream Promoters
  3. Exons
  4. Introns
  5. Proximal Intergenic
  6. Distal Intergenic

defineRegions

regions <- defineRegions(genes = gtf$gene, transcripts = gtf$transcript, exons = gtf$exon)
names(regions)
[1] "promoter"            "upstream_promoter"   "exon"               
[4] "intron"              "proximal_intergenic" "distal_intergenic"  
regions$promoter
GRanges object with 3359 ranges and 5 metadata columns:
         seqnames              ranges strand |                region
            <Rle>           <IRanges>  <Rle> |           <character>
     [1]    chr10         61989-64988      * | Promoter (-2500/+500)
     [2]    chr10         66360-69392      * | Promoter (-2500/+500)
     [3]    chr10         88152-91151      * | Promoter (-2500/+500)
     [4]    chr10         94710-97961      * | Promoter (-2500/+500)
     [5]    chr10       119604-122603      * | Promoter (-2500/+500)
     ...      ...                 ...    ... .                   ...
  [3355]    chr10 135488075-135491074      * | Promoter (-2500/+500)
  [3356]    chr10 135491384-135494383      * | Promoter (-2500/+500)
  [3357]    chr10 135494684-135497683      * | Promoter (-2500/+500)
  [3358]    chr10 135503135-135506134      * | Promoter (-2500/+500)
  [3359]    chr10 135513071-135518323      * | Promoter (-2500/+500)
                                                                 gene_id
                                                         <CharacterList>
     [1]                                            ENSG00000260370.2_11
     [2]  ENSG00000260370.2_11,ENSG00000260370.2_11,ENSG00000260370.2_11
     [3]                                             ENSG00000237297.1_7
     [4] ENSG00000261456.6_7,ENSG00000261456.6_7,ENSG00000261456.6_7,...
     [5]                                             ENSG00000261456.6_7
     ...                                                             ...
  [3355]                                             ENSG00000276046.1_5
  [3356]                                             ENSG00000276904.1_5
  [3357]                                             ENSG00000278641.1_5
  [3358]                                               ENSG00000226880.1
  [3359] ENSG00000213147.3_3,ENSG00000261914.2_3,ENSG00000282875.1_8,...
                                               gene_name
                                         <CharacterList>
     [1]                                 ENSG00000260370
     [2] ENSG00000260370,ENSG00000260370,ENSG00000260370
     [3]                                 ENSG00000237297
     [4]                           TUBB8,TUBB8,TUBB8,...
     [5]                                           TUBB8
     ...                                             ...
  [3355]                                         DUX4L13
  [3356]                                         DUX4L14
  [3357]                                         DUX4L15
  [3358]                                    XX-2136C48.7
  [3359]         RPL23AP60,RPL23AP84,ENSG00000282875,...
                                                           transcript_id
                                                         <CharacterList>
     [1]                                             ENST00000562162.2_4
     [2]     ENST00000682078.1_1,ENST00000566940.2_4,ENST00000683798.1_1
     [3]                                             ENST00000416477.1_2
     [4] ENST00000568584.6_6,ENST00000568866.5_4,ENST00000561967.1_4,...
     [5]                                             ENST00000564130.2_4
     ...                                                             ...
  [3355]                                             ENST00000554103.2_2
  [3356]                                             ENST00000621227.1_2
  [3357]                                             ENST00000611059.1_2
  [3358]                                               ENST00000411632.1
  [3359] ENST00000450011.1_2,ENST00000571193.2_2,ENST00000635574.1_4,...
                                         transcript_name
                                         <CharacterList>
     [1]                                 ENST00000562162
     [2] ENST00000682078,ENST00000566940,ENST00000683798
     [3]                                 ENST00000416477
     [4]               TUBB8-206,TUBB8-207,TUBB8-201,...
     [5]                                       TUBB8-204
     ...                                             ...
  [3355]                                     DUX4L13-201
  [3356]                                     DUX4L14-201
  [3357]                                     DUX4L15-201
  [3358]                                XX-2136C48.7-001
  [3359] RPL23AP60-201,RPL23AP84-201,ENST00000635574,...
  -------
  seqinfo: 1 sequence from hg19 genome

bestOverlap

region_levels <- vapply(regions, function(x) x$region[1], character(1))
union_peaks$region <- bestOverlap(union_peaks, unlist(regions), var = "region")
union_peaks$region <- factor(union_peaks$region, levels = region_levels)
union_peaks
GRanges object with 188 ranges and 2 metadata columns:
        seqnames            ranges strand |         centre
           <Rle>         <IRanges>  <Rle> |      <GRanges>
    [1]    chr10 43048195-43048529      * | chr10:43048372
    [2]    chr10 43521739-43522260      * | chr10:43522014
    [3]    chr10 43540042-43540457      * | chr10:43540317
    [4]    chr10 43606184-43606573      * | chr10:43606417
    [5]    chr10 43851214-43851989      * | chr10:43851719
    ...      ...               ...    ... .            ...
  [184]    chr10 99168349-99168690      * | chr10:99168492
  [185]    chr10 99207862-99208157      * | chr10:99208006
  [186]    chr10 99331323-99331741      * | chr10:99331592
  [187]    chr10 99340107-99340354      * | chr10:99340248
  [188]    chr10 99621632-99621988      * | chr10:99621798
                          region
                        <factor>
    [1]    Promoter (-2500/+500)
    [2]    Intergenic (>10kb)   
    [3]    Intergenic (>10kb)   
    [4]    Intron               
    [5]    Intergenic (<10kb)   
    ...                      ...
  [184] Intron                  
  [185] Promoter (-2500/+500)   
  [186] Promoter (-2500/+500)   
  [187] Upstream Promoter (-5kb)
  [188] Upstream Promoter (-5kb)
  -------
  seqinfo: 1 sequence from hg19 genome


Now we have peaks annotated to regions \(\implies\) plotPie()

plotPie

plotPie

plotPie(
  union_peaks, fill = "region", 
  min_p = 0.05, total_size = 4.5, cat_alpha = 0.7, cat_adj = 0.05,
  cat_glue = "{str_wrap(.data[[fill]], 15)}\nn = {n}\n({percent(p, 0.1)})"
) +  
  scale_fill_viridis_d(direction = -1)
  • Label text is highly customisable
    • Uses glue syntax \(\implies\) can include function calls and line-breaks
    • Totals available as n, proportions as p, categories as .data[[fill]]
  • Slices can also be scaled by width

mapByFeature

mapByFeature(
  gr,    # Ranges to be mapped
  genes, # Gene-annotations
  prom,  # Ranges annotated as promoters
  enh,   # Ranges annotated as enhancers
  gi,    # GInteractions object
  cols = c("gene_id", "gene_name", "symbol"), # Columns to return (if present)
  gr2prom = 0, gr2enh = 0, gr2gi = 0, # Distances between ranges & features
  gr2gene = 1e+05, prom2gene = 0, enh2gene = 1e+05, gi2gene = 0, ...
)

Ranges overlapping:

  1. a promoter \(\implies\) assigned to that gene
  2. an enhancer \(\implies\) all genes within a specified distance
  3. a long-range interaction \(\implies\) all genes connected by the interaction
  4. none-of the above:
    1. all overlapping genes or
    2. the nearest gene within a specified distance

mapByFeature

union_peaks <- mapByFeature(union_peaks, genes = gtf$gene, prom = regions$promoter)
union_peaks[1:5]
GRanges object with 5 ranges and 4 metadata columns:
      seqnames            ranges strand |         centre                region
         <Rle>         <IRanges>  <Rle> |      <GRanges>              <factor>
  [1]    chr10 43048195-43048529      * | chr10:43048372 Promoter (-2500/+500)
  [2]    chr10 43521739-43522260      * | chr10:43522014 Intergenic (>10kb)   
  [3]    chr10 43540042-43540457      * | chr10:43540317 Intergenic (>10kb)   
  [4]    chr10 43606184-43606573      * | chr10:43606417 Intron               
  [5]    chr10 43851214-43851989      * | chr10:43851719 Intergenic (<10kb)   
                                       gene_id       gene_name
                               <CharacterList> <CharacterList>
  [1] ENSG00000234420.9_10,ENSG00000270762.1_3 ZNF37BP,FXYD6P1
  [2]                        ENSG00000263795.1         MIR5100
  [3]                    ENSG00000165731.21_15             RET
  [4]                    ENSG00000165731.21_15             RET
  [5]                      ENSG00000285712.1_7 ENSG00000285712
  -------
  seqinfo: 1 sequence from hg19 genome

Differential Signal Analysis

Two Approaches to Differential Signal Analysis

Highly comparable to DiffBind

  1. Define a set of fixed-width ranges:
    • Centres from importPeaks() \(\implies\) makeConsensus()
    • Resize to fixed-width
  2. Count Reads using csaw::regionCounts() + BamFileList
    • Returns a RangedSummarizedExperiment
  3. Perform analysis using fitAssayDiff()

Extends csaw approaches

  1. Define window size & step
  2. Count reads using csaw::windowCounts() + BamFileList
    • Returns a RangedSummarizedExperiment
  3. Filter windows for those likely containing signal: dualFilter()
    • Uses a set of ‘guide ranges’ to set filtering thresholds
  4. Fit windows using fitAssayDiff()
  5. Merge neighbouring windows for final results

Differential Signal Analysis

  • Visualisation functions after counting reads
    • plotAssayPCA, plotAssayRle, plotAssayDensities
  • Differential Signal Analysis: fitAssayDiff()
    • edgeR::glmQLF() or limma-trend
  • Sliding window analysis \(\implies\) merge neighboring windows
    • mergeByHMP() uses harmonic-mean p (Wilson 2019)
    • mergeByCol(), mergeBySig(): wraps csaw methods
    • All return the relevant underlying windows (keyval_range)

Fixed-Width Windows

library(Rsamtools)
library(csaw)
## Centre & resize the peaks
centred_peaks <- union_peaks %>% 
  mutate(union_peak = granges(.)) %>%  # Add the original ranges to mcols
  colToRanges(var = "centre") %>% # Move the centred range to the 'backbone'
  resize(width = 400, fix = "center") # Resize to a fixed with
## Define the BamFileList
er_bfl <- file.path("data", "ER", paste0(samples$ER$accession, ".bam")) %>% 
  BamFileList() %>% 
  setNames(samples$ER$accession)
## Count Reads
er_se <- regionCounts(er_bfl, centred_peaks, ext = 200)
seqinfo(er_se) <- sq
## Re-assign union peaks to core ranges after counting
rowRanges(er_se) <- rowRanges(er_se) %>% 
  colToRanges(var = "union_peak") %>% 
  select(region, starts_with("gene"))
## Add all sample annotations
colData(er_se) <- colData(er_se) %>% 
  as_tibble(rownames = "accession") %>% 
  left_join(samples$ER) %>% 
  as("DataFrame")

Visualising Counts

plotAssayDensities(er_se, trans = "log1p", colour = "treatment") +
  scale_colour_manual(values = treat_colours)

plotAssayPCA(er_se, trans = "log1p", colour = "treatment", label = "accession") +
  scale_colour_manual(values = treat_colours)

RLE Plots

plotAssayRle(er_se, trans = "log1p", fill = "treatment") +
  scale_fill_manual(values = treat_colours)

plotAssayRle(er_se, trans = "log1p", fill = "treatment", by_x = "treatment") +
  scale_fill_manual(values = treat_colours)

Sliding Windows

## Define the BamFile List INCLUDING the Input
acc <- c(samples$H3K27ac$accession, samples$H3K27ac$input) %>% 
  unique()
h3k_bfl <- file.path("data", "H3K27ac", paste0(acc, ".bam")) %>% 
  BamFileList() %>% 
  setNames(acc)
## Setup the parameters for windows & counts
rp <- readParam(pe = "none", restrict = "chr10", discard = bg_list)
win_size <- 150
win_step <- 50
## Count reads for all windows
wincounts <- windowCounts(
  bam.files = h3k_bfl,
  spacing = win_step, width = win_size, ext = 200,
  filter = length(h3k_bfl), param = rp
)
dim(wincounts)
[1] 467904      7

Filter Sliding Windows

  • Retain windows which are most likely to contain signal
## Load in the pre-defined ranges
guide_ranges <- file.path("data", "H3K27ac", "H3K27ac_chr10.bed") %>% 
  import.bed(seqinfo = sq)
## Filter using the guide ranges
filtcounts <- dualFilter(
  x = wincounts[, samples$H3K27ac$accession],
  bg  =  wincounts[, unique(samples$H3K27ac$input)],
  ref = guide_ranges, q = 0.6
)
## Update the colData
colData(filtcounts) <- colData(filtcounts) %>% 
  as_tibble(rownames = "accession") %>% 
  left_join(samples$H3K27ac) %>% 
  as("DataFrame")
dim(filtcounts)
[1] 19017     6
assays(filtcounts)
List of length 2
names(2): counts logCPM

fitAssayDiff()

fitAssayDiff(
  x, assay = "counts",
  design = NULL, coef = NULL,
  lib.size = "totals",
  method = c("qlf", "lt"),
  norm = c("none", "TMM", "RLE", "TMMwsp", "upperquartile"),
  groups = NULL,
  fc = 1, lfc = log2(fc),
  asRanges = FALSE,
  offset = NULL,
  null = c("interval", "worst.case"),
  weighted = FALSE,
  ...
)

Fixed-Width Windows

## Setup the design matrix
X <- model.matrix(~treatment, data = colData(er_se))
## Fit the model
er_results <- fitAssayDiff(
  er_se, design = X, norm = "TMM", asRanges = TRUE, fc = 1.2
) %>% 
  addDiffStatus()
## Print the significant ranges
er_results %>% 
  as_tibble() %>% 
  dplyr::filter(FDR < 0.05) %>% 
  dplyr::select(-gene_id) %>% 
  mutate(
    gene_name = vapply(gene_name, paste, character(1), collapse = "; ")
  ) %>% 
  arrange(PValue) 
# A tibble: 7 × 8
  range                   region  gene_name logFC logCPM   PValue     FDR status
  <chr>                   <fct>   <chr>     <dbl>  <dbl>    <dbl>   <dbl> <fct> 
1 chr10:81101906-81102928 Upstre… PPIF      1.91    8.03 4.43e-11 8.34e-9 Incre…
2 chr10:79629641-79630271 Promot… DLG5      0.967   8.23 4.20e- 6 3.95e-4 Incre…
3 chr10:89407752-89408138 Intron  ENSG0000… 1.71    6.20 1.08e- 5 6.79e-4 Incre…
4 chr10:52233596-52233998 Exon    SGMS1; E… 1.13    6.70 2.37e- 4 1.11e-2 Incre…
5 chr10:51547205-51547782 Promot… MSMB      0.665   7.44 1.12e- 3 4.22e-2 Incre…
6 chr10:79266991-79267392 Intron  KCNMA1    1.24    5.56 1.58e- 3 4.52e-2 Incre…
7 chr10:71879941-71880280 Promot… AIFM2     1.20    5.68 1.68e- 3 4.52e-2 Incre…

plotSplitDonut

plotSplitDonut

## Define inner & outer palettes
region_colours <- hcl.colors(length(region_levels), "viridis", rev = TRUE)
names(region_colours) <- region_levels
status_colours <- c(
  Increased = "red3", Decreased = "royalblue3", Unchanged = "grey40",
  Undetected = "grey70"
)
plotSplitDonut(
  er_results, inner = "region", outer = "status",
  inner_palette = region_colours, outer_palette = status_colours,
  inner_glue = "{str_wrap(.data[[inner]],15)}\nn = {n}", inner_label_alpha = 0.8,
  outer_glue = "{str_wrap(.data[[inner]],15)}\nER {.data[[outer]]}\nn = {n}",
  outer_max_p = 0.03, outer_min_p = 0,
  explode_outer = "Increased", explode_r = 0.2
)
  • Can set inner and outer palettes separately
  • Labels are customisable using glue
  • Explode segments based regex queries

Sliding Windows

  • Involves the additional step of merging sliding windows
X <- model.matrix(~treatment, data = colData(filtcounts))
h3k_all <- fitAssayDiff(filtcounts, norm = "TMM", design = X, asRanges = TRUE)
h3k_results <- mergeByHMP(
  h3k_all, min_win = 2, merge_within = win_size + 1, keyval = "merged"
) %>% 
  addDiffStatus()
arrange(h3k_results, hmp)[1:5]
GRanges object with 5 ranges and 9 metadata columns:
      seqnames            ranges strand | n_windows      n_up    n_down
         <Rle>         <IRanges>  <Rle> | <integer> <integer> <integer>
  [1]    chr10 79256601-79258700      * |        35         4         0
  [2]    chr10 43689201-43690150      * |        17         1         0
  [3]    chr10 74008151-74008750      * |        10         2         0
  [4]    chr10 63859401-63859750      * |         5         1         0
  [5]    chr10 79623301-79635950      * |       251         7         0
                 keyval_range    logCPM     logFC         hmp     hmp_fdr
                    <GRanges> <numeric> <numeric>   <numeric>   <numeric>
  [1] chr10:79257751-79258050   7.12950  1.693320 5.67542e-15 3.64930e-12
  [2] chr10:43689651-43689800   6.69278  1.866274 2.74260e-12 8.81747e-10
  [3] chr10:74008301-74008500   6.54608  1.735779 2.67001e-10 5.72273e-08
  [4] chr10:63859601-63859750   6.03721  1.848069 5.65428e-08 7.76644e-06
  [5] chr10:79630151-79630650   7.87510  0.905969 6.03922e-08 7.76644e-06
         status
       <factor>
  [1] Increased
  [2] Increased
  [3] Increased
  [4] Increased
  [5] Increased
  -------
  seqinfo: 1 sequence from hg19 genome

Map Results

  • Map ranges after analysis
h3k_results <- h3k_results %>% 
  mutate(
    region = bestOverlap(.$keyval_range, unlist(regions), var = "region")
  ) %>% 
  mapByFeature(genes = gtf$gene, prom = regions$promoter)
arrange(h3k_results, hmp)[1:5]
GRanges object with 5 ranges and 12 metadata columns:
      seqnames            ranges strand | n_windows      n_up    n_down
         <Rle>         <IRanges>  <Rle> | <integer> <integer> <integer>
  [1]    chr10 79256601-79258700      * |        35         4         0
  [2]    chr10 43689201-43690150      * |        17         1         0
  [3]    chr10 74008151-74008750      * |        10         2         0
  [4]    chr10 63859401-63859750      * |         5         1         0
  [5]    chr10 79623301-79635950      * |       251         7         0
                 keyval_range    logCPM     logFC         hmp     hmp_fdr
                    <GRanges> <numeric> <numeric>   <numeric>   <numeric>
  [1] chr10:79257751-79258050   7.12950  1.693320 5.67542e-15 3.64930e-12
  [2] chr10:43689651-43689800   6.69278  1.866274 2.74260e-12 8.81747e-10
  [3] chr10:74008301-74008500   6.54608  1.735779 2.67001e-10 5.72273e-08
  [4] chr10:63859601-63859750   6.03721  1.848069 5.65428e-08 7.76644e-06
  [5] chr10:79630151-79630650   7.87510  0.905969 6.03922e-08 7.76644e-06
         status                region                                   gene_id
       <factor>           <character>                           <CharacterList>
  [1] Increased                Intron                     ENSG00000156113.25_17
  [2] Increased    Intergenic (<10kb)                     ENSG00000198915.12_14
  [3] Increased    Intergenic (>10kb)                       ENSG00000166295.9_6
  [4] Increased    Intergenic (<10kb)                     ENSG00000150347.17_12
  [5] Increased Promoter (-2500/+500) ENSG00000204049.1_8,ENSG00000151208.17_11
                 gene_name
           <CharacterList>
  [1]               KCNMA1
  [2]             RASGEF1A
  [3]              ANAPC16
  [4]               ARID5B
  [5] ENSG00000204049,DLG5
  -------
  seqinfo: 1 sequence from hg19 genome

Comparing Results

Comparing Two Targets

  • Check where change is found in both/either/neither
    • mapGrlCols()
  • Plot combined status
    • plotSplitDonut()
  • Compare binding patterns:
    • Both signal (logCPM) and change (logFC)
    • plotPairwise()
  • Inspect Genomic Regions of interest
    • plotHFGC()

Comparing Two Targets

  • Comparison functions utilise GRangesList objects
  • Require shared column names
all_results <- GRangesList(
  ER = er_results,
  H3K27ac = h3k_results %>% 
    select(
      region, starts_with("gene"), starts_with("log"), 
      PValue = hmp, FDR = hmp_fdr, status
    )
)

mapGrlCols

  • Columns from each element can be mapped
    • Columns can be combined and collapsed into a single column
  • Viable for a named GRangesList of any size
all_results %>% 
  mapGrlCols(var = c("logFC", "status"), collapse = "gene_name") 
GRanges object with 741 ranges and 5 metadata columns:
        seqnames            ranges strand |  ER_logFC H3K27ac_logFC ER_status
           <Rle>         <IRanges>  <Rle> | <numeric>     <numeric>  <factor>
    [1]    chr10 42862951-42863350      * |        NA     0.3345653 NA       
    [2]    chr10 43047801-43048800      * | -0.301091    -0.0780895 Unchanged
    [3]    chr10 43047801-43048800      * | -0.301091    -0.2619553 Unchanged
    [4]    chr10 43132901-43134800      * |        NA     0.1350775 NA       
    [5]    chr10 43135701-43137250      * |        NA     0.0791231 NA       
    ...      ...               ...    ... .       ...           ...       ...
  [737]    chr10 99496401-99497850      * |        NA     0.0551099 NA       
  [738]    chr10 99608301-99610050      * |        NA    -0.5008959 NA       
  [739]    chr10 99621632-99621988      * |  0.115139            NA Unchanged
  [740]    chr10 99628751-99629400      * |        NA    -0.2298606 NA       
  [741]    chr10 99894101-99895400      * |        NA    -0.1143540 NA       
        H3K27ac_status              gene_name
              <factor>            <character>
    [1]      Unchanged        ENSG00000291065
    [2]      Unchanged       FXYD6P1; ZNF37BP
    [3]      Unchanged       FXYD6P1; ZNF37BP
    [4]      Unchanged ENSG00000285884; ZNF..
    [5]      Unchanged ENSG00000285884; ZNF..
    ...            ...                    ...
  [737]      Unchanged                ZFYVE27
  [738]      Unchanged    GOLGA7B; GOLGA7B-DT
  [739]      NA                       GOLGA7B
  [740]      Unchanged        CRTAC1; GOLGA7B
  [741]      Unchanged                R3HCC1L
  -------
  seqinfo: 1 sequence from hg19 genome

mapGrlCols

  • Columns from each element can be mapped
    • Columns can be combined and collapsed into a single column
  • Makes finding sites changed for both comparisons simple
all_results %>% 
  mapGrlCols(var = c("logFC", "status"), collapse = "gene_name") %>% 
  filter(
    str_detect(ER_status, "(In|De)creased"), str_detect(H3K27ac_status, "(In|De)creased")
  ) 
GRanges object with 3 ranges and 5 metadata columns:
      seqnames            ranges strand |  ER_logFC H3K27ac_logFC ER_status
         <Rle>         <IRanges>  <Rle> | <numeric>     <numeric>  <factor>
  [1]    chr10 71879101-71881150      * |   1.20340      0.810052 Increased
  [2]    chr10 79623301-79635950      * |   0.96745      0.905969 Increased
  [3]    chr10 81101651-81103150      * |   1.90628      1.577851 Increased
      H3K27ac_status             gene_name
            <factor>           <character>
  [1]      Increased                 AIFM2
  [2]      Increased DLG5; ENSG00000204049
  [3]      Increased                  PPIF
  -------
  seqinfo: 1 sequence from hg19 genome

plotPairwise: logCPM

plotPairwise(all_results, var = "logCPM")

plotPairwise: logFC

plotPairwise: logFC

plotPairwise(
  all_results, var = "logFC", colour = "status", label = "gene_name"
) +
  scale_colour_manual(
    values = c("lightskyblue", "palevioletred", "grey", "darkred")
  ) +
  scale_fill_manual(
    values = c(
      setNames(status_colours, paste("ER", names(status_colours))),
      setNames(status_colours, paste("H3K27ac", names(status_colours)))
    )
  )
  • Colouring by status can be informative
  • The furthest point from zero (per group) can be labelled
  • Point colours and boxplot fills can be set using ggplot2 methods

plotHFGC

  • Plots multiple tracks using Gviz (Hahne and Ivanek 2016)
    • HiC (or other interactions)
    • Features of any type
    • Gene models
    • Coverage
  • All tracks are optional
  • Cytogenetic bands for GRCh37/38 are included in the package
data("grch37.cytobands")

plotHFGC

HiC tracks

  • GInteractions object

Feature tracks

  • A named GRangesList \(\implies\) single track coloured by element
    • Output from defineRegions()
  • A named list of (named) GRangesList objects \(\implies\) multiple tracks

Gene Tracks

  • GRanges object for a single track
    • Easily formed from exon-level annotations
  • Named GRangesList will draw multiple tracks (Up, Down etc.)

plotHFGC

Coverage tracks

  • A BigWigFileList \(\implies\) each file as a separate track
  • A list of BigwigFileList objects
    • Will overlay files within each BigWigFileList
    • Each list element on a separate track
  • Can also annotate coverage tracks (e.g. Increased, Decreased etc)

plotHFGC

plotHFGC

gr <- all_results %>% 
  mapGrlCols(var = "status", collapse = "gene_name") %>% 
  filter(
    ER_status %in% c("Increased", "Decreased"), 
    H3K27ac_status %in% c("Increased", "Decreased"), 
    gene_name == "PPIF"
  )
plotHFGC(
  gr, cytobands = grch37.cytobands,
  features = regions, featcol = region_colours,
  genes = gene_models,
  coverage = cov_bwfl, linecol = line_col, 
  annotation = cov_annot, annotcol = status_colours,
  zoom = 20
)
  • Once all objects defined:
    • Inspect multiple ranges quickly & easily
    • Zoom in/out as needed

References

Chen, Yunshun, Aaron A T Lun, and Gordon K Smyth. 2016. “From Reads to Genes to Pathways: Differential Expression Analysis of RNA-Seq Experiments Using Rsubread and the edgeR Quasi-Likelihood Pipeline.” F1000Research 5: 1438. https://doi.org/10.12688/f1000research.8987.2.
Hahne, Florian, and Robert Ivanek. 2016. “Statistical Genomics: Methods and Protocols.” In, edited by Ewy Mathé and Sean Davis, 335–51. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4939-3578-9_16.
Hickey, Theresa E, Luke A Selth, Kee Ming Chia, Geraldine Laven-Law, Heloisa H Milioli, Daniel Roden, Shalini Jindal, et al. 2021. “The Androgen Receptor Is a Tumor Suppressor in Estrogen Receptor-Positive Breast Cancer.” Nat. Med. 27 (2): 310–20.
Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. 2014. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-seq Read Counts.” Genome Biol. 15 (2): R29.
Lee, Stuart, Cook, Dianne, Lawrence, and Michael. 2019. “Plyranges: A Grammar of Genomic Data Transformation.” Genome Biol. 20 (1): 4. http://dx.doi.org/10.1186/s13059-018-1597-8.
Lun, Aaron T L, and Gordon K Smyth. 2016. “Csaw: A Bioconductor Package for Differential Binding Analysis of ChIP-Seq Data Using Sliding Windows.” Nucleic Acids Res. 44 (5): e45.
McCarthy, Davis J, and Gordon K Smyth. 2009. “Testing Significance Relative to a Fold-Change Threshold Is a TREAT.” Bioinformatics 25 (6): 765–71.
Ross-Innes, Caryn S., Rory Stark, Andrew E. Teschendorff, Kelly A. Holmes, H. Raza Ali, Mark J. Dunning, Gordon D. Brown, et al. 2012. “Differential Oestrogen Receptor Binding Is Associated with Clinical Outcome in Breast Cancer.” Nature 481: –4. http://www.nature.com/nature/journal/v481/n7381/full/nature10730.html.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wilson, Daniel J. 2019. “The Harmonic Mean p-Value for Combining Dependent Tests.” Proc. Natl. Acad. Sci. U. S. A. 116 (4): 1195–1200.