extraChIPs: Going Beyond GRAVI

Stephen (Stevie) Pederson

Black Ochre Data Laboratories, Telethon Kids Institute

March 17, 2023

Who Am I

Professionally

  • 2022- Post-doctoral Bioinformatician: Black Ochre Data Laboratories
  • 2020-2022 Dame Roma Mitchell Cancer Research Laboratories
  • 2014-2020 Bioinformatics Hub, University of Adelaide
  • 2008-2018 PhD Candidate, Barry Immunology Group
  • 1991-2008 Musician

Research Interests

  • Transcriptomics, Gene Regulation, Applied Statistics
  • R programming, Bioconductor Nerd

Bioconductor Packages

  1. ngsReports  
    • Parse & plot output from FastQC, cutadapt, STAR, macs2 etc.
  2. extraChIPs  
    • ChIP-Seq Analysis & Visualisation
    • Manipulation of GRanges objects
    • Built as infrastructure for the GRAVI workflow

But first…

Source: https://www.dailymail.co.uk/news/article-9744517/Elephant-smashed-wall-steal-snack-squeezes-family-home-eat-cat-food.html

extraChIPs

extraChIPs

  1. A walk-through of the package
    • Usage well beyond ChIP-Seq data
  2. Approaches to detection of Differential ChIP-Seq Signal
  • extraChIPs currently up on Bioconductor
  • Latest version is currently only on github
  • Bioconductor 3.17 Release on 26th April
    • R 4.3.0 on Friday 21st April

Utility Functions

Utility Functions

  • Genomic Ranges objects (GRanges) are analogous to bed files
  • Most functions for GRanges focus on the ranges component
  • My focus was on the data within the mcols() element
  • How can we play nicely with the tidyverse?
    • Not treading on the toes of plyranges
    • \(\implies\) enhancing approaches implemented by plyranges

tibble Coercion

## S3 method for class 'DataFrame'
as_tibble(x, rangeAsChar = TRUE, ...)

## S3 method for class 'GenomicRanges'
as_tibble(x, rangeAsChar = TRUE, name = "range", ...)

## S3 method for class 'Seqinfo'
as_tibble(x, ...)

## S3 method for class 'GInteractions'
as_tibble(x, rangeAsChar = TRUE, suffix = c(".x", ".y"), ...)


  • Very handy for interacting with ggplot2 and dplyr etc
  • Handles S4 Compressed list columns well (so far)
    • Uses vctrs::vec_proxy() to coerce to S3 lists
  • Reverse coercion also implemented

tibble Coercion

Starting with the protein-coding transcripts for CTLA4

gr
GRanges object with 4 ranges and 2 metadata columns:
      seqnames              ranges strand |   gene_name transcript_name
         <Rle>           <IRanges>  <Rle> | <character>     <character>
  [1]     chr2 204732494-204738688      + |       CTLA4       CTLA4-205
  [2]     chr2 204732666-204737498      + |       CTLA4       CTLA4-201
  [3]     chr2 204732666-204737535      + |       CTLA4       CTLA4-204
  [4]     chr2 204732666-204737535      + |       CTLA4       CTLA4-203
  -------
  seqinfo: 24 sequences from GRCh37 genome
sq <- seqinfo(gr)
sq
Seqinfo object with 24 sequences from GRCh37 genome:
  seqnames seqlengths isCircular genome
  chr1      249250621      FALSE GRCh37
  chr2      243199373      FALSE GRCh37
  chr3      198022430      FALSE GRCh37
  chr4      191154276      FALSE GRCh37
  chr5      180915260      FALSE GRCh37
  ...             ...        ...    ...
  chr20      63025520      FALSE GRCh37
  chr21      48129895      FALSE GRCh37
  chr22      51304566      FALSE GRCh37
  chrX      155270560      FALSE GRCh37
  chrY       59373566      FALSE GRCh37

tibble Coercion

Now perform the coercion using as_tibble()

tbl <- as_tibble(gr)
tbl
# A tibble: 4 × 3
  range                      gene_name transcript_name
  <chr>                      <chr>     <chr>          
1 chr2:204732494-204738688:+ CTLA4     CTLA4-205      
2 chr2:204732666-204737498:+ CTLA4     CTLA4-201      
3 chr2:204732666-204737535:+ CTLA4     CTLA4-204      
4 chr2:204732666-204737535:+ CTLA4     CTLA4-203      

Coerce back to a GRanges uses colToRanges()

colToRanges(tbl, var = "range", seqinfo = sq)
GRanges object with 4 ranges and 2 metadata columns:
      seqnames              ranges strand |   gene_name transcript_name
         <Rle>           <IRanges>  <Rle> | <character>     <character>
  [1]     chr2 204732494-204738688      + |       CTLA4       CTLA4-205
  [2]     chr2 204732666-204737498      + |       CTLA4       CTLA4-201
  [3]     chr2 204732666-204737535      + |       CTLA4       CTLA4-204
  [4]     chr2 204732666-204737535      + |       CTLA4       CTLA4-203
  -------
  seqinfo: 24 sequences from GRCh37 genome

*MC() Functions

  • Many GRanges functions just operate on the range
    • reduce, intersect, setdiff, union
  • extraChIPs variants also return the mcols element
    • reduceMC, intersectMC, setdiffMC, unionMC
    • distinctMC and chopMC replicate tidyverse functions
  • There is a performance overhead… 😞

reduceMC()

Say we wish to find the TSS for CTLA transcripts

gr %>% 
  resize(fix = 'start', width = 1) 
GRanges object with 4 ranges and 2 metadata columns:
      seqnames    ranges strand |   gene_name transcript_name
         <Rle> <IRanges>  <Rle> | <character>     <character>
  [1]     chr2 204732494      + |       CTLA4       CTLA4-205
  [2]     chr2 204732666      + |       CTLA4       CTLA4-201
  [3]     chr2 204732666      + |       CTLA4       CTLA4-204
  [4]     chr2 204732666      + |       CTLA4       CTLA4-203
  -------
  seqinfo: 24 sequences from GRCh37 genome


reduce will only return the ranges

gr %>% 
  resize(fix = 'start', width = 1) %>% 
  GenomicRanges::reduce()
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2 204732494      +
  [2]     chr2 204732666      +
  -------
  seqinfo: 24 sequences from GRCh37 genome

reduceMC()

Say we wish to find the TSS for CTLA transcripts

gr %>% 
  resize(fix = 'start', width = 1) 
GRanges object with 4 ranges and 2 metadata columns:
      seqnames    ranges strand |   gene_name transcript_name
         <Rle> <IRanges>  <Rle> | <character>     <character>
  [1]     chr2 204732494      + |       CTLA4       CTLA4-205
  [2]     chr2 204732666      + |       CTLA4       CTLA4-201
  [3]     chr2 204732666      + |       CTLA4       CTLA4-204
  [4]     chr2 204732666      + |       CTLA4       CTLA4-203
  -------
  seqinfo: 24 sequences from GRCh37 genome


reduceMC() returns the mcols as well

gr %>% 
  resize(fix = 'start', width = 1) %>% 
  reduceMC()
GRanges object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |   gene_name               transcript_name
         <Rle> <IRanges>  <Rle> | <character>               <CharacterList>
  [1]     chr2 204732494      + |       CTLA4                     CTLA4-205
  [2]     chr2 204732666      + |       CTLA4 CTLA4-201,CTLA4-204,CTLA4-203
  -------
  seqinfo: 24 sequences from GRCh37 genome

intersectMC()

  • Find the intersecting range for a ChIP-Seq peak
peak
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chr2 204732401-204732600      *
  -------
  seqinfo: 24 sequences from GRCh37 genome
intersect(gr, peak, ignore.strand = TRUE)
GRanges object with 1 range and 0 metadata columns:
      seqnames              ranges strand
         <Rle>           <IRanges>  <Rle>
  [1]     chr2 204732494-204732600      *
  -------
  seqinfo: 24 sequences from GRCh37 genome
intersectMC(gr, peak, ignore.strand = TRUE)
GRanges object with 1 range and 2 metadata columns:
      seqnames              ranges strand |   gene_name transcript_name
         <Rle>           <IRanges>  <Rle> | <character>     <character>
  [1]     chr2 204732494-204732600      * |       CTLA4       CTLA4-205
  -------
  seqinfo: 24 sequences from GRCh37 genome
  • NB: subsetByOverlaps() would return the entire initial range
  • Only the mcols from the query range are returned

setdiffMC()

  • Here we can chop out ranges
  • Find the region of CTLA4 not covered by the 3 shorter transcripts
gr
GRanges object with 4 ranges and 2 metadata columns:
      seqnames              ranges strand |   gene_name transcript_name
         <Rle>           <IRanges>  <Rle> | <character>     <character>
  [1]     chr2 204732494-204738688      + |       CTLA4       CTLA4-205
  [2]     chr2 204732666-204737498      + |       CTLA4       CTLA4-201
  [3]     chr2 204732666-204737535      + |       CTLA4       CTLA4-204
  [4]     chr2 204732666-204737535      + |       CTLA4       CTLA4-203
  -------
  seqinfo: 24 sequences from GRCh37 genome


setdiffMC(gr[1], gr[2:4])
GRanges object with 2 ranges and 2 metadata columns:
      seqnames              ranges strand |   gene_name transcript_name
         <Rle>           <IRanges>  <Rle> | <character>     <character>
  [1]     chr2 204732494-204732665      + |       CTLA4       CTLA4-205
  [2]     chr2 204737536-204738688      + |       CTLA4       CTLA4-205
  -------
  seqinfo: 24 sequences from GRCh37 genome


  • Again, we retain the mcols information

distinctMC()

  • Very handy after a plyranges::join_*() with multiple matches
meth
GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand | prop_methylated
         <Rle> <IRanges>  <Rle> |       <numeric>
  [1]     chr2 204732499      * |            0.20
  [2]     chr2 204732562      * |            0.99
  -------
  seqinfo: 24 sequences from GRCh37 genome
  • The join will expand to both methylated sites
peak %>% 
  plyranges::join_overlap_left(meth) 
GRanges object with 2 ranges and 1 metadata column:
      seqnames              ranges strand | prop_methylated
         <Rle>           <IRanges>  <Rle> |       <numeric>
  [1]     chr2 204732401-204732600      * |            0.20
  [2]     chr2 204732401-204732600      * |            0.99
  -------
  seqinfo: 24 sequences from GRCh37 genome

distinctMC()

  • Very handy after a plyranges::join_*() with multiple matches
meth
GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand | prop_methylated
         <Rle> <IRanges>  <Rle> |       <numeric>
  [1]     chr2 204732499      * |            0.20
  [2]     chr2 204732562      * |            0.99
  -------
  seqinfo: 24 sequences from GRCh37 genome
  • We can just keep the first mapping after sorting by some parameter
peak %>% 
  plyranges::join_overlap_left(meth) %>% 
  arrange(1/(prop_methylated)) %>% 
  distinctMC(.keep_all = TRUE)
GRanges object with 1 range and 1 metadata column:
      seqnames              ranges strand | prop_methylated
         <Rle>           <IRanges>  <Rle> |       <numeric>
  [1]     chr2 204732401-204732600      * |            0.99
  -------
  seqinfo: 24 sequences from GRCh37 genome

chopMC()

  • Wraps tidyr::chop()
    • Chops by any specified column, always including the range
    • Could then run vapply() on the column to summarise
peak %>% 
  plyranges::join_overlap_left(meth) %>% 
  chopMC()
GRanges object with 1 range and 1 metadata column:
      seqnames              ranges strand | prop_methylated
         <Rle>           <IRanges>  <Rle> |   <NumericList>
  [1]     chr2 204732401-204732600      * |       0.20,0.99
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths


Thank goodness we’re not from New Zealand…

Visualisation

Visualisations

  • extraChIPs was written for the GRAVI workflow
  • Added functions for common visualisations in the context of ChIP-Seq
  • Commonly use a SummarizedExperiment object with multiple ‘assays’
    • Usually have a rowRanges element
  • Most plotting functions built for GRanges / data.frame-like objects

Visualisations

  • plotAssay*() for inspection of pre-analysed data
    • plotAssayDensities(), plotAssayPCA(), plotAssayRle()
  • plotOverlaps(), plotPie() and plotSplitDonut()
    • Compare sets of ranges using Venn/UpSet, Pie or Donut charts
  • plotProfileHeatmap()
    • Visualise binding intensities at multiple sites/samples
  • plotHFGC(): plot HiC, Features, Genes, Coverage
    • All elements are optional
    • Define your objects \(\implies\) keep running across multiple ranges

plotProfileHeatmap()

  1. Provide a set of ranges
  2. Centre & standardise width
  3. Read in data from BigWigFileList
    • getProfileData(bwfl, gr)
  4. Plot data
  • Uses ggplot2 for easy customisation
    • Top panel plots using ggside
  • Still frustratingly slow… 😞

plotHFCG()

  • Wraps plotting functions from Gviz
  • HiC as GenomicInteractions object
  • Each Feature track as a GRanges within a named list
  • Each Gene Track as a GRanges within a named list
  • Each coverage track is a BigWigFileList
    • Pass a list of BigWigFileList objects for multiple tracks
    • Coverage tracks can also have GRanges to annotate them

plotHFGC()

Here I’m checking binding for a DE gene (C1orf116)…

plotPie()

plotPie(consensus_peaks, fill = "region") +
  scale_fill_manual(values = colours$regions) +
  theme(legend.position = "none")

AR binding sites mapped to genomic regions

plotSplitDonut()

  • Compares two columns from a GRanges or data.frame object
  • Can set inner & outer palettes separately
  • Huge flexibility in labelling
  • Uses patchwork to drive the layout
  • Getting legend positions right is still a bit tricky (i.e. hacky…)

plotSplitDonut()

plotSplitDonut(

  object, inner, outer, scale_by = NULL,
  r_centre = 0.5, r_inner = 1, r_outer = 1,
  
  total_size = 5, total_glue = "{comma(N)}",
  
  inner_glue = "{inner} {.data[[inner]]}\n{percent(p,0.1)}",
  outer_glue = "{outer} {.data[[outer]]}\n{percent(p,0.1)}",
  inner_label = c("label", "text", "none"),
  outer_label = c("label", "text", "none"),
  
  label_alpha = 1, inner_label_alpha = NULL, outer_label_alpha = NULL,
  label_size = 3, inner_label_size = NULL, outer_label_size = NULL,
  min_p = 0.05, inner_min_p = NULL, outer_min_p = NULL,
  
  explode_inner = NULL, explode_outer = NULL, explode_query = c("AND", "OR"),
  explode_x = 0, explode_y = 0, explode_r = 0, nudge_r = 0.5,
  
  expand = 0.1,
  inner_palette = NULL, outer_palette = NULL,
  inner_legend = TRUE, outer_legend = TRUE,
  
  layout = c(main = area(1, 1, 6, 6), lg1 = area(2, 7), lg2 = area(4, 7)), ...
)
  • Need to add label & tweak how label size is set globally

plotSplitDonut()

target <- "GATA3
merged_results %>% 
  plotSplitDonut(
    inner = target, outer = "Region", min_p = 0.01,
    inner_glue = "{.data[[inner]]}\nn = {comma(n, 1)}\n{percent(p, 0.1)}",
    outer_glue = "{str_wrap(.data[[outer]], 12)}\nn = {comma(n, 1)}\n{percent(p, 0.1)}",
    inner_palette = direction_colours, outer_palette = region_colours
  )

plotSplitDonut()

all_windows %>% 
  plotSplitDonut(
    inner = colnames(.)[[1]], outer = colnames(.)[[2]], min_p =0.02, 
    inner_glue = "{inner}\n{.data[[inner]]}\n{percent(p, 0.1)}",
    outer_glue = "{outer}\n{.data[[outer]]}\n{percent(p, 0.1)}",
    explode_inner = "Down|Up", explode_outer = "Down|Up", explode_r = 0.4,
    explode_query = "OR", nudge_r = 0.5,
    inner_palette = colours$direction, outer_palette = colours$direction
    inner_legend = FALSE, outer_legend = FALSE
  ) 

plotOverlaps()

  • Calls 1) ComplexUpset::upset() or 2) VennDiagram::draw.*wise.venn()
  • Size-scaled Venn Diagrams are tricky in R
ex <- list(x = letters[1:5], y = letters[c(6:15, 26)], z = letters[c(2, 10:25)])
plotOverlaps(ex, type = "upset", set_col = 1:3, labeller = stringr::str_to_title)

Differential ChIP Signal

ChIP-Seq

  • A protein of interest is:
    1. Cross-linked to DNA
    2. Immuno-Precipitated using antibody for ChIP target
    3. Cross-links reversed \(\implies\) DNA sequenced
  • Sequences should be enriched for DNA-bound regions
  • ChIP targets can be transcription factors, histone marks (e.g. H3K27ac)
    • TFs are generally shorter (300-500bp) regions
    • Histone marks can be far broader (>1kb)
  • Can be quantitative or have informative sequences

Differential Signal/Binding

  • Looking for increase/decrease in binding (or signal) at a genomic region
  • Fundamentally different to RNA-Seq:
    1. Only 0/1/2 fragments per cell
    2. Increase in signal \(\implies\) Target bound in more cells at that site

Differential Binding

  • Forms a key part of the GRAVI workflow
  • GRAVI is to run multiple ChIP targets
    • Need a method which is robust when automatically deployed
    • Normalisation becomes important
  • Can handle targets which are:
    • Cytoplasmic \(\rightarrow\) Nuclear
    • Partially sequestered \(\rightarrow\) More able to bind
    • Just relocated on the DNA
  • Also robust for H3K27ac marks etc

DRMCRL Data

Usually DHT Vs Veh or E2+DHT Vs E2

  • AR shifts to nucleus with DHT-treatment
  • AR relocates ER on the chromosome (Hickey et al. 2021)
  • Does AR bring some/any GATA3 to the nucleus with it?
  • How does this relate to H3K27ac signal changes
    • Much broader peaks to transcription factors
    • Include Differential H3K27ac signal alongside AR/ER/GATA3

Can I develop a differential signal workflow for all situations?

DiffBind

  • Is the most common method for ChIP-Seq analysis (Ross-Innes et al. 2012)
  • Uses defined peaks (e.g. macs2 consensus peaks)
    • Re-centres peaks when loading
    • Fixed width across all peaks (e.g. all 501bp)
  • Defaults to:
    • Library Size Normalisation, i.e log(counts/lib.size) ~ X
    • DESeq2 (just like RNA-Seq)
    • Can also use edgeR & choose alternative normalisation

csaw

  • Developed by Gordon Smyth & Aaron Lun (Lun and Smyth 2016)
  • Uses sliding windows
    • Avoids bias from counting within defined peaks (Lun and Smyth 2014)
    • Regions defined independently of signal
  • After counting across genome \(\implies\) throw away low/no signal windows
  1. No normalisation method by default (we choose)
  2. Statistical Test on remaining windows (edgeR::glmQLF())
  3. Summarise tests across overlapping windows \(\implies\) merged region
    • Tests/p-values are clearly dependent
    • Can end up with tighter and broader regions than DiffBind

An Example

Use the maximal signal window?

Merging Windows

csaw offers multiple options

  1. Take window with highest signal as representative of entire region
  2. Choose the window with the most extreme logFC
    • Often low signal windows
  3. Choose lowest p-value
    • Needs careful adjustment after merging
  4. Use Simes’ Method to merge all \(p\)-values
    • Chooses logFC from lowest p-value as representative

extraChIPs: Window Merging

  • All csaw methods wrapped for slightly different output structure
  • Merge using harmonic-mean p-value (Wilson 2019)
    • \(\text{hmp} =\frac{n}{\sum_i^np_i^{-1}}\) if no weights provided
    • \(\text{hmp} =\frac{\sum_{i=1}^nw_i}{\sum_i^nw_i p_i^{-1}}\) if weights provided
    • Actually uses the asymptotically exact harmonic mean
    • Spent the weekend getting a 100x speed up from harmonicmeanp

extraChIPs: Window Merging

  • Robust to dependence
    • Controls FDR within a set of tests
    • Can also produce strict FWER controlled version for complete dataset
  • Return weighted-mean of logFC/logCPM using \(p_i^{-1}\) for weights
    • e.g. \(\widehat{\text{logFC}} = \frac{\sum_i^np_i^{-1}\text{logFC}_i}{\sum_i^np_1^{-1}}\)

extraChIPs: Window Merging

  • Outperforms Simes’ consistently

Results Example 1

A ~3kb H3K27ac region unique to harmonic mean



  • \(-\log_{10}p\) in the top panel
    • Dashed line is \(-\log_{10}(\text{hmp})\)

  • Average Signal is the middle panel
    • Dashed line is weighted mean

  • logFC in bottom panel
    • Dashed line is weighted mean
  • Vertical line is maximal signal

Normalisation Methods

  • Library Size only lacks significant power
    • No attempt to handle technical variability
  • TMM/RLE are both ok if overall signal is the same in all treatments
  • DRMCRL data has pooled input: normalise to input
  • Is there a general method for automating in a workflow ?

My SuggestionL SQLT

  1. Calculate logCPM
    • Effectively normalises by library size
  1. Smooth Quantile Normalisation (Stephanie C. Hicks et al. 2018)
    • Generalisation of quantile normalisation within groups
    • Reduces technical variability
  1. limma-trend (Law et al. 2014) using normalised logCPM
  1. Merge Windows using Harmonic Mean \(p\)-value (extraChIPs)

Smooth Quantile Normalisation

  • Quantile Normalisation was standard practice for microarrays
    • Gives all samples the identical distribution
    • Reduces technical variability between samples
  • Smooth-QN allows for treatment groups
    • Uses weights to determine areas which diverge between groups
    • Fits a weighted average of expected values

SQN

The Cytoplasmic-Nuclear Shift is clear.
(quantro: \(p_{med} = 0.002\); \(p_{dist} < 5e4\))

Low weights = different distributions

Looks like replicate variability.
(quantro: \(p_{med} = 0.91\); \(p_{dist} = 0.42\))

High weights = same distributions

Is any being brought to the nucleus?
(quantro: \(p_{med} = 0.65\); \(p_{dist} < 5e4\))

Mostly similar distributions

There’s an outlier sample.
(quantro: \(p_{med} = 0.84\); \(p_{dist} = 0.49\))

Mostly the same distributions

SQN: PCA

Not much change on the PCA

Now treatment groups are the biggest source of variability

Now treatment groups are the biggest source of variability

Some rotation…

SQN

  • Reducing replicate variability should yield more power
    • Should be a clear improvement over library size alone?
  • limma-trend similar to voom
    • Uses logCPM
    • Manages mean-variance relationship
  • SQN appears viable even when no real difference between groups
    • Doesn’t seem to introduce any artefacts
  • Could TMM also be applied within groups?

Method Comparison

Performed a comparison using

  1. LSQL: Library Size \(\rightarrow\) edgeR::glmQLF()
  2. TMM: (across all samples) \(\rightarrow\) edgeR::glmQLF()
  3. TMM-Groups: (within groups) \(\rightarrow\) edgeR::glmQLF()
  1. LT: logCPM \(\implies\) limma-trend
  2. SQLT: qsmooth \(\rightarrow\) limma-trend

Performed using

  1. \(H_0\) logFC = 0
  2. \(H_0\) |logFC| < \(\log_2 1.2\) (McCarthy and Smyth 2009)

All merged results using harmonic mean

Overall Results

Group-based methods are best. TMM clearly inappropriate

Library-Size lacks significant power

Library-Size again lacks power

Smooth Quantile now loses out

Directional Results

TMM decreasing sites are obviously errors

LSQL (Library Size Only) found no down sites. No clarity on TMM vs TMM-Groups

SQLT seems to have about half the power
Similar but far less pronounced in 2 other datasets

All Good So Far

  • Group-based strategies looked best where expected
    • SQLT appears poor for H3K27ac (??)
  • How do the methods compare when using treat (McCarthy and Smyth 2009)
    • \(H_0\) |logFC| < \(\log_2 1.2\)
  • Known to control the FDR better than post-hoc filtering using logFC
  • There was a drastic difference
    • SQLT lost significant power compared to TMM-groups
    • Had been competitive so far

GATA3

\(H_0\): logFC = 0

  • SQLT has the most decreasing sites
  • Less competitive for increasing

GATA3

Range-based \(H_0\)

  • SQLT now has only 40 down sites (from 1672)
  • TMM Groups only dropped from 686 to 143

MA Plots: GATA3



  • Small positive bias in Library Size only (LSQL & LT)
  • Small positive bias in TMM & TMM-Groups
    • Is this why more gained sites?
  • No apparent bias in SQLT
    • Is ‘under-performance’ actually good?
  • SQLT also appears more range-restricted

MA Plots: H3K27ac



  • Bias in all methods bar SQLT
  • Most likely due to outlier sample
  • Doesn’t quite explain the results
    • Difference was mainly in Gains

logFC Comparison: LSQL Vs LT



  • Estimates from GLM-QL model are consistently higher
  • Far more so in the lowest signal quartile
  • Similar patterns in most datasets

logFC Comparison: SQLT Vs TMM-Groups



  • Effect is even more pronounced for estimates using within-group normalisation
  • Do I need to change the treat threshold?
    • Does GLM-20% = SQLT-18% in high quartile
    • Does GLM-20% = SQLT-10% in the low quartile?

Final Thoughts

  • Not all estimates of logFC are the same
    • Becomes important when using treat
    • 20% in GLM-QL \(\neq\) 20% under voom or limma-trend
  • GLM-QL approaches will give more significance at the low end
    • Is that what we really want?
  • May well be the same for RNA-Seq
    • Is it the prior.counts restricting the range for logCPM?

Final Thoughts

  • Within group normalisation was effective (quantro \(p < \alpha\))
    • Within groups also helped for ‘different distribution’ datasets
    • All-Sample TMM normalisation otherwise?
    • Some bias was evident at the window level
  • SQLT might still be viable
    • Didn’t appear to disadvantage if not required
    • Reduced technical variability nicely (PCA)
    • Less bias at the window-level
    • More likely to capture higher-signal results
    • Fewer sites \(\implies\) but is this good?

Final Thoughts

  • Harmonic Mean \(p\) still looks really good

(Did I mention my 100x speed-up?)

References

Hickey, Theresa E, Luke A Selth, Kee Ming Chia, Geraldine Laven-Law, Heloisa H Milioli, Daniel Roden, Shalini Jindal, et al. 2021. “The Androgen Receptor Is a Tumor Suppressor in Estrogen Receptor-Positive Breast Cancer.” Nat. Med. 27 (2): 310–20.
Hicks, Stephanie C, and Rafael A Irizarry. 2015. “Quantro: A Data-Driven Approach to Guide the Choice of an Appropriate Normalization Method.” Genome Biol. 16 (1): 117.
Hicks, Stephanie C., Kwame Okrah, Joseph N. Paulson, John Quackenbush, Rafael A. Irizarry, and Hector Corrado Bravo. 2018. “Smooth Quantile Normalization.” Biostatistics 19 (2).
Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. 2014. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-seq Read Counts.” Genome Biol. 15 (2): R29.
Lun, Aaron T L, and Gordon K Smyth. 2014. De Novo Detection of Differentially Bound Regions for ChIP-Seq Data Using Peaks and Windows: Controlling Error Rates Correctly.” Nucleic Acids Res. 42 (11): e95.
———. 2016. “Csaw: A Bioconductor Package for Differential Binding Analysis of ChIP-Seq Data Using Sliding Windows.” Nucleic Acids Res. 44 (5): e45.
McCarthy, Davis J, and Gordon K Smyth. 2009. “Testing Significance Relative to a Fold-Change Threshold Is a TREAT.” Bioinformatics 25 (6): 765–71.
Ross-Innes, Caryn S., Rory Stark, Andrew E. Teschendorff, Kelly A. Holmes, H. Raza Ali, Mark J. Dunning, Gordon D. Brown, et al. 2012. “Differential Oestrogen Receptor Binding Is Associated with Clinical Outcome in Breast Cancer.” Nature 481: –4. http://www.nature.com/nature/journal/v481/n7381/full/nature10730.html.
Wilson, Daniel J. 2019. “The Harmonic Mean p-Value for Combining Dependent Tests.” Proc. Natl. Acad. Sci. U. S. A. 116 (4): 1195–1200.