Alternate References For Transcriptomics

Adelaide Bioinformatics Seminars

Dr Stevie Pederson (They/Them)
Black Ochre Data Labs / ANU

May 3, 2024

Acknowledgement of Country

I would like to acknowledge that many of us are meeting today on Kaurna Country.

I acknowledge the deep feelings of attachment and relationship of the Kaurna people to their Place.

I also pay my respects to the cultural authority of Aboriginal and Torres Strait Islander peoples from other areas of Australia online today, and pay my respects to Elders past, present and emerging.

About Me

Postdoctoral Fellow, Black Ochre Data Labs, Indigenous Genomics, Telethon Kids Institute

2020-2022: Dame Roma Mitchell Cancer Research Labs, University of Adelaide

2014-2020: Bioinformatics Hub, University of Adelaide

2008-2018: Slowest PhD Student in the world…

Dec 2002: First used R & limma

1^st Year B.Sc. in 1986 \(\implies\) Dropped out in 1987

Bioconductor Package Developer

ngsReports is setup for parsing FastQC & Fastp files across large experiments, as well as lots of other log files (AdapterRemoval, cutadapt, STAR, salmon, bowtie etc)
- I still use this most weeks for my own work so it’s actively maintained
extraChIPs is for ChIP-Seq analysis & integration across ChIP targets
- I found DiffBind super-frustrating so wrote this to make my life easier
- Also use many of the utility functions very regularly
Also very pleased for Dr Nora Liu to have sSNAPPY available on Bioc, which is an enrichment package for challenging clinical samples where you have pre/post treatment
Lachlan Baer’s package tadar for dealing with differential allelic representation when using CRISPR etc to generate mutants
Two new packages in Bioc 3.19 (released on Wednesday)
- transmogR relates to the main topic for today
- motifTesR was a side project, but might be useful for many

motifTestR

Was adding motif enrichment testing to the GRAVI workflow
Decided to avoid the MEME-Suite 🤯
- Plays badly with conda
- I struggle to interpret the results
- Not R native (although memes does wrap some of it)

Particularly interested in motif positions within a set of sequence
- Analogous to centrimo
Added enrichment testing because I could

motifTestR: Positional Bias

If no positional bias \(\implies\) motifs will match uniformly across the sequence width

Break sequences into bins
Assign probabilities (\(\pi_i\)) of matches within bins \(i = 1, ... I\)
- Proportion of the total viable positions for a match within each bin
Test matches in bin \(i\) are as expected using \(\pi_i\) \(\implies p_i\)
- binomial.test()
Use harmonic mean p-value to obtain a single p-value for each motif
- Returns ‘enriched sequence regions’ where \(p_i\) < \(HMP(p.)\)

motifTestR: Positional Bias

Can be performed using:
1. absolute distance from centre or
2. the entire sequence width

All sequences must be the same length
Relies on a single ‘best match’ per sequence (but also applies weights)
- Also finds motif matches for you
- Heavy reliance on Biostrings

7,000 seq \(\times\) 400bp \(\times\) 400 motifs \(\implies\) < 30sec (phoenix: 8 cores)
50,000 seq \(\times\) 800bp \(\times\) 400 motifs \(\implies\) ~2min (phoenix: 8 cores)

motifTestR: Positional Bias Plots

A <- top_matches %>% 
  plotMatchPos(
    se = FALSE, abs = TRUE, linewidth = 1/2,
    binwidth = 5
  ) +
  labs(
    x = "Distance From Centre", 
    y = "Proportion", 
    colour = "Motifs In Cluster"
  ) +
  theme(legend.position = "none")
B <- top_matches %>% 
  plotMatchPos(
    type = "cdf", geom = "line", abs = TRUE,
    linewidth = 1 / 2, binwidth = 1
  ) +
  labs(
    x = "Distance From Centre", 
    y = "Proportion", 
    colour = "Representative Motif"
  )
A + B

Sequences were taken from AR ChIP-Seq binding sites

motifTestR: Positional Bias Plots

top_matches %>% 
  plotMatchPos(
    geom = "point", abs =TRUE, use_totals = TRUE,
    binwidth = 1
  ) +
  geom_smooth(
    se = FALSE, colour = "black", 
    linewidth = 1/2, method = 'loess'
  ) +
  facet_wrap(~name, labeller = as_labeller(lb)) +
  theme(legend.position = "none") +
  labs(
    x = "Distance From Sequence Centre", 
    y = "Total Matches"
  )

Show the matches at each position

motifTestR: Positional Bias Plots

motif_list %>% 
    dplyr::filter(altname %in% top_pos) %>% 
    to_list() %>% 
    getPwmMatches(
      fw_seq, abs = TRUE, best_only = TRUE, 
      mc.cores = threads
    ) %>% 
    plotMatchPos(
      abs = TRUE, type = "heatmap", 
      cluster = TRUE, use_totals = TRUE,
      binwidth = 10
    ) + 
    labs(
    x = "Bin", fill = "Total\n Matches"
    )

Clustering & heatmaps can help identify redundancy
Useful for larger sets of motifs / matches

motifTestR: Motif Enrichment

Hypergeometric: Subset A vs Subset B
- Specifically for sequences where one subset changes & the other doesn’t
- Realistically, results are specific to the each BG set

Poisson: SetA Vs BG
- BG Set >>> Set A
- Estimates match rate per sequence in BG \(\implies\) poisson.test()
- Quick(ish) & dirty
- Explicitly assumes matches are Poisson

QuasiPoisson: SetA Vs BG (In Blocks)
- Tests match rate per block (i.e. iteration) in BG
- Allows for overdispersion

Iterations: SetA Vs BG (In Blocks)
- No distributional assumptions (except CLT)
- Derives mean & sd per iteration in BG
  \(\implies\) \(Z\)-score + p-value

motifTestR: `makeRMRanges()`

Added function for obtaining Random Matched Ranges for a test set
Relies on a discrete set of features
- e.g. promoters, enhancers, etc

Samples RMRanges to exactly match the feature distribution of the test set
- Can be sampled in blocks for iterative methods (quasipoisson / iteration)

motifTestR: Motif Enrichment

Is much, much slower… 😞

7000 \(\times\) 400bp \(\times\) 400 motifs (quasipoisson):
- Using 7000 \(\times\) 1000 = 7x10⁶ RMRanges
- 7 minutes (phoenix: 8 cores)

50,000 \(\times\) 800bp \(\times\) 400 motifs (quasipoisson):
- Using 50,000 \(\times\) 1000 = 5x10⁷ RMRanges
- 1hr 35min minutes (phoenix: 8 cores)

400 Vs 6500 \(\times\) 400bp \(\times\) 400 motifs (hypergeometric):
- 35 sec (phoenix: 8 cores)

This dog wants you to try it…

via GIPHY

And Now My Real Talk…

Current Research

PROPHECY (Preventing Renal OPthalmic and Heart Events in CommunitY)
Multi-Omics study led by Prof Alex Brown, with Chief Data Scientist A/Prof Jimmy Breen

Investigating the very high rates of Diabetes within Indigenous Australians
- Also very high rates of complications: CVD & CKD

Is strongly led by Community in partnership with BODL
- Hoping to bring benefits of Precision Medicine
- New approach to Indigenous Health \(\implies\) listen first

Life Expectancy Gaps

Courtesy of A/Prof Jimmy Breen, Black Ochre Data Labs, TKI

Mortality From Diabetes

Why?

Taking all socio-economic indicators \(\implies\) disease risk is still hugely increased for Indigenous Australians
Not only disease risk, but higher rates of profound complications

Are there risk alleles?
- If so, what are they? How do they contribute to risk?
Are there protective alleles?

Is there an epigenetic contribution?
Additional environmental factors?

Multi-Omic Approaches

The Transcriptomics Layer

Whole blood samples
Can we capture dynamics of disease / signatures / biomarkers etc?
Approach will be classical statistics + network approaches

Tried to be excruciatingly careful with design:
- plate layout, pooled replicates, globin-depletion, ERCC spike-ins, RIN scores etc
Library preps are finally underway
Data lands around September

Why Think About Reference Genomes?

Genetic Diversity

Indigenous Australians are almost totally absent from repositories of genetic variation
\(\sim\) 25% of variation within Indigenous Australian populations is unique & poorly characterised (Easteal et al. 2020)

If contributing to disease risk and we use a standard reference \(\implies\) will we see it?

Can we bring the reference closer to our participants’ genomes?

NCIG are very close to a pangenome graph assembled from Indigenous Australians
- A little too far off of to be viable yet
- Monica Guilhaus doing an amazing job figuring out spliced pan-transcriptome graphs (Sibbesen et al. 2023)

Gene-Level Approaches

STARconsensus (Kaminow et al. 2022) enables inclusion of a population-specific consensus variant set
- Also possible for personalised genomes
- Now has a diploid mode for phased, personalised variant sets

Incorporates variants during indexing
- Indexes twice (Standard + Modified)
- Returns alignment co-ordinates relative to the unmodified reference

Simple to obtain counts using standard methods (e.g. featureCounts)

The Downside

Locked into using STAR
- In my hands can’t align full-length, reference-derived transcripts
Counts are gene-level

We understand the distribution of counts at the gene level
- edgeR & DESeq2: well established Negative Binomial methods

Transcript-level counts have additional uncertainty
- Need to be scaled by an over-dispersion estimate
  \(\implies\) can then be treated as Negative Binomial (Baldoni et al. 2024)
- Cannot obtain these from STAR + featureCounts

Transcript-Level Analysis

salmon provides transcript-level counts with overdispersion estimates!
Can we modify a reference transcriptome using a set of variants?

via GIPHY

transmogR + salmon

Takes about 15-20 mins, even on a laptop
Modified transcripts can be optionally tagged:
- Indicating variant types (s/i/d)
- Naming the variant set (e.g. panhuman)
Generally using SNVs + InDels < 50bp

I’ve been using a modified transcriptome & modified genome as the gentrome
- The genome is used by salmon as decoy sequences (Srivastava et al. 2020)

Indexing for salmon still takes a while…

DBPAInT: Developing Best Practices for the Analysis of Indigenous Transcriptomes

The Approach

STARconsensus: using variants from the wrong population still improves performance (Kaminow et al. 2022)

1000 Genomes Project Panhuman
- alleles found in >50% of unrelated across entire project

1000 GP AFR Subpopulation
- alleles found in >50% of unrelated within the AFR population
- most distinct population from the Panhuman

PROPHECY consensus set
- alleles found in > 50% of unrelated members of the study
- Used 3 familial steps to define ‘unrelated’

The Analysis

6 female pilot samples
- All genomes are y-masked
Replicating elements of STARconsensus
Taking GRCh38 as the worst case \(\implies\) changes are improvements?
- Four sets of genomic alignments / sample
- Personalised variants arriving in the next few days (weeks?)

Also running a transmogR + salmon analysis
- Four sets of transcriptomic alignments / sample
- Haven’t figured out how to do a diploid transcriptome yet…

The Variants

PROPHECY Variants Overlapping Exons

Key Questions

Are technical metrics improved?
Does it really make a difference?
Does any difference actually capture the underlying biology?

STAR is deterministic for given reference
- Change the reference even slightly \(\implies\) no guarantees

salmon is not deterministic

Some of the changes will just be noise, but how much is noise?

Technical Analyses

No real change to number of reads aligning (STAR or salmon)
Very noticeable changes within alignments
(Nearly) all technical metrics appear to be supportive of a variant modified reference

First step is to check uniquely aligned reads from STARconsensus…

Uniquely Aligning Reads (STARconsensus)

All variant sets show an increase in the number of uniquely-aligned reads (10-30,000)

PROPHECY shows a very slight increase in the number of CIGAR M-bases per uniquely-aligned read

However…

About 1 in 300 reads moves
About 30-40,000 reads remain uniquely aligned but in a different location

How much is noise?
Are the changes shared between samples?
Do the changed reads overlap a variant?
Where exactly are they moving from and to?

Reads Overlapping Variants

Taking reads where gapped alignments overlap a variant

chr12 to chr1 shared by all
Also chr9 to chr7
chrX to chr19 shared by 1, 2, 6
chr12 to chr16 shared by 2 \(\rightarrow\) 5

Reads Overlapping Variants

Taking all samples combined

Increase in alignments to protein coding genes
Decrease in pseudogenes

Common Genes Being Impacted

Taking the 20 most common shifts

HLA-A
HLA-H
Ribosomal proteins
Histone Proteins

Quick Summary

These are only the uniquely-aligned reads from STARconsensus
- Don’t overlap alignment positions in either reference
- Haven’t looked at unmapped \(\rightarrow\) mapped, or multi-mappers
Looks like it may be consistent & true biology?
I got these results Wed night so I’m still processing them mentally

Differential Gene Expression Analysis

GRCh38 vs PROPHECY-modified
Identical input reads
All changes are alignment induced

Genes Impacted Variably

Taking expression estimates within each reference:

Find the difference in counts per gene & sample
Take mean & sd of differences

Consistent changes will be low-variability, far from zero
Inconsistent changes high-variability but near zero
Like looking through an MA plot with variability as 3^rd dimension

What about `salmon` + `transmogR`

Key Differences to STAR

Only ~30% of reads align uniquely
- Largest NH:i tag is ~24,000
- 90% of reads align with NH:i < 10

BAM files directly connect read ID to transcript
- Makes comparison much simpler

How do the transcript-level counts change?
How do the overdispersion estimates change?

Salmon Metrics

Increase in Properly-Paired alignments (+3000)
PROPHECY has strongest increase in correctly oriented fragments (+10,000 ISF)
Decrease in umapped reads (-2000)

Minor change in library sizes
Similar numbers of detected transcripts (>1 count)

Overdispersion Estimates

Looking at how the number of transcripts / gene also impacts these

Scaled Transcript-Level Counts

This seems quite acceptable

Variable Transcripts

Tracking Individual Reads

Can easily track a read & all it’s transcripts
Subset data to reads with NH:i \(\leq 10\)
Subset reads to those aligning to a modified transcript either before or after

Unique-mappers are far less informative for transcript-level alignments
Haven’t checked for moves between chromosomes
- Transcript & Gene-Level moves are more fine-grained

Tracking Reads Which Change Transcripts

Reads changing transcript within a gene

Tracking Reads Which Change Transcripts

Reads changing transcript to different genes

Changing Transcripts: Gene-Level Categories

Shows a nice decrease in reads mapping nowhere

Changing Transcripts: Transcript-Level Categories

Lots of change between but no categories really sing out

Where To From Here

Personalised Variant sets will give us the best estimate of ground-truth
- Will run STARconsensus in Diploid mode
- Haven’t figured out how to do this for transmogR + salmon
- Will just choose randomly from Het sites (???)

Running a mini 90-sample cohort to see how it really plays at scale
- Previously analysed by Prof Sam El-Osta’s team (Monash)
- Have a specific subset of complications
How will WGCNA modules be impacted?
How will true DGE / DTE results be impacted?

Acknowledgements

Black Ochre Data Labs

Alex Brown
Jimmy Breen
Sam Buckberry
Yassine Souilmi
Bastien Llamas
Katharine Browne
Liza Kretzschmar
Alastair Ludington
Holly Massacci
Sam Godwin
Kaashifah Bruce
Rebecca Simpson
Sarah Munns
Ashlee Thomson

TKI / ALIGN

Johanna Barclay
Amanda Richards-Satour
Justine Clark
Rose Senesci
Analee Stearne
Louise Lyons
Dawn Lewis
Mary Brushe
Karrina DeMasi
Phoebe McColl

NCIG

Hardip Patel

SAHMRI

Tash Howard
Marlie Frank

SAGC

Sen Wang
Paul Wang
Renee Smith

University of Adelaide

Lachlan Baer
Monica Guilhaus
Wenjun (Nora) Liu
Megan Monaghan

References

Baldoni, Pedro L, Yunshun Chen, Soroor Hediyeh-Zadeh, Yang Liao, Xueyi Dong, Matthew E Ritchie, Wei Shi, and Gordon K Smyth. 2024. “Dividing Out Quantification Uncertainty Allows Efficient Assessment of Differential Transcript Expression with edgeR.” Nucleic Acids Res. 52 (3): e13.

Easteal, Simon, Ruth M Arkell, Renzo F Balboa, Shayne A Bellingham, Alex D Brown, Tom Calma, Matthew C Cook, et al. 2020. “Equitable Expanded Carrier Screening Needs Indigenous Clinical and Population Genomic Data.” Am. J. Hum. Genet. 107 (2): 175–82.

Kaminow, Benjamin, Sara Ballouz, Jesse Gillis, and Alexander Dobin. 2022. “Pan-Human Consensus Genome Significantly Improves the Accuracy of RNA-seq Analyses.” Genome Res. 32 (4): 738–49.

Sibbesen, Jonas A, Jordan M Eizenga, Adam M Novak, Jouni Sirén, Xian Chang, Erik Garrison, and Benedict Paten. 2023. “Haplotype-Aware Pantranscriptome Analyses Using Spliced Pangenome Graphs.” Nat. Methods, January.

Srivastava, Avi, Laraib Malik, Hirak Sarkar, Mohsen Zakeri, Fatemeh Almodaresi, Charlotte Soneson, Michael I Love, Carl Kingsford, and Rob Patro. 2020. “Alignment and Mapping Methodology Influence Transcript Abundance Estimation.” Genome Biol. 21 (1): 239.

Alternate References For Transcriptomics

Acknowledgement of Country

About Me

Bioconductor Package Developer

motifTestR

motifTestR: Positional Bias

motifTestR: Positional Bias

motifTestR: Positional Bias Plots

motifTestR: Positional Bias Plots

motifTestR: Positional Bias Plots

motifTestR: Motif Enrichment

motifTestR: makeRMRanges()

motifTestR: Motif Enrichment

And Now My Real Talk…

Current Research

Life Expectancy Gaps

Mortality From Diabetes

Why?

Multi-Omic Approaches

The Transcriptomics Layer

Why Think About Reference Genomes?

Genetic Diversity

Gene-Level Approaches

The Downside

Transcript-Level Analysis

transmogR + salmon

DBPAInT: Developing Best Practices for the Analysis of Indigenous Transcriptomes

The Approach

The Analysis

The Variants

PROPHECY Variants Overlapping Exons

Key Questions

Technical Analyses

Uniquely Aligning Reads (STARconsensus)

However…

Reads Overlapping Variants

Reads Overlapping Variants

Common Genes Being Impacted

Quick Summary

Differential Gene Expression Analysis

Genes Impacted Variably

What about salmon + transmogR

Key Differences to STAR

Salmon Metrics

Overdispersion Estimates

Scaled Transcript-Level Counts

Variable Transcripts

Tracking Individual Reads

Tracking Reads Which Change Transcripts

Tracking Reads Which Change Transcripts

Changing Transcripts: Gene-Level Categories

Changing Transcripts: Transcript-Level Categories

Where To From Here

Acknowledgements

Black Ochre Data Labs

TKI / ALIGN

NCIG

SAHMRI

SAGC

University of Adelaide

References

motifTestR: `makeRMRanges()`

What about `salmon` + `transmogR`