BiocAsia 2024

transmogR: A Bioconductor package to enable the easy
incorporation of variants into a reference transcriptome

Dr Stevie Pederson (They/Them)
Black Ochre Data Labs
The Kids Research Institute Australia / ANU

November 8, 2024

The PROPHECY Study

  • Currently leading the transcriptomic analysis for the PROPHECY study
    • Preventing Renal, OPhthalmic and Heart Events in CommunitY
  • Study with ~1350 SA-based Aboriginal participants
    • Genomic Variants, DNA Methylation, Transcriptomics, Proteomics, Lipidomics, Metabolomics
    • Inflammatory Marker Panel, Retinal Scans, ECG, Urine Samples
    • \(>\) 1000 metadata categories
  • Will be the largest existing Indigenous dataset once collection is complete

Type II Diabetes and Complications

For Aboriginal Australians, Type II Diabetes rates are extremely high, with younger onset

Similar for Cardiovascular & Chronic Kidney Disease

Motivation

  • Source of this risk remains unknown
    • After controlling for known external factors \(\implies\) still highly increased risk
  • Is there a genetic component?
    • Regulatory regions, Coding Changes, Structural Variation, etc?
  • Is there an epigenetic component?
    • DNA Methylation

If genetic risk \(\implies\) will we be able to see it using GRCh38?


Graph-based approaches for transcriptomics are still too immature

Variation In Aboriginal Australians

  • Can we incorporate into transcriptomic analysis?

  • STARconsensus (Kaminow et al. 2022) for genomic alignments
    • gene-level counts
    • transcript-level analysis?
    • Does it matter?

Consensus variants are those found in >50% of unrelated individuals within a population. Sets shown: the 1000GP AFR population, the complete 1000GP dataset (panhuman), and PROPHECY participants
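The >50% rule above can be sketched in a few lines. This is a minimal illustration only: the function name, the variant identifiers and the use of carrier status per individual (rather than allele frequency) are assumptions, not the pipeline used for PROPHECY.

```python
def consensus_variants(carriers, threshold=0.5):
    """Return variants carried by more than `threshold` of individuals.

    `carriers` maps a variant identifier to a list of booleans, one per
    unrelated individual (True = the individual carries the variant).
    Using carrier status rather than allele frequency is an assumption.
    """
    keep = []
    for variant, status in carriers.items():
        fraction = sum(status) / len(status)
        if fraction > threshold:  # strictly greater than 50%
            keep.append(variant)
    return keep


# Hypothetical toy data: three variants across four individuals
toy = {
    "chr1:100:A>G": [True, True, True, False],     # 75% -> kept
    "chr1:200:C>T": [True, True, False, False],    # exactly 50% -> dropped
    "chr2:50:T>TA": [False, False, True, False],   # 25% -> dropped
}
print(consensus_variants(toy))
```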

transmogR

  • Able to produce a variant-modified reference transcriptome
    • Using a transcriptomic reference \(\implies\) don’t need co-ordinates
    • STARconsensus returns co-ordinates relative to unmodified reference
  • Can also create a variant-modified reference genome
  • This allows for quantification using salmon (or kallisto)

Main Functions

  • transmogrify(): Create variant-modified transcriptome sequences
    • Add optional tags to modified transcript names
  • genomogrify(): Create variant-modified genome sequences
    • Add optional tags
  • digestSalmon():
    • Returns a SummarizedExperiment with assays counts, scaledCounts, TPM, effectiveLength
    • Overdispersions & transcript lengths returned in rowData
    • Can handle divergent references, e.g. parY-masked + chrY-masked

Creating a New Reference

  • Requires a VCF with variants
    • Or a GRanges object with ALT/REF
    • Only tested with SNPs + InDels <50bp
  • Transcripts from GTF or exonsBy()
  • Genome from fasta or BSgenome
  • Takes about 25-30min
    • Need to improve parallelisation
    • Can be RAM intensive \(\implies\) generally run on a server/HPC
  • Output Fasta files using Biostrings::writeXStringSet()
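The core idea of substituting REF alleles with ALT alleles can be sketched on a single linear sequence. This is not transmogR's implementation (which works with genomic ranges, exons and transcripts in R); the function name and toy variants below are hypothetical, and variants are assumed to be non-overlapping with 1-based VCF-style positions.

```python
def apply_variants(seq, variants):
    """Substitute SNPs and short InDels into a sequence.

    `variants` is a list of (pos, ref, alt) tuples with 1-based positions.
    Variants are applied from right to left so that an InDel never shifts
    the coordinates of variants still to be applied.
    """
    out = seq
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1
        end = start + len(ref)
        if out[start:end] != ref:
            raise ValueError(f"REF allele {ref!r} does not match sequence at {pos}")
        out = out[:start] + alt + out[end:]
    return out


# Hypothetical toy example: one SNP, one deletion, one insertion
reference = "ACGTACGTAC"
toy_variants = [(2, "C", "T"), (5, "AC", "A"), (9, "A", "AGG")]
modified = apply_variants(reference, toy_variants)
print(modified)
```

Applying right to left is one simple way to avoid tracking coordinate offsets; it is also why coordinates in the modified sequence no longer match the unmodified reference once InDels are included.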

Does It Make Any Difference?

  • Only tested using haploid transcriptomes
  • Pilot Analysis using 6 participants
    • Included GRCh38 as the worst case
    • Personalised GRCh38 as the best case
    • Also AFR, panhuman and PROPHECY consensus variant sets
  • Ran STARconsensus + transmogR/salmon
    • Observations from STARconsensus (Kaminow et al. 2022) replicated
    • PROPHECY closest to personalised \(\implies\) GRCh38 least similar

Technical Assessment

  • Change is shown relative to personalised references
  • Only a small number of fragments change at the summary level
  • PROPHECY consistently closest to personalised in
    1. Assigned Fragments
    2. Concordant Fragments
    3. Correct Orientation (ISF)

Technical Assessment

  • Looking at reads moving between genes (GRCh38 \(\rightarrow\) PROPHECY)
  • 1 changed alignment \(\neq\) 1 changed count
  • Far more divergent than summary statistics indicate
  • Lots of change in HLA genes
    \(\implies\) stand-alone analysis
  • Reads changing transcript within the same gene were mostly HLA

Cohort Level Analysis

  • 93 participants selected for a CKD analysis by the El-Osta group (Baker Institute)
  • Performed a quick T2D vs No T2D analysis
  • Ran analyses in parallel
    • GRCh38 vs PROPHECY-modified GRCh38
    • PAR-Y excluded for males, chrY excluded for females
  • Compared log2 ratios of scaled counts
    • Incorporates counts AND overdispersions
  • Also transcript-level DTE results

Cohort Level Analysis

  • log2-ratio of scaled counts within each participant
    • GRCh38 vs PROPHECY-modified
    • Showing average change (Mean logFC) vs variation (SD logFC)
    • Also baseline logCPM (from GRCh38)
  • Some transcripts become far more variable (x \(\approx\) 0; y \(\uparrow\))
  • Some become consistently higher/lower in signal estimates
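The per-transcript summary described above can be sketched as follows. The direction of the ratio (modified over GRCh38), the prior count of 0.5 and all names are assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def log_ratio_summary(ref_scaled, mod_scaled, prior=0.5):
    """Mean and SD of within-participant log2 ratios for each transcript.

    Both inputs map transcript id -> list of scaled counts, one value per
    participant, in the same participant order. A small prior count
    (an assumption) avoids taking the log of zero.
    """
    summary = {}
    for tx, ref in ref_scaled.items():
        mod = mod_scaled[tx]
        lfc = [math.log2((m + prior) / (r + prior)) for r, m in zip(ref, mod)]
        summary[tx] = (mean(lfc), stdev(lfc))
    return summary


# Hypothetical scaled counts for two participants
ref = {"tx1": [10.0, 20.0], "tx2": [7.5, 7.5], "tx3": [3.5, 3.5]}
mod = {"tx1": [10.0, 20.0], "tx2": [15.5, 15.5], "tx3": [7.5, 1.5]}
summary = log_ratio_summary(ref, mod)
for tx, (avg, sd) in summary.items():
    print(tx, round(avg, 3), round(sd, 3))
```

Here tx2 is consistently higher under the modified reference (mean shifts, SD stays at zero), while tx3 becomes more variable (mean near zero, SD increases), matching the two patterns in the bullets.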

Change in Transcript-Level Results

  • DTE Ranks: T2D vs No T2D
  • Most transcripts keep consistent ranks between the two references
  • Some change rank substantially
  • Probably time to simulate data for confidence
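A rank comparison of this kind can be sketched by ranking transcripts within each analysis and taking the difference. Ranking by p-value is an assumption here, as are the function names and the values used.

```python
def rank_by_pvalue(pvalues):
    """Rank transcripts from most to least significant (1 = smallest p)."""
    ordered = sorted(pvalues, key=pvalues.get)
    return {tx: i + 1 for i, tx in enumerate(ordered)}


def rank_shifts(p_ref, p_mod):
    """Change in rank (modified minus reference) for shared transcripts."""
    r_ref = rank_by_pvalue(p_ref)
    r_mod = rank_by_pvalue(p_mod)
    shared = r_ref.keys() & r_mod.keys()
    return {tx: r_mod[tx] - r_ref[tx] for tx in shared}


# Hypothetical p-values from the two parallel analyses
p_grch38 = {"tx1": 0.001, "tx2": 0.02, "tx3": 0.5}
p_modified = {"tx1": 0.002, "tx2": 0.6, "tx3": 0.03}
shifts = rank_shifts(p_grch38, p_modified)
print(shifts)
```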

Bonus Extra Self Promotion

  • I personally struggle with the MEME-Suite & conda environment
    \(\implies\) developed motifTestR
    • May have re-invented the wheel
  • Has a test for positional bias within sequences
    • Analogous to centrimo
  • Also tests for enrichment within sequences
    • Poisson, quasi-Poisson, hypergeometric, non-parametric
  • Can cluster motifs and test as a cluster instead of individual motifs

Acknowledgements


Black Ochre Data Labs

  • Alex Brown
  • Jimmy Breen
  • Alastair Ludington
  • Yassine Souilmi
  • Sam Godwin
  • Liza Kretzschmar
  • Sam Buckberry
  • Holly Massacci
  • Katharine Brown
  • Adam Heterick
  • Justine Clark
  • Amanda Richards-Satour
  • Bastien Llamas
  • Mary Brushe
  • Sarah Munns
  • Rose Senesi
  • Kaashifah Bruce

Baker Heart & Diabetes Institute

  • Sam El-Osta
  • Ishant Khurana
  • Scott Maxwell
  • Moshe Olshansky

National Centre for Indigenous Genomics

  • Hardip Patel
  • Azure Hermes

SAHMRI / Wardliparingga

  • Natasha Howard
  • Marlie Frank
  • Odette Pearson
  • Kim Morey

University of Sydney

  • Jean Yang

ALIGN

  • Johanna Barclay
  • Annalee Stearne
  • Louise Lyons

SAGC

  • Paul Wang
  • John Salamon
  • Sen Wang
  • Renee Smith

Victor Chang Cardiac Research Institute

  • Jason Kovacic

References

Baldoni, Pedro L, Yunshun Chen, Soroor Hediyeh-Zadeh, Yang Liao, Xueyi Dong, Matthew E Ritchie, Wei Shi, and Gordon K Smyth. 2024. “Dividing Out Quantification Uncertainty Allows Efficient Assessment of Differential Transcript Expression with edgeR.” Nucleic Acids Res. 52 (3): e13.
Easteal, Simon, Ruth M Arkell, Renzo F Balboa, Shayne A Bellingham, Alex D Brown, Tom Calma, Matthew C Cook, et al. 2020. “Equitable Expanded Carrier Screening Needs Indigenous Clinical and Population Genomic Data.” Am. J. Hum. Genet. 107 (2): 175–82.
Kaminow, Benjamin, Sara Ballouz, Jesse Gillis, and Alexander Dobin. 2022. “Pan-Human Consensus Genome Significantly Improves the Accuracy of RNA-seq Analyses.” Genome Res. 32 (4): 738–49.
Srivastava, Avi, Laraib Malik, Hirak Sarkar, Mohsen Zakeri, Fatemeh Almodaresi, Charlotte Soneson, Michael I Love, Carl Kingsford, and Rob Patro. 2020. “Alignment and Mapping Methodology Influence Transcript Abundance Estimation.” Genome Biol. 21 (1): 239.