BiocAsia 2024

The PROPHECY Study

Currently leading the transcriptomic analysis for the PROPHECY study
- Preventing Renal, OPhthalmic and Heart Events in CommunitY
Study with ~1350 SA-based Aboriginal participants
- Genomic Variants, DNA Methylation, Transcriptomics, Proteomics, Lipidomics, Metabolomics
- Inflammatory Marker Panel, Retinal Scans, ECG, Urine Samples
- \(>\) 1000 metadata categories
Will be the largest Indigenous dataset in existence once collection is complete

Type II Diabetes and Complications

For Aboriginal Australians, Type II Diabetes rates are extremely high with younger onset¹

Similar for Cardiovascular & Chronic Kidney Disease

Motivation

Source of this risk remains unknown
- After controlling for known external factors \(\implies\) still highly increased risk

Is there a genetic component?
- Regulatory regions, Coding Changes, Structural Variation, etc?
Is there an epigenetic component?
- DNA Methylation

If genetic risk \(\implies\) will we be able to see it using GRCh38?

Graph-based approaches for transcriptomics are still too immature

Variation In Aboriginal Australians

About 25% of genetic variation in Aboriginal Australians is unique (Easteal et al. 2020)

Can we incorporate into transcriptomic analysis?
STARconsensus (Kaminow et al. 2022) for genomic alignments
- gene-level counts
- transcript-level analysis?
- Does it matter?

*Consensus variants*: variants found in >50% of unrelated individuals within a population. Sets derived from the 1000GP AFR population and from the complete 1000GP dataset (panhuman) are compared with the set derived from PROPHECY participants
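The >50% rule above can be sketched in a few lines. This is only an illustration on made-up data: the function name `consensus_variants`, the toy genotypes, and the choice to count carriers (rather than allele copies) are all assumptions, not the procedure used by STARconsensus or in PROPHECY.

```python
from collections import Counter

def consensus_variants(genotypes, threshold=0.5):
    """Return variants carried by more than `threshold` of individuals.

    `genotypes` maps each (unrelated) individual to the set of variants
    they carry; a variant is a (chrom, pos, ref, alt) tuple.
    Counting carriers instead of allele copies is a simplifying assumption.
    """
    n = len(genotypes)
    carriers = Counter(v for variants in genotypes.values() for v in variants)
    return {v for v, k in carriers.items() if k / n > threshold}

# Hypothetical toy cohort of four unrelated individuals
toy = {
    "ind1": {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")},
    "ind2": {("chr1", 100, "A", "G")},
    "ind3": {("chr1", 100, "A", "G"), ("chr2", 40, "GT", "G")},
    "ind4": {("chr1", 250, "C", "T")},
}
consensus = consensus_variants(toy)
print(consensus)
```

Only the chr1:100 variant (3 of 4 carriers) passes; the chr1:250 variant sits at exactly 50% and is excluded because the threshold is strict.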

transmogR

Able to produce a variant-modified reference transcriptome
- Using a transcriptomic reference \(\implies\) don’t need co-ordinates
- STARconsensus returns co-ordinates relative to unmodified reference
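To make the idea of a variant-modified transcriptome concrete, here is a minimal sketch on a toy sequence. It is not transmogR's implementation or API: the helper names `apply_variants` and `modified_transcript` are hypothetical, only plus-strand transcripts are handled, and variants overlapping an exon boundary are simply skipped.

```python
def apply_variants(seq, variants):
    """Substitute ALT for REF at 1-based positions in `seq`.

    Variants are applied right-to-left so that an InDel never shifts the
    coordinates of a variant still waiting to be applied.
    """
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq

def modified_transcript(genome, exons, variants):
    """Build a transcript from 1-based inclusive exon ranges on a toy genome."""
    pieces = []
    for ex_start, ex_end in exons:
        exon_seq = genome[ex_start - 1:ex_end]
        # keep only variants whose REF allele lies entirely within this exon,
        # re-expressed in exon-relative coordinates
        local = [
            (pos - ex_start + 1, ref, alt)
            for pos, ref, alt in variants
            if ex_start <= pos and pos + len(ref) - 1 <= ex_end
        ]
        pieces.append(apply_variants(exon_seq, local))
    return "".join(pieces)

genome = "ACGTACGTACGTACGTACGT"           # hypothetical 20bp chromosome
exons = [(1, 6), (11, 18)]                # two exons of one transcript
variants = [(3, "G", "T"),                # SNP in exon 1
            (11, "GT", "G"),              # 1bp deletion in exon 2
            (8, "T", "C")]                # intronic SNP, ignored here
transcript = modified_transcript(genome, exons, variants)
print(transcript)
```

Because the output is a transcript sequence rather than a genome, downstream quantification never needs to translate coordinates back to the unmodified reference.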

Can also create a variant-modified reference genome
- Used by salmon as decoy sequences (Srivastava et al. 2020)
- Not used for any coordinate-based information

This allows for quantification using salmon (or kallisto)
- Overdispersion estimates returned for transcript-level analysis (Baldoni et al. 2024)

Main Functions

transmogrify(): Create variant-modified transcriptome sequences
- Add optional tags to modified transcript names

genomogrify(): Create variant-modified genome sequences
- Add optional tags

digestSalmon():
- Returns a SummarizedExperiment with assays counts, scaledCounts, TPM, effectiveLength
- Overdispersions & transcript lengths returned in rowData
- Can handle divergent references, e.g. parY-masked + chrY-masked

Creating a New Reference

Requires a VCF with variants
- Or a GRanges object with ALT/REF
- Only tested with SNPs + InDels <50bp
Transcripts from GTF or exonsBy()
Genome from fasta or BSgenome
Takes about 25-30min
- Need to improve parallelisation
- Can be RAM intensive \(\implies\) generally run on a server/HPC
Write FASTA files using Biostrings::writeXStringSet()

Does It Make Any Difference?

Only tested using haploid transcriptomes
Pilot Analysis using 6 participants
- Included GRCh38 as worst-case
- Personalised GRCh38 as best-case
- Also AFR, panhuman and PROPHECY consensus variant sets
Ran STARconsensus + transmogR/salmon
- Observations from STARconsensus (Kaminow et al. 2022) replicated
- PROPHECY closest to personalised; GRCh38 least similar

Technical Assessment

Change is shown relative to personalised references
Only a small number of fragments change at the summary level
PROPHECY consistently closest to personalised in:
1. Assigned Fragments
2. Concordant Fragments
3. Correct Orientation (ISF)

Technical Assessment

Looking at reads moving between genes (GRCh38 \(\rightarrow\) PROPHECY)
1 Changed alignment \(\neq\) 1 changed count
Far more divergent than summary statistics indicate
Lots of change in HLA genes
\(\implies\) stand-alone analysis
Reads changing transcript within the same gene were mostly HLA
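One way to read "1 changed alignment ≠ 1 changed count" is that reads can swap between genes without altering a gene's total. The sketch below tallies such moves on made-up read assignments; the function names, gene labels, and data are hypothetical and this is not the analysis code behind the slide.

```python
from collections import Counter

def gene_transitions(before, after):
    """Count reads whose assigned gene differs between two alignments.

    `before` and `after` map read IDs to a gene ID (None if unassigned).
    """
    moves = Counter()
    for read in before.keys() | after.keys():
        old, new = before.get(read), after.get(read)
        if old != new:
            moves[(old, new)] += 1
    return moves

def net_change(moves):
    """Net change in read count per gene implied by the transitions."""
    net = Counter()
    for (old, new), n in moves.items():
        if old is not None:
            net[old] -= n
        if new is not None:
            net[new] += n
    return dict(net)

# Hypothetical assignments under the two references
grch38   = {"r1": "geneA", "r2": "geneB", "r3": "geneA", "r4": "geneC"}
prophecy = {"r1": "geneB", "r2": "geneA", "r3": "geneC", "r4": "geneC"}

moves = gene_transitions(grch38, prophecy)
net = net_change(moves)
print(sum(moves.values()), "reads moved;", net)
```

Here three reads change gene, yet geneB ends with the same count it started with, which is why read-level tracking can look far more divergent than summary statistics.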

Cohort Level Analysis

93 participants selected for a CKD analysis by the El-Osta group (Baker Institute)
Performed a quick T2D vs No T2D analysis
Ran analyses in parallel
- GRCh38 vs PROPHECY-modified GRCh38
- PAR-Y excluded for males, chrY excluded for females

Compared log₂ ratios of scaled counts
- Incorporates counts AND overdispersions
Also transcript-level DTE results

Cohort Level Analysis

log₂ ratio of scaled counts within each participant
- GRCh38 vs PROPHECY-modified
- Showing average change (Mean logFC) vs variation (SD logFC)
- Also baseline logCPM (from GRCh38)
Some transcripts become far more variable (x \(\approx\) 0; y \(\uparrow\))
Some become consistently higher/lower in signal estimates
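The comparison above can be sketched for a single transcript. Scaled counts are taken here as counts divided by the overdispersion, following Baldoni et al. (2024); the numbers, the 0.5 offset, and the function names are assumptions for illustration only.

```python
import math
from statistics import mean, stdev

def scaled(counts, overdispersion):
    """Divide estimated counts by the transcript's overdispersion."""
    return [c / overdispersion for c in counts]

def log2_ratios(ref_counts, alt_counts, offset=0.5):
    """Per-participant log2(alt / ref); `offset` (an assumption) avoids log(0)."""
    return [math.log2((a + offset) / (r + offset))
            for r, a in zip(ref_counts, alt_counts)]

# One transcript, four participants (made-up numbers)
grch38 = scaled([120, 80, 200, 150], overdispersion=2.0)
modified = scaled([118, 95, 160, 210], overdispersion=1.5)

lfc = log2_ratios(grch38, modified)
print(f"mean logFC = {mean(lfc):.3f}, SD logFC = {stdev(lfc):.3f}")
```

A transcript with mean logFC near zero but a large SD is one that became more variable between references; a mean far from zero with a small SD indicates a consistent shift in the signal estimate.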

Change in Transcript-Level Results

DTE Ranks: T2D vs No T2D
Most transcripts are also consistent
Some do change ranks substantially
Probably time to simulate data for confidence
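A simple way to quantify rank changes between two DTE analyses is sketched below. The p-values and transcript names are invented, and ranking by p-value with ties broken by name is an assumption rather than the method used for the slide.

```python
def ranks(pvalues):
    """Rank transcripts by ascending p-value (1 = most significant).

    Ties are broken by transcript name so the result is deterministic.
    """
    ordered = sorted(pvalues, key=lambda tx: (pvalues[tx], tx))
    return {tx: i for i, tx in enumerate(ordered, start=1)}

def rank_shift(p_ref, p_alt):
    """Rank in the alternative analysis minus rank in the reference analysis."""
    r_ref, r_alt = ranks(p_ref), ranks(p_alt)
    shared = r_ref.keys() & r_alt.keys()
    return {tx: r_alt[tx] - r_ref[tx] for tx in shared}

# Hypothetical T2D vs No T2D p-values under each reference
p_grch38   = {"tx1": 0.001, "tx2": 0.01, "tx3": 0.2,  "tx4": 0.5}
p_prophecy = {"tx1": 0.002, "tx2": 0.3,  "tx3": 0.04, "tx4": 0.6}

shift = rank_shift(p_grch38, p_prophecy)
print(shift)
```

Negative values mark transcripts that moved up the ranking with the modified reference; most entries staying at zero would match the "mostly consistent" picture described above.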

Bonus Extra Self Promotion

I personally struggle with the MEME-Suite & conda environment
\(\implies\) developed motifTestR
- May have re-invented the wheel
Has a test for positional bias within sequences
- Analogous to centrimo
Also tests for enrichment within sequences
- poisson, quasipoisson, hypergeometric, non-parametric
Can cluster motifs and test as a cluster instead of individual motifs
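As a reminder of what a hypergeometric enrichment test computes, here is a textbook sketch using only the standard library. It is not motifTestR's implementation, and the function name and counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n sequences from N, of which K contain the motif."""
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

# Made-up example: 10 sequences in total, 4 contain the motif;
# 5 are test sequences and 3 of those contain the motif
p = hypergeom_upper_tail(k=3, n=5, K=4, N=10)
print(f"p = {p:.4f}")
```

With counts this small the test has little power; the example is only meant to show which four numbers go into the calculation.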

Acknowledgements

Black Ochre Data Labs

Alex Brown
Jimmy Breen
Alastair Ludington
Yassine Souilmi
Sam Godwin
Liza Kretzschmar
Sam Buckberry
Holly Massacci
Katharine Brown
Adam Heterick
Justine Clark
Amanda Richards-Satour
Bastien Llamas
Mary Brushe
Sarah Munns
Rose Senesi
Kaashifah Bruce

Baker Heart & Diabetes Institute

Sam El-Osta
Ishant Khurana
Scott Maxwell
Moshe Olshansky

National Centre for Indigenous Genomics

Hardip Patel
Azure Hermes

SAHMRI / Wardliparingga

Natasha Howard
Marlie Frank
Odette Pearson
Kim Morey

University of Sydney

Jean Yang

ALIGN

Johanna Barclay
Annalee Stearne
Louise Lyons

SAGC

Paul Wang
John Salamon
Sen Wang
Renee Smith

Victor Chang Cardiac Research Institute

Jason Kovacic

References

Baldoni, Pedro L, Yunshun Chen, Soroor Hediyeh-Zadeh, Yang Liao, Xueyi Dong, Matthew E Ritchie, Wei Shi, and Gordon K Smyth. 2024. “Dividing Out Quantification Uncertainty Allows Efficient Assessment of Differential Transcript Expression with edgeR.” Nucleic Acids Res. 52 (3): e13.

Easteal, Simon, Ruth M Arkell, Renzo F Balboa, Shayne A Bellingham, Alex D Brown, Tom Calma, Matthew C Cook, et al. 2020. “Equitable Expanded Carrier Screening Needs Indigenous Clinical and Population Genomic Data.” Am. J. Hum. Genet. 107 (2): 175–82.

Kaminow, Benjamin, Sara Ballouz, Jesse Gillis, and Alexander Dobin. 2022. “Pan-Human Consensus Genome Significantly Improves the Accuracy of RNA-seq Analyses.” Genome Res. 32 (4): 738–49.

Srivastava, Avi, Laraib Malik, Hirak Sarkar, Mohsen Zakeri, Fatemeh Almodaresi, Charlotte Soneson, Michael I Love, Carl Kingsford, and Rob Patro. 2020. “Alignment and Mapping Methodology Influence Transcript Abundance Estimation.” Genome Biol. 21 (1): 239.