Test motif enrichment using a background set of sequences
Source:R/testMotifEnrich.R
testMotifEnrich.Rd
Test for motif enrichment within a set of sequences using a background set to derive a NULL hypothesis
Arguments
- pwm
A Position Weight Matrix or list of PWMs
- stringset
An XStringSet with equal sequence widths
- bg
An XStringSet with the same sequence widths as the test XStringset
- var
A column in the mcols element of bg, usually denoting an iteration number
- model
The model used for analysis
- sort_by
Column to sort results by
- mc.cores
Passed to mclapply
- ...
Passed to getPwmMatches or countPwmMatches
Value
A data.frame with columns: sequences
, matches
, expected
, enrichment
,
and p
, with additional columns Z
, est_bg_rate
(Poisson), odds_ratio
(Hypergeometric) or Z
, sd_bg
, n_iter
and iter_p
(Iterations).
The numbers of sequences and matches refer to the test set of sequences,
whilst expected is the expected number of matches under the Poisson or
iterative null distribution. The ratio of matches to expected is given as
enrichment
, along with the Z score and p-value. Whilst the Z score is only
representative under the Poisson model, it is used to directly estimate
p-values under the iterative approach.
Under this latter approach, the sd of the matches found in the background
sequences is also given, along with the number of iterations and the p-values
from permutations testing the one-sided hypothesis hypothesis for enrichment.
It may also be worth noting that if producing background sequences using
makeRMRanges with replace = TRUE
and force_ol = TRUE
, the
iterative model corresponds to a bootstrap, given that the test sequences
will overlap the background sequences and background ranges are able to be
sampled with replacement.
Details
This function offers four alternative models for assessing the enrichment of a motif within a set of sequences, in comparison to a background set of sequences. Selection of the BG sequences plays an important role and, in conjunction with the question being addressed, determines the most appropriate model to use for testing, as described below.
It should also be noted that the larger the BG set of sequences, the larger the computational burden, and results can take far longer to return. For many millions of background sequences, this may run beyond an hour
Descriptions of Models and Use Cases
Hypergeometric Tests
Hypergeometric tests are best suited to the use case where the test set of sequences represents a subset of a larger set, with a specific feature or behaviour, whilst the BG set may be the remainder of the set without that feature. For example, the test set may represent ChIP-Seq binding sites where signal changes in response to treatment, whilst the BG set represents the sites where no changed signal was observed. Testing is one-sided, for enrichment of motifs within the test set.
Due to these relatively smaller sized datasets, setting model = "hypergeometric", will generally return results quickly
Poisson Tests
This approach requires a set of background sequences which should be much larger than the test set of sequences. The parameters for a Poisson model are estimated in a per-sequence manner on the set of BG sequences, and the observed rate of motif-matches within the test set is then tested using poisson.test.
This approach assumes that all matches follow a Poisson distribution, which is often true, but data can also be over-dispersed. Given that this model can also return results relatively quickly, is it primarily suitable for data exploration, such as quickly checking for expected behaviours, but not for final results.
Quasi-Poisson Test
The quasipoisson model allows for over-dispersion and will return more conservative results than using the standard Poisson model. Under the method currently implemented here, BG sequences should be divided into blocks (i.e. iterations), identical in size to the test set of sequences. Model parameters are estimated per iteration across the BG set of sequences, with the rate of matches in the test set being compared against these blocks. This ensures more conservative results that if analysing test and bg sequences as collections of individual sequences.
It is expected that the BG set will matched for the features of interest and chosen using makeRMRanges with a large number of iterations, e.g. n_iter = 1000. Due to this parameterisation, quasipoisson approaches can be computationally time-consuming, as this is effectively an iterative approach.
Iteration
Setting the model as "iteration" performs a non-parametric analysis, with the exception of returning Z-scores under the Central Limit Theorem. Mean and SD of matches is found for each iteration, and used to return Z scores, with p-values returned from both a Z-test and from comparing observed values directly to sampled values obtained from the BG sequences. Sampled values are calculated directly and as such, are limited in precision.
As for the QuasiPoisson model, a very large number of iterations is expected
to be used, to ensure the CLT holds, again making this a computationally
demanding test.
Each iteration/block is expected to be identically-sized to the test set,
and matched for any features as appropriate using makeRMRanges()
.
Examples
## Load the example peaks & the sequences
data("ar_er_peaks")
data("ar_er_seq")
sq <- seqinfo(ar_er_peaks)
## Now sample size-matched ranges 10 times larger. In real-world analyses,
## this set should be sampled as at least 1000x larger, ensuring features
## are matched to your requirements. This example masks regions with known N
## content, including centromeres & telomeres
data("hg19_mask")
set.seed(305)
bg_ranges <- makeRMRanges(
ar_er_peaks, GRanges(sq)[1], exclude = hg19_mask, n_iter = 10
)
## Convert ranges to DNAStringSets
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
bg_seq <- getSeq(genome, bg_ranges)
## Test for enrichment of the ESR1 motif
data("ex_pfm")
esr1 <- ex_pfm$ESR1
testMotifEnrich(esr1, ar_er_seq, bg_seq, model = "poisson")
#> sequences matches expected enrichment Z p fdr
#> 1 849 22 0.5 44 30.40559 1.315112e-28 1.315112e-28
#> est_bg_rate
#> 1 0.0005889282
## Test all motifs
testMotifEnrich(ex_pfm, ar_er_seq, bg_seq, model = "poisson")
#> sequences matches expected enrichment Z p fdr
#> ZN143 849 21 0.0 Inf Inf 0.000000e+00 0.000000e+00
#> FOXA1 849 113 25.5 4.431373 17.327582 4.223239e-37 1.055810e-36
#> ESR1 849 22 0.5 44.000000 30.405592 1.315112e-28 2.191853e-28
#> ANDR 849 8 1.4 5.714286 5.578018 1.065480e-04 1.331850e-04
#> ZN281 849 14 6.4 2.187500 3.004164 7.912704e-03 7.912704e-03
#> est_bg_rate
#> ZN143 0.0000000000
#> FOXA1 0.0300353357
#> ESR1 0.0005889282
#> ANDR 0.0016489988
#> ZN281 0.0075382803