Test for motif enrichment within a set of sequences using a background set to derive a NULL hypothesis

testMotifEnrich(
  pwm,
  stringset,
  bg,
  var = "iteration",
  model = c("quasipoisson", "hypergeometric", "poisson", "iteration"),
  sort_by = c("p", "none"),
  mc.cores = 1,
  ...
)

Arguments

pwm

A Position Weight Matrix or list of PWMs

stringset

An XStringSet with equal sequence widths

bg

An XStringSet with the same sequence widths as the test XStringset

var

A column in the mcols element of bg, usually denoting an iteration number

model

The model used for analysis

sort_by

Column to sort results by

mc.cores

Passed to mclapply

...

Passed to getPwmMatches or countPwmMatches

Value

A data.frame with columns: sequences, matches, expected, enrichment, and p, with additional columns Z, est_bg_rate (Poisson), odds_ratio

(Hypergeometric) or Z, sd_bg, n_iter and iter_p (Iterations). The numbers of sequences and matches refer to the test set of sequences, whilst expected is the expected number of matches under the Poisson or iterative null distribution. The ratio of matches to expected is given as enrichment, along with the Z score and p-value. Whilst the Z score is only representative under the Poisson model, it is used to directly estimate p-values under the iterative approach. Under this latter approach, the sd of the matches found in the background sequences is also given, along with the number of iterations and the p-values from permutations testing the one-sided hypothesis hypothesis for enrichment.

It may also be worth noting that if producing background sequences using makeRMRanges with replace = TRUE and force_ol = TRUE, the iterative model corresponds to a bootstrap, given that the test sequences will overlap the background sequences and background ranges are able to be sampled with replacement.

Details

This function offers four alternative models for assessing the enrichment of a motif within a set of sequences, in comparison to a background set of sequences. Selection of the BG sequences plays an important role and, in conjunction with the question being addressed, determines the most appropriate model to use for testing, as described below.

It should also be noted that the larger the BG set of sequences, the larger the computational burden, and results can take far longer to return. For many millions of background sequences, this may run beyond an hour

Descriptions of Models and Use Cases

Hypergeometric Tests

Hypergeometric tests are best suited to the use case where the test set of sequences represents a subset of a larger set, with a specific feature or behaviour, whilst the BG set may be the remainder of the set without that feature. For example, the test set may represent ChIP-Seq binding sites where signal changes in response to treatment, whilst the BG set represents the sites where no changed signal was observed. Testing is one-sided, for enrichment of motifs within the test set.

Due to these relatively smaller sized datasets, setting model = "hypergeometric", will generally return results quickly

Poisson Tests

This approach requires a set of background sequences which should be much larger than the test set of sequences. The parameters for a Poisson model are estimated in a per-sequence manner on the set of BG sequences, and the observed rate of motif-matches within the test set is then tested using poisson.test.

This approach assumes that all matches follow a Poisson distribution, which is often true, but data can also be overdispersed. Given that this model can also return results relatively quickly, is it primarily suitable for data exploration, but not for final results.

Quasi-Poisson Test

The quasipoisson model allows for over-dispersion and will return more conservative results than using the standard Poisson model. Under the method currently implemented here, BG sequences should be divided into blocks (i.e. iterations), identical in size to the test set of sequences. Model parameters are estimated per iteration across the BG set of sequences, with the rate of matches in the test being compared against these.

It is expected that the BG set will matched for the features of interest and chosen using makeRMRanges with a large number of iterations, e.g. n_iter = 1000. Due to this parameterisation, quasipoisson approaches can be computationally time-consuming, as this is effectively an iterative approach.

Iteration

Setting the model as "iteration" performs a non-parametric analysis, with the exception of returning Z-scores under the Central Limit Theorem. Mean and SD of matches is found for each iteration, and used to return Z scores, with p-values returned from both a Z-test and from comparing observed values directly to sampled values obtained from the BG sequences. Sampled values are calculated directly and as such, are limited in precision.

As for the QuasiPoisson model, a very large number of iterations is expected to be used, to ensure the CLT holds, again making this a computationally demanding test. Each iteration/block is expected to be identically-sized to the test set, and matched for any features as appropriate using makeRMRanges().

Examples

## Load the example peaks & the sequences
data("ar_er_peaks")
data("ar_er_seq")
sq <- seqinfo(ar_er_peaks)
## Now sample size-matched ranges 10 times larger. In real-world analyses,
## this set should be sampled as at least 1000x larger, ensuring features
## are matched to your requirements. This example masks regions with known N
## content, including centromeres & telomeres
data("hg19_mask")
set.seed(305)
bg_ranges <- makeRMRanges(
  ar_er_peaks, GRanges(sq)[1], exclude = hg19_mask, n_iter = 10
)

## Convert ranges to DNAStringSets
library(BSgenome.Hsapiens.UCSC.hg19)
#> Loading required package: BSgenome
#> Loading required package: BiocIO
#> Loading required package: rtracklayer
#> 
#> Attaching package: ‘rtracklayer’
#> The following object is masked from ‘package:BiocIO’:
#> 
#>     FileForFormat
genome <- BSgenome.Hsapiens.UCSC.hg19
bg_seq <- getSeq(genome, bg_ranges)

## Test for enrichment of the ESR1 motif
data("ex_pwm")
esr1 <- ex_pwm$ESR1
testMotifEnrich(esr1, ar_er_seq, bg_seq, model = "poisson")
#>   sequences matches expected enrichment        Z            p          fdr
#> 1       229      62     12.3    5.04065 14.17111 6.735011e-24 6.735011e-24
#>   est_bg_rate
#> 1  0.05371179

## Test all motifs
testMotifEnrich(ex_pwm, ar_er_seq, bg_seq, model = "poisson")
#>       sequences matches expected enrichment          Z            p
#> FOXA1       229     334    178.4  1.8721973 11.6496310 2.693669e-25
#> ESR1        229      62     12.3  5.0406504 14.1711088 6.735011e-24
#> ZN281       229      63     48.9  1.2883436  2.0163443 5.287479e-02
#> ZN143       229       1      0.6  1.6666667  0.5163978 4.511884e-01
#> ANDR        229      61     64.9  0.9399076 -0.4841080 7.092503e-01
#>                fdr est_bg_rate
#> FOXA1 1.346835e-24 0.779039301
#> ESR1  1.683753e-23 0.053711790
#> ZN281 8.812465e-02 0.213537118
#> ZN143 5.639855e-01 0.002620087
#> ANDR  7.092503e-01 0.283406114