Find all PWM matches within an XStringSet — getPwmMatches • motifTestR

Find all PWM matches within a set of sequences

Usage

getPwmMatches(
  pwm,
  stringset,
  rc = TRUE,
  min_score = "80%",
  best_only = FALSE,
  break_ties = c("all", "random", "first", "last", "central"),
  mc.cores = 1,
  ...
)

Arguments

pwm: A Position Weight Matrix, list of PWMs or universalmotif list
stringset: An XStringSet
rc: logical(1) Also find matches using the reverse complement of pwm
min_score: The minimum score to return a match
best_only: logical(1) Only return the best match
break_ties: Method for breaking ties when only returning the best match. Ignored when all matches are returned (the default)
mc.cores: Passed to mclapply if passing multiple PWMs
...: Passed to matchPWM

Value

A DataFrame with columns: seq, score, direction, start, end, from_centre, seq_width, and match

The first three columns describe the sequence with matches, the score of the match and whether the match was found using the forward or reverse PWM. The columns start and end describe where the match was found in the sequence, seq_width gives the width of the sequence being queried, and from_centre defines the distance between the centre of the match and the centre of that sequence. The final column contains the matching fragment of the sequence as an XStringSet.

When passing a list of PWMs, a list of the above DataFrames will be returned.

Details

Taking a set of sequences as an XStringSet, find all matches above the supplied score threshold (min_score) for a single Position Weight Matrix (PWM), generally representing a transcription factor binding motif. By default, matches are found using both the PWM as provided and its reverse complement; this can be disabled by setting rc = FALSE.

The function relies heavily on matchPWM and Views for speed.

When choosing to return the best match (best_only = TRUE), only the match with the highest score is returned for each sequence. Should there be tied scores, break_ties determines whether all tied matches are returned (the default), or only the first, the last, the most central, or one chosen at random.

Examples

## Load the example PWM
data("ex_pfm")
esr1 <- ex_pfm$ESR1

## Load the example Peaks
data("ar_er_seq")

## Return all matches
getPwmMatches(esr1, ar_er_seq)
#> DataFrame with 22 rows and 8 columns
#>           seq     score direction     start       end from_centre seq_width
#>     <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1          29   18.0880         F       193       207           0       400
#> 2          34   20.8412         R       321       335         128       400
#> 3          60   17.8088         R       154       168         -39       400
#> 4          62   17.5548         R       206       220          13       400
#> 5          98   20.2850         F        13        27        -180       400
#> ...       ...       ...       ...       ...       ...         ...       ...
#> 18        478   18.9927         R       134       148         -59       400
#> 19        517   19.0738         F       223       237          30       400
#> 20        552   18.4739         F       232       246          39       400
#> 21        575   17.7611         R         4        18        -189       400
#> 22        646   17.5586         R       209       223          16       400
#>               match
#>      <DNAStringSet>
#> 1   AGGTCACCCTGGCCC
#> 2   AGGTCACCGTGACCC
#> 3   AGGTGACCCTGACCT
#> 4   GGGTCACACTGTCCT
#> 5   AGGTCACAATGACCT
#> ...             ...
#> 18  AGGTCACCCTGACCG
#> 19  GGGTCAGCATGACCT
#> 20  AGGACACACTGACCT
#> 21  AGGTCACCCTAACCT
#> 22  AGGTTAGCCTGACCT
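
## Illustrative sketch, not from the package examples: restrict the search
## to the forward strand and raise the threshold, using only the arguments
## listed under Usage. The value "90%" is an arbitrary choice for illustration
getPwmMatches(esr1, ar_er_seq, rc = FALSE, min_score = "90%")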

## Just the best match
getPwmMatches(esr1, ar_er_seq, best_only = TRUE)
#> DataFrame with 22 rows and 8 columns
#>           seq     score direction     start       end from_centre seq_width
#>     <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1          29   18.0880         F       193       207           0       400
#> 2          34   20.8412         R       321       335         128       400
#> 3          60   17.8088         R       154       168         -39       400
#> 4          62   17.5548         R       206       220          13       400
#> 5          98   20.2850         F        13        27        -180       400
#> ...       ...       ...       ...       ...       ...         ...       ...
#> 18        478   18.9927         R       134       148         -59       400
#> 19        517   19.0738         F       223       237          30       400
#> 20        552   18.4739         F       232       246          39       400
#> 21        575   17.7611         R         4        18        -189       400
#> 22        646   17.5586         R       209       223          16       400
#>               match
#>      <DNAStringSet>
#> 1   AGGTCACCCTGGCCC
#> 2   AGGTCACCGTGACCC
#> 3   AGGTGACCCTGACCT
#> 4   GGGTCACACTGTCCT
#> 5   AGGTCACAATGACCT
#> ...             ...
#> 18  AGGTCACCCTGACCG
#> 19  GGGTCAGCATGACCT
#> 20  AGGACACACTGACCT
#> 21  AGGTCACCCTAACCT
#> 22  AGGTTAGCCTGACCT
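
## Illustrative sketch: when returning only the best match, keep the most
## central match should scores be tied. The value "central" is one of the
## break_ties options listed under Usage
getPwmMatches(esr1, ar_er_seq, best_only = TRUE, break_ties = "central")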

## Apply multiple PWMs as a list
getPwmMatches(ex_pfm, ar_er_seq, best_only = TRUE)
#> $ESR1
#> DataFrame with 22 rows and 8 columns
#>           seq     score direction     start       end from_centre seq_width
#>     <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1          29   18.0880         F       193       207           0       400
#> 2          34   20.8412         R       321       335         128       400
#> 3          60   17.8088         R       154       168         -39       400
#> 4          62   17.5548         R       206       220          13       400
#> 5          98   20.2850         F        13        27        -180       400
#> ...       ...       ...       ...       ...       ...         ...       ...
#> 18        478   18.9927         R       134       148         -59       400
#> 19        517   19.0738         F       223       237          30       400
#> 20        552   18.4739         F       232       246          39       400
#> 21        575   17.7611         R         4        18        -189       400
#> 22        646   17.5586         R       209       223          16       400
#>               match
#>      <DNAStringSet>
#> 1   AGGTCACCCTGGCCC
#> 2   AGGTCACCGTGACCC
#> 3   AGGTGACCCTGACCT
#> 4   GGGTCACACTGTCCT
#> 5   AGGTCACAATGACCT
#> ...             ...
#> 18  AGGTCACCCTGACCG
#> 19  GGGTCAGCATGACCT
#> 20  AGGACACACTGACCT
#> 21  AGGTCACCCTAACCT
#> 22  AGGTTAGCCTGACCT
#> 
#> $ANDR
#> DataFrame with 8 rows and 8 columns
#>         seq     score direction     start       end from_centre seq_width
#>   <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1        27   18.9055         F        58        75      -133.5       400
#> 2       110   20.5599         F       235       252        43.5       400
#> 3       285   18.9230         F       205       222        13.5       400
#> 4       519   18.9718         R       264       281        72.5       400
#> 5       701   20.3572         F       329       346       137.5       400
#> 6       704   20.5870         F        68        85      -123.5       400
#> 7       708   20.9669         F       278       295        86.5       400
#> 8       833   23.0102         R       167       184       -24.5       400
#>                match
#>       <DNAStringSet>
#> 1 TGTTCTTTTTTGTTGATT
#> 2 TGTCCTTTTCTGTTTATT
#> 3 TGTTCCTCTCTGTTTACC
#> 4 TGTTCAGCTTTGTTTGCT
#> 5 TGTTCTTTTGTATTTGCT
#> 6 TGTTCTTCTATGTTTATT
#> 7 TGTTCTTTATTATTTGCT
#> 8 TGTTCTTTTTTGTTTGTT
#> 
#> $FOXA1
#> DataFrame with 107 rows and 8 columns
#>           seq     score direction     start       end from_centre seq_width
#>     <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1          12   14.0694         F       199       210         4.5       400
#> 2          16   14.4605         F       177       188       -17.5       400
#> 3          17   15.1441         F       341       352       146.5       400
#> 4          18   14.4417         F       301       312       106.5       400
#> 5          21   14.6369         R       206       217        11.5       400
#> ...       ...       ...       ...       ...       ...         ...       ...
#> 103       793   14.3562         F       194       205        -0.5       400
#> 104       816   14.1175         R        63        74      -131.5       400
#> 105       817   14.9185         R       291       302        96.5       400
#> 106       826   14.5836         F       261       272        66.5       400
#> 107       844   14.3461         F        40        51      -154.5       400
#>              match
#>     <DNAStringSet>
#> 1     TGTTTGCTTTTG
#> 2     TGTTTACTTTCC
#> 3     TGTTTATTTAGG
#> 4     TGTTTATTCTGG
#> 5     TGTTTACTCAAC
#> ...            ...
#> 103   TGTTTACTTTAA
#> 104   TGTTTATTTTAG
#> 105   TGTTTACACAGT
#> 106   TATTTACTTTAG
#> 107   TGTTTACTTTCT
#> 
#> $ZN143
#> DataFrame with 15 rows and 8 columns
#>           seq     score direction     start       end from_centre seq_width
#>     <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1          30   26.8427         F       166       187       -23.5       400
#> 2          67   29.0591         F       210       231        20.5       400
#> 3         118   29.0063         R       205       226        15.5       400
#> 4         182   28.0840         F       225       246        35.5       400
#> 5         330   27.3485         R       196       217         6.5       400
#> ...       ...       ...       ...       ...       ...         ...       ...
#> 11        569   28.1714         R        10        31      -179.5       400
#> 12        750   28.4852         R       151       172       -38.5       400
#> 13        829   26.6710         R       151       172       -38.5       400
#> 14        836   28.9222         R       166       187       -23.5       400
#> 15        837   30.0534         F       216       237        26.5       400
#>                      match
#>             <DNAStringSet>
#> 1   AGCCTGCCGGGAGATGTAGTTC
#> 2   GGCATGCTGGGATTTGTAGTCT
#> 3   TGCCTCCTGGGAAATGTAGTCC
#> 4   TGCATGCTGGGAACTGTAGTCT
#> 5   AGCCTTGTGGGAGTTGTAGTTT
#> ...                    ...
#> 11  GGCATTTTGGGAGTTGTAGTTT
#> 12  CGCATGCTGGGAATTGTAGTTC
#> 13  TGCCCGCTGGGAACTGTAGTCC
#> 14  TGCATGCTGGGATTTGTAGTCC
#> 15  TGCATGCTGGGAGTTGTAGTCT
#> 
#> $ZN281
#> DataFrame with 13 rows and 8 columns
#>           seq     score direction     start       end from_centre seq_width
#>     <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1         109   19.6553         R       369       383         176       400
#> 2         118   20.1871         F        60        74        -133       400
#> 3         122   19.3263         F        95       109         -98       400
#> 4         171   18.5467         R        84        98        -109       400
#> 5         192   19.1012         F       260       274          67       400
#> ...       ...       ...       ...       ...       ...         ...       ...
#> 9         456   21.2625         R       343       357         150       400
#> 10        507   18.6243         R       175       189         -18       400
#> 11        668   20.0793         R       168       182         -25       400
#> 12        763   22.6040         R       274       288          81       400
#> 13        764   18.8379         R       310       324         117       400
#>               match
#>      <DNAStringSet>
#> 1   AGTTGGGGGAGGGGC
#> 2   GGCGGGGGGAGGGGA
#> 3   GAATGGGGGAGGGGC
#> 4   GGATGGGGGAAGGGG
#> 5   GGGAGGGGGCGGGGG
#> ...             ...
#> 9   CGGTGGGGGAGGGGG
#> 10  GGGAGGGGGAGGGAG
#> 11  GGGTGGGGGTGGGGG
#> 12  GGGTGGGGGAGGGGG
#> 13  AGTGGGGGGAGGGGA
#>
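
## Illustrative sketch of working with the results, assuming standard
## DataFrame subsetting: keep ESR1 matches centred within 50bp of the
## sequence centre (an arbitrary cutoff), then count the matches for each PWM
matches <- getPwmMatches(esr1, ar_er_seq)
matches[abs(matches$from_centre) <= 50, ]
all_matches <- getPwmMatches(ex_pfm, ar_er_seq, best_only = TRUE)
vapply(all_matches, nrow, integer(1))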