Find all PWM matches within a set of sequences

getPwmMatches(
  pwm,
  stringset,
  rc = TRUE,
  min_score = "80%",
  best_only = FALSE,
  break_ties = c("all", "random", "first", "last", "central"),
  mc.cores = 1,
  ...
)

Arguments

pwm

A Position Weight Matrix, list of PWMs or universalmotif list

stringset

An XStringSet

rc

logical(1) Also find matches using the reverse complement of pwm

min_score

The minimum score to return a match

best_only

logical(1) Only return the best match

break_ties

Method for breaking ties when only returning the best match. Ignored when all matches are returned (the default).

mc.cores

Passed to mclapply if passing multiple PWMs

...

Passed to matchPWM

Value

A DataFrame with columns: seq, score, direction, start, end, from_centre, seq_width, and match

The first three columns describe the sequence containing the match, the score of the match, and whether the match was found using the forward or reverse PWM. The columns start and end describe where the match was found in the sequence, seq_width gives the width of the sequence being queried, and from_centre defines the distance between the centre of the match and the centre of that sequence. The final column contains the matching fragment of the sequence as an XStringSet.
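The relationship between start, end, seq_width and from_centre can be checked by hand. The sketch below uses base R only, with four rows copied from the example output on this page. The formula (midpoint of the match minus half the sequence width) is inferred from those printed values and is an assumption, not taken from the package source.

```r
## Rows copied from the example output (ESR1, ESR1, ANDR, FOXA1)
matches <- data.frame(
  start = c(176, 132, 176, 203),
  end = c(190, 146, 193, 214),
  seq_width = 400,
  from_centre = c(-17, -61, -15.5, 8.5)
)

## Assumed formula: centre of the match minus centre of the sequence
match_centre <- (matches$start + matches$end) / 2
seq_centre <- matches$seq_width / 2
calc <- match_centre - seq_centre

## The recomputed values agree with the reported column for these rows
all.equal(calc, matches$from_centre)
```

Negative values therefore indicate a match upstream of the sequence centre, and half-integer values arise when the match has an even width.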

When passing a list of PWMs, a list of the above DataFrames will be returned.

Details

Taking a set of sequences as an XStringSet, find all matches above the supplied score (i.e. threshold) for a single Position Weight Matrix (PWM), generally representing a transcription factor binding motif. By default, matches are found using both the PWM as provided and its reverse complement. This can be disabled by setting rc = FALSE.
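To make the threshold and the reverse-complement search concrete, here is a minimal base R sketch on a toy matrix. The matrix values and the helper score_seq() are hypothetical, and the sketch assumes that a percentage such as "80%" is interpreted as in matchPWM, i.e. as a fraction of the highest score the matrix can produce. It illustrates the idea only and is not how the function is implemented.

```r
## Toy 4 x 3 matrix (hypothetical values): rows are bases, columns are positions
pwm <- matrix(
  c(0.9, 0.1, 0.2,
    0.0, 0.7, 0.1,
    0.1, 0.1, 0.6,
    0.0, 0.1, 0.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "C", "G", "T"), NULL)
)

## Highest possible score: the sum of the column maxima
max_score <- sum(apply(pwm, 2, max))
## Assumed meaning of min_score = "80%"
threshold <- 0.8 * max_score

## Hypothetical helper: score one sequence of the same width as the matrix
score_seq <- function(x, pwm) {
  bases <- strsplit(x, "")[[1]]
  idx <- cbind(match(bases, rownames(pwm)), seq_along(bases))
  sum(pwm[idx])
}

## Reverse complement of the matrix: reverse the rows (A<->T, C<->G) and columns
rc_pwm <- pwm[4:1, ncol(pwm):1]
rownames(rc_pwm) <- rownames(pwm)

score_seq("ACG", pwm) >= threshold  # consensus sequence passes
score_seq("TCG", pwm) >= threshold  # weak first position, fails
## Scoring with the reverse-complement matrix finds the site on the other strand
score_seq("CGT", rc_pwm) >= threshold
```

With rc = FALSE only the first kind of comparison would be made, so a site such as CGT in this toy example would go unreported.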

The function relies heavily on matchPWM and Views for speed.

When choosing to return the best match (best_only = TRUE), only the match with the highest score is returned for each sequence. Should there be tied scores, break_ties determines which are kept: all tied matches (the default, being the first option listed), one chosen at random, the first, the last, or the most central.
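The tie-breaking options can be pictured with a small base R sketch. The tied matches and the pick() helper below are hypothetical, and the interpretation of "first" and "last" as the lowest and highest start position is an assumption; the package may resolve ties differently internally.

```r
## Three hypothetical matches with identical scores in one 400bp sequence
ties <- data.frame(
  start = c(20, 190, 350),
  end = c(34, 204, 364),
  score = 9.1
)
seq_width <- 400
## Distance of each match centre from the sequence centre
from_centre <- (ties$start + ties$end) / 2 - seq_width / 2

## Hypothetical helper mimicking the break_ties options
pick <- function(method) {
  switch(
    method,
    all = ties,
    first = ties[which.min(ties$start), ],
    last = ties[which.max(ties$start), ],
    central = ties[which.min(abs(from_centre)), ],
    random = ties[sample.int(nrow(ties), 1), ]
  )
}

pick("central")  # the match starting at position 190
pick("all")      # all three tied matches
```

Choosing "central" can be useful when sequences are centred on peak summits, whereas "all" keeps every tied match and so may return more than one row per sequence.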

Examples

## Load the example PWM
data("ex_pwm")
esr1 <- ex_pwm$ESR1

## Load the example Peaks
data("ar_er_seq")

## Return all matches
getPwmMatches(esr1, ar_er_seq)
#> DataFrame with 62 rows and 8 columns
#>           seq     score direction     start       end from_centre seq_width
#>     <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1           3     8.644         F       176       190         -17       400
#> 2           3     9.326         R       176       190         -17       400
#> 3           9     8.732         F       132       146         -61       400
#> 4          10     8.748         R       122       136         -71       400
#> 5          12     8.680         F       141       155         -52       400
#> ...       ...       ...       ...       ...       ...         ...       ...
#> 58        203     8.672         R       270       284          77       400
#> 59        204     9.072         R       179       193         -14       400
#> 60        213     9.128         F       261       275          68       400
#> 61        217     8.694         R        89       103        -104       400
#> 62        219     8.832         R        79        93        -114       400
#>               match
#>      <DNAStringSet>
#> 1   GGGACAGGATGACCC
#> 2   GGGTCATCCTGTCCC
#> 3   GGGTAACCCTGACAT
#> 4   GGGTCAGAGAGTCCT
#> 5   AGTTCATCAAGACCT
#> ...             ...
#> 58  AGGCCATCTTGACAC
#> 59  AGGTTTCCCTGACCT
#> 60  AGGCCAACATGACCA
#> 61  GGGGCAACCTGAACT
#> 62  CGGTTACCCTGACCG

## Just the best match
getPwmMatches(esr1, ar_er_seq, best_only = TRUE)
#> DataFrame with 45 rows and 8 columns
#>           seq     score direction     start       end from_centre seq_width
#>     <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1           3     8.644         F       176       190         -17       400
#> 2           3     9.326         R       176       190         -17       400
#> 3           9     8.732         F       132       146         -61       400
#> 4          10     8.748         R       122       136         -71       400
#> 5          12     8.680         F       141       155         -52       400
#> ...       ...       ...       ...       ...       ...         ...       ...
#> 41        202     8.978         F       239       253          46       400
#> 42        203     8.672         R       270       284          77       400
#> 43        204     9.072         R       179       193         -14       400
#> 44        217     8.694         R        89       103        -104       400
#> 45        219     8.832         R        79        93        -114       400
#>               match
#>      <DNAStringSet>
#> 1   GGGACAGGATGACCC
#> 2   GGGTCATCCTGTCCC
#> 3   GGGTAACCCTGACAT
#> 4   GGGTCAGAGAGTCCT
#> 5   AGTTCATCAAGACCT
#> ...             ...
#> 41  AAGTCAACATGACCA
#> 42  AGGCCATCTTGACAC
#> 43  AGGTTTCCCTGACCT
#> 44  GGGGCAACCTGAACT
#> 45  CGGTTACCCTGACCG

## Apply multiple PWMs as a list
getPwmMatches(ex_pwm, ar_er_seq, best_only = TRUE)
#> $ESR1
#> DataFrame with 45 rows and 8 columns
#>           seq     score direction     start       end from_centre seq_width
#>     <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1           3     8.644         F       176       190         -17       400
#> 2           3     9.326         R       176       190         -17       400
#> 3           9     8.732         F       132       146         -61       400
#> 4          10     8.748         R       122       136         -71       400
#> 5          12     8.680         F       141       155         -52       400
#> ...       ...       ...       ...       ...       ...         ...       ...
#> 41        202     8.978         F       239       253          46       400
#> 42        203     8.672         R       270       284          77       400
#> 43        204     9.072         R       179       193         -14       400
#> 44        217     8.694         R        89       103        -104       400
#> 45        219     8.832         R        79        93        -114       400
#>               match
#>      <DNAStringSet>
#> 1   GGGACAGGATGACCC
#> 2   GGGTCATCCTGTCCC
#> 3   GGGTAACCCTGACAT
#> 4   GGGTCAGAGAGTCCT
#> 5   AGTTCATCAAGACCT
#> ...             ...
#> 41  AAGTCAACATGACCA
#> 42  AGGCCATCTTGACAC
#> 43  AGGTTTCCCTGACCT
#> 44  GGGGCAACCTGAACT
#> 45  CGGTTACCCTGACCG
#> 
#> $ANDR
#> DataFrame with 53 rows and 8 columns
#>           seq     score direction     start       end from_centre seq_width
#>     <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1           5  10.10020         R       176       193       -15.5       400
#> 2           7   9.46092         R       376       393       184.5       400
#> 3          18  10.06413         F       194       211         2.5       400
#> 4          23   9.94389         R        24        41      -167.5       400
#> 5          24  10.03607         F       311       328       119.5       400
#> ...       ...       ...       ...       ...       ...         ...       ...
#> 49        221   9.54108         R       350       367       158.5       400
#> 50        222  10.40882         R       124       141       -67.5       400
#> 51        225  10.43687         F       189       206        -2.5       400
#> 52        226   9.99399         F       248       265        56.5       400
#> 53        229   9.61323         R        19        36      -172.5       400
#>                  match
#>         <DNAStringSet>
#> 1   TGTTCTAGATTATTTATA
#> 2   TGTGTTTTTTTTTTTCCA
#> 3   TGTCCCTGTCTGTTTATG
#> 4   TGTTTATTTCTGTTTATC
#> 5   TGTACTTTGGAGTTTACT
#> ...                ...
#> 49  TGTGCTGATTTGATTTCT
#> 50  TGAGCTTGTTTGTTTGCT
#> 51  TGTTCTTTCGTGTTTGAC
#> 52  TGTGCTCTTCTCTTTGCA
#> 53  TTTTTTTTTTTTTTTGCA
#> 
#> $FOXA1
#> DataFrame with 179 rows and 8 columns
#>           seq     score direction     start       end from_centre seq_width
#>     <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1           1     7.582         R       203       214         8.5       400
#> 2           2     7.248         F       161       172       -33.5       400
#> 3           4     7.748         F        72        83      -122.5       400
#> 4           6     7.502         F       155       166       -39.5       400
#> 5           6     7.650         R       187       198        -7.5       400
#> ...       ...       ...       ...       ...       ...         ...       ...
#> 175       224     7.050         F       276       287        81.5       400
#> 176       226     7.078         F       189       200        -5.5       400
#> 177       228     7.496         R       261       272        66.5       400
#> 178       228     7.270         F       371       382       176.5       400
#> 179       229     7.264         R       208       219        13.5       400
#>              match
#>     <DNAStringSet>
#> 1     TGTTTATTCTGT
#> 2     TATTTACAGAGC
#> 3     TGTTTACCTTAT
#> 4     TGTTTACCCAAC
#> 5     TGTTTGCATTGC
#> ...            ...
#> 175   TGTTGACTCACT
#> 176   TATTTACAGATG
#> 177   TCTTTACTTATG
#> 178   TGTTTGCAATGG
#> 179   TGTTTATCTTTG
#> 
#> $ZN143
#> DataFrame with 1 row and 8 columns
#>         seq     score direction     start       end from_centre seq_width
#>   <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1        18    12.814         F       314       335       124.5       400
#>                    match
#>           <DNAStringSet>
#> 1 AGCATTCTGGGCAATGTCATTT
#> 
#> $ZN281
#> DataFrame with 36 rows and 8 columns
#>           seq     score direction     start       end from_centre seq_width
#>     <integer> <numeric>  <factor> <integer> <integer>   <numeric> <integer>
#> 1           2   9.30083         R        88       102        -105       400
#> 2           3   9.23237         F       278       292          85       400
#> 3          11   9.81743         R       286       300          93       400
#> 4          23   9.58506         F       277       291          84       400
#> 5          26  10.63278         F        95       109         -98       400
#> ...       ...       ...       ...       ...       ...         ...       ...
#> 32        222   9.94398         F         7        21        -186       400
#> 33        223   9.85892         F       307       321         114       400
#> 34        224   9.25726         R       124       138         -69       400
#> 35        226   9.97095         F       366       380         173       400
#> 36        228   9.65768         R       284       298          91       400
#>               match
#>      <DNAStringSet>
#> 1   AGGTGTGGGAGGAGG
#> 2   GTGAGGGTGATGGGA
#> 3   TGACGGGGGTGGGGA
#> 4   TGGTGGGGGTGGGTG
#> 5   GAATGGGGGAGGGGC
#> ...             ...
#> 32  GGGTGGAGGTGGGGG
#> 33  TGGGGGTGGAGGGGC
#> 34  AGCTGGGAGAAGGGG
#> 35  AGTAGGGGGTGGGGG
#> 36  TAATGGGGGAGGGAA
#>