Find all PWM matches within a set of sequences
Usage
getPwmMatches(
pwm,
stringset,
rc = TRUE,
min_score = "50%",
best_only = FALSE,
break_ties = c("all", "random", "first", "last", "central"),
mc.cores = 1,
...
)
Arguments
- pwm
A Position Weight Matrix, list of PWMs or universalmotif list
- stringset
An XStringSet
- rc
logical(1) Also find matches using the reverse complement of pwm
- min_score
The minimum score to return a match
- best_only
logical(1) Only return the best match
- break_ties
Method for breaking ties when only returning the best match. Ignored when all matches are returned (the default).
- mc.cores
Passed to mclapply if passing multiple PWMs
- ...
Passed to matchPWM
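As a rough illustration of what a percentage such as min_score = "50%" means, the base R sketch below converts it into an absolute score. It assumes the value is interpreted as matchPWM documents its min.score argument, i.e. as a percentage of the highest possible score for the PWM. The helper score_threshold() and the toy matrix are hypothetical and not part of the package.

```r
# Hypothetical helper (not part of the package): turn a min_score value
# into an absolute score threshold for a given PWM.
score_threshold <- function(pwm, min_score) {
  # Highest achievable score: the sum of the column-wise maxima
  max_score <- sum(apply(pwm, 2, max))
  if (is.character(min_score)) {
    # e.g. "50%" becomes 0.5 * max_score
    pct <- as.numeric(sub("%$", "", min_score))
    return(max_score * pct / 100)
  }
  # Numeric values are taken as absolute scores
  as.numeric(min_score)
}

# Toy 4 x 3 PWM with made-up scores (filled column by column)
toy_pwm <- matrix(
  c( 1.2, -0.5, -1.0, -0.8,
    -0.9,  1.5, -1.2, -0.4,
    -1.1, -0.7,  1.0,  0.3),
  nrow = 4,
  dimnames = list(c("A", "C", "G", "T"), NULL)
)

threshold_pct <- score_threshold(toy_pwm, "50%")  # half of 1.2 + 1.5 + 1.0
threshold_abs <- score_threshold(toy_pwm, 2)      # returned unchanged
```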
Value
A DataFrame with columns: seq, score, direction, start,
end, from_centre, seq_width, and match
The first three columns describe the sequence with matches, the score of
the match and whether the match was found using the forward or reverse PWM.
The columns start, end and seq_width describe where the match was found
in the sequence, whilst from_centre defines the distance between the centre
of the match and the centre of the sequence being queried.
The final column contains the matching fragment of the sequence as an
XStringSet.
When passing a list of PWMs, a list of the above DataFrames will be returned.
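The from_centre column can be checked by hand. The sketch below uses the first five ESR1 rows from the Examples and a formula inferred from that output (the midpoint of the match minus half the sequence width); it is not taken from the package source.

```r
# First five rows of the ESR1 output shown in the Examples
start     <- c(216, 187, 176, 132, 193)
end       <- c(230, 201, 190, 146, 207)
seq_width <- rep(400, 5)

# Assumed formula: centre of the match minus the centre of the sequence
from_centre <- (start + end) / 2 - seq_width / 2
from_centre
# 23 -6 -17 -61 0, matching the from_centre column for these rows
```

The same formula gives the half-integer values seen for PWMs of even width, e.g. (200 + 217) / 2 - 200 = 8.5 for the first ANDR row.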
Details
Taking a set of sequences as an XStringSet, find all matches above the
supplied score (i.e. threshold) for a single Position Weight Matrix (PWM),
generally representing a transcription factor binding motif.
By default, matches are found using both the PWM as provided and its reverse
complement; this can be disabled by setting rc = FALSE.
The function relies heavily on matchPWM and Views for speed.
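To make the scanning idea concrete, here is a minimal base R sketch that scores every window of one sequence against a PWM and, optionally, against its reverse complement. It is a slow illustration only: scan_pwm() and the toy matrix are hypothetical, and the function itself delegates this work to matchPWM and Views.

```r
# Hypothetical base R scanner, for illustration only
scan_pwm <- function(pwm, dna, min_score, rc = TRUE) {
  bases  <- strsplit(dna, "")[[1]]
  w      <- ncol(pwm)
  starts <- seq_len(length(bases) - w + 1)
  # Sum the matrix entries selected by each window of w bases
  score_windows <- function(m) {
    vapply(starts, function(i) {
      rows <- match(bases[i:(i + w - 1)], rownames(m))
      sum(m[cbind(rows, seq_len(w))])
    }, numeric(1))
  }
  hits <- data.frame(start = starts, end = starts + w - 1,
                     score = score_windows(pwm), direction = "F")
  if (rc) {
    # Reverse complement: reverse the columns and swap A<->T, C<->G
    pwm_rc <- pwm[c("T", "G", "C", "A"), w:1, drop = FALSE]
    rownames(pwm_rc) <- c("A", "C", "G", "T")
    hits <- rbind(hits,
                  data.frame(start = starts, end = starts + w - 1,
                             score = score_windows(pwm_rc), direction = "R"))
  }
  # Keep only windows at or above the threshold
  hits[hits$score >= min_score, ]
}

# Toy PWM that scores +1 for each base of "ACG" and -1 otherwise
toy_pwm <- matrix(-1, nrow = 4, ncol = 3,
                  dimnames = list(c("A", "C", "G", "T"), NULL))
toy_pwm["A", 1] <- 1
toy_pwm["C", 2] <- 1
toy_pwm["G", 3] <- 1

res <- scan_pwm(toy_pwm, "TTACGTTCGTAA", min_score = 3)
```

Here the forward PWM matches ACG at positions 3 to 5, whilst the reverse complement matches CGT at positions 4 to 6 and 8 to 10.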
When choosing to return the best match (best_only = TRUE), only the match
with the highest score is returned for each sequence.
Should there be tied scores, all tied matches can be returned (the default), or
a single match can be chosen as the first, the last, the most central, or one selected at random.
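The tie-breaking options can be sketched in base R as follows. pick_best() is a hypothetical helper operating on a plain data.frame for a single sequence; it follows the order of the break_ties values in the Usage signature and assumes that "first" and "last" refer to position within the sequence and that "central" means the smallest absolute from_centre. The package may implement these differently.

```r
# Hypothetical helper illustrating the break_ties options
pick_best <- function(hits,
                      break_ties = c("all", "random", "first", "last", "central")) {
  break_ties <- match.arg(break_ties)
  # Keep only the top-scoring matches, ordered by position
  tied <- hits[hits$score == max(hits$score), , drop = FALSE]
  tied <- tied[order(tied$start), , drop = FALSE]
  switch(break_ties,
    all     = tied,
    first   = tied[1, , drop = FALSE],
    last    = tied[nrow(tied), , drop = FALSE],
    central = tied[which.min(abs(tied$from_centre)), , drop = FALSE],
    random  = tied[sample.int(nrow(tied), 1), , drop = FALSE]
  )
}

# Made-up matches within one 400bp sequence: three tied and one weaker
hits <- data.frame(
  start       = c(10, 150, 200, 300),
  end         = c(24, 164, 214, 314),
  from_centre = c(-183, -43, 7, 107),
  score       = c(12.5, 12.5, 9.0, 12.5)
)

pick_best(hits)             # all three tied matches
pick_best(hits, "central")  # the match starting at position 150
```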
Examples
## Load the example PWM
data("ex_pfm")
esr1 <- ex_pfm$ESR1
## Load the example Peaks
data("ar_er_seq")
## Return all matches
getPwmMatches(esr1, ar_er_seq)
#> DataFrame with 190 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 1 17.3522 R 216 230 23 400
#> 2 2 11.9459 R 187 201 -6 400
#> 3 10 15.7958 R 176 190 -17 400
#> 4 24 13.5711 F 132 146 -61 400
#> 5 29 18.0880 F 193 207 0 400
#> ... ... ... ... ... ... ... ...
#> 186 824 11.1652 R 63 77 -130 400
#> 187 826 11.2094 R 196 210 3 400
#> 188 831 14.8580 R 377 391 184 400
#> 189 832 11.1978 R 212 226 19 400
#> 190 849 16.8796 F 313 327 120 400
#> match
#> <DNAStringSet>
#> 1 TGGTCACAGTGACCT
#> 2 AGCCCAGAGTGACCT
#> 3 GGGTCATCCTGTCCC
#> 4 AGGCCACAGGGACCT
#> 5 AGGTCACCCTGGCCC
#> ... ...
#> 186 GGGTCGACCTGATCC
#> 187 AGGTCAGAATGCTCA
#> 188 AAGTCAGACTGTCCT
#> 189 AGAACAAATTGACCT
#> 190 AGGTCAGAATGACCG
## Just the best match
getPwmMatches(esr1, ar_er_seq, best_only = TRUE)
#> DataFrame with 175 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 1 17.3522 R 216 230 23 400
#> 2 2 11.9459 R 187 201 -6 400
#> 3 10 15.7958 R 176 190 -17 400
#> 4 24 13.5711 F 132 146 -61 400
#> 5 29 18.0880 F 193 207 0 400
#> ... ... ... ... ... ... ... ...
#> 171 824 11.1652 R 63 77 -130 400
#> 172 826 11.2094 R 196 210 3 400
#> 173 831 14.8580 R 377 391 184 400
#> 174 832 11.1978 R 212 226 19 400
#> 175 849 16.8796 F 313 327 120 400
#> match
#> <DNAStringSet>
#> 1 TGGTCACAGTGACCT
#> 2 AGCCCAGAGTGACCT
#> 3 GGGTCATCCTGTCCC
#> 4 AGGCCACAGGGACCT
#> 5 AGGTCACCCTGGCCC
#> ... ...
#> 171 GGGTCGACCTGATCC
#> 172 AGGTCAGAATGCTCA
#> 173 AAGTCAGACTGTCCT
#> 174 AGAACAAATTGACCT
#> 175 AGGTCAGAATGACCG
## Apply multiple PWMs as a list
getPwmMatches(ex_pfm, ar_er_seq, best_only = TRUE)
#> $ESR1
#> DataFrame with 175 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 1 17.3522 R 216 230 23 400
#> 2 2 11.9459 R 187 201 -6 400
#> 3 10 15.7958 R 176 190 -17 400
#> 4 24 13.5711 F 132 146 -61 400
#> 5 29 18.0880 F 193 207 0 400
#> ... ... ... ... ... ... ... ...
#> 171 824 11.1652 R 63 77 -130 400
#> 172 826 11.2094 R 196 210 3 400
#> 173 831 14.8580 R 377 391 184 400
#> 174 832 11.1978 R 212 226 19 400
#> 175 849 16.8796 F 313 327 120 400
#> match
#> <DNAStringSet>
#> 1 TGGTCACAGTGACCT
#> 2 AGCCCAGAGTGACCT
#> 3 GGGTCATCCTGTCCC
#> 4 AGGCCACAGGGACCT
#> 5 AGGTCACCCTGGCCC
#> ... ...
#> 171 GGGTCGACCTGATCC
#> 172 AGGTCAGAATGCTCA
#> 173 AAGTCAGACTGTCCT
#> 174 AGAACAAATTGACCT
#> 175 AGGTCAGAATGACCG
#>
#> $ANDR
#> DataFrame with 220 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 18 12.5485 F 200 217 8.5 400
#> 2 20 17.0683 R 176 193 -15.5 400
#> 3 21 14.9645 R 210 227 18.5 400
#> 4 22 13.9410 R 262 279 70.5 400
#> 5 26 12.7758 F 130 147 -61.5 400
#> ... ... ... ... ... ... ... ...
#> 216 833 23.0102 R 167 184 -24.5 400
#> 217 834 16.6073 F 189 206 -2.5 400
#> 218 840 14.0060 F 248 265 56.5 400
#> 219 841 16.1179 R 173 190 -18.5 400
#> 220 847 12.8122 R 19 36 -172.5 400
#> match
#> <DNAStringSet>
#> 1 TGTGTTGAAATATTTACA
#> 2 TGTTCTAGATTATTTATA
#> 3 TCTCCTCTCTTGTTTACT
#> 4 TTTTATATTCTGTTTATA
#> 5 TTTTCCTACAAGTTTACT
#> ... ...
#> 216 TGTTCTTTTTTGTTTGTT
#> 217 TGTTCTTTCGTGTTTGAC
#> 218 TGTGCTCTTCTCTTTGCA
#> 219 TCTGCTTTATTGTTTGTT
#> 220 TTTTTTTTTTTTTTTGCA
#>
#> $FOXA1
#> DataFrame with 563 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 5 10.3544 R 297 308 102.5 400
#> 2 7 13.7470 R 203 214 8.5 400
#> 3 9 11.9835 F 161 172 -33.5 400
#> 4 12 14.0694 F 199 210 4.5 400
#> 5 14 11.6783 F 114 125 -80.5 400
#> ... ... ... ... ... ... ... ...
#> 559 843 10.09923 R 67 78 -127.5 400
#> 560 844 14.34613 F 40 51 -154.5 400
#> 561 845 9.43295 F 243 254 48.5 400
#> 562 846 10.58692 F 371 382 176.5 400
#> 563 847 11.97986 R 208 219 13.5 400
#> match
#> <DNAStringSet>
#> 1 TATTTGCACAGA
#> 2 TGTTTATTCTGT
#> 3 TATTTACAGAGC
#> 4 TGTTTGCTTTTG
#> 5 TGTTTGCAGAGC
#> ... ...
#> 559 TGTTTGTCTTTG
#> 560 TGTTTACTTTCT
#> 561 TATTGACATTAA
#> 562 TGTTTGCAATGG
#> 563 TGTTTATCTTTG
#>
#> $ZN143
#> DataFrame with 39 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 3 24.3993 F 360 381 170.5 400
#> 2 6 16.5389 F 178 199 -11.5 400
#> 3 11 19.9562 F 140 161 -49.5 400
#> 4 30 26.8427 F 166 187 -23.5 400
#> 5 67 29.0591 F 210 231 20.5 400
#> ... ... ... ... ... ... ... ...
#> 35 829 22.4081 F 206 227 16.5 400
#> 36 830 24.5149 F 143 164 -46.5 400
#> 37 836 28.9222 R 166 187 -23.5 400
#> 38 837 30.0534 F 216 237 26.5 400
#> 39 848 22.0774 R 159 180 -30.5 400
#> match
#> <DNAStringSet>
#> 1 AGCGCCCTGGGAAATGTAGTCC
#> 2 CGCCTGCCGGTAGCTGTAGTCC
#> 3 AGCCTCATGGGGGTTGGAGTCC
#> 4 AGCCTGCCGGGAGATGTAGTTC
#> 5 GGCATGCTGGGATTTGTAGTCT
#> ... ...
#> 35 GGCATGCTAGGAGTTGTAGTGT
#> 36 TGGTTTCTGGGAATTGTAGTGT
#> 37 TGCATGCTGGGATTTGTAGTCC
#> 38 TGCATGCTGGGAGTTGTAGTCT
#> 39 GGCACTGTGGGACTCGTAGTCT
#>
#> $ZN281
#> DataFrame with 139 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 1 11.9978 R 160 174 -33 400
#> 2 2 12.4604 F 378 392 185 400
#> 3 9 11.7078 R 88 102 -105 400
#> 4 11 18.2091 R 35 49 -158 400
#> 5 15 16.4907 R 224 238 31 400
#> ... ... ... ... ... ... ... ...
#> 135 811 17.6496 R 69 83 -124 400
#> 136 815 16.6439 F 7 21 -186 400
#> 137 816 13.0790 F 307 321 114 400
#> 138 840 16.4984 F 366 380 173 400
#> 139 846 14.9216 R 284 298 91 400
#> match
#> <DNAStringSet>
#> 1 GGGGTGGGGCGGGGC
#> 2 GGCAGGGGGTGGGCC
#> 3 AGGTGTGGGAGGAGG
#> 4 CGCGGGGGGAGGGGC
#> 5 AGGTGGGGGTTGGGC
#> ... ...
#> 135 CGGAGGGGGCGGGGC
#> 136 GGGTGGAGGTGGGGG
#> 137 TGGGGGTGGAGGGGC
#> 138 AGTAGGGGGTGGGGG
#> 139 TAATGGGGGAGGGAA
#>