Find all PWM matches within a set of sequences
getPwmMatches(
pwm,
stringset,
rc = TRUE,
min_score = "80%",
best_only = FALSE,
break_ties = c("all", "random", "first", "last", "central"),
mc.cores = 1,
...
)
pwm: A Position Weight Matrix, list of PWMs or universalmotif list
stringset: An XStringSet
rc: logical(1) Also find matches using the reverse complement of pwm
min_score: The minimum score to return a match
best_only: logical(1) Only return the best match
break_ties: Method for breaking ties when only returning the best match. Ignored when all matches are returned (the default)
mc.cores: Passed to mclapply if passing multiple PWMs
...: Passed to matchPWM
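As an illustration of what a percentage threshold such as min_score = "80%" could mean, the base R sketch below converts it to an absolute score. It assumes the percentage is taken relative to the highest score the PWM can attain, which is how Biostrings::matchPWM documents its own min.score argument; whether getPwmMatches handles the value identically is an assumption. The matrix toy_pwm and its values are made up for illustration.

```r
## Toy 4 x 3 PWM (rows = bases, columns = motif positions); values are made up
toy_pwm <- matrix(
  c( 1.0, -0.5,  0.2,
    -1.0,  1.5, -0.3,
     0.3, -2.0,  1.1,
    -0.4,  0.1, -1.2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "C", "G", "T"), NULL)
)

## Highest attainable score: take the best-scoring base at every position
max_score <- sum(apply(toy_pwm, 2, max))

## Assumption: "80%" means 80% of that maximum score
min_score <- "80%"
threshold <- max_score * as.numeric(sub("%$", "", min_score)) / 100
threshold
```

Under this assumption, only windows scoring at or above the resulting value would be reported as matches.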
A DataFrame with columns: seq, score, direction, start, end, from_centre,
seq_width, and match
The first three columns describe the sequence with matches, the score of
the match and whether the match was found using the forward or reverse PWM.
The columns start and end describe where the match was found in the sequence,
whilst from_centre defines the distance between the centre of the match and
the centre of the sequence being queried, and seq_width gives the width of
that sequence.
The final column contains the matching fragment of the sequence as an
XStringSet.
When passing a list of PWMs, a list of the above DataFrames will be returned.
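The distance from the centre can be checked by hand against the example output further down. The base R sketch below assumes the value is the midpoint of the match minus half the sequence width; this is consistent with the printed values but is not taken from the package source, and from_centre_sketch is a hypothetical helper, not a package function.

```r
## Hypothetical helper (not part of the package): distance between the
## centre of a match and the centre of the queried sequence.
## Assumption: match centre = (start + end) / 2, sequence centre = seq_width / 2
from_centre_sketch <- function(start, end, seq_width) {
  (start + end) / 2 - seq_width / 2
}

## start, end and seq_width copied from the first rows of the ESR1, ANDR
## and FOXA1 example output
start <- c(176, 176, 203)
end <- c(190, 193, 214)
seq_width <- c(400, 400, 400)
centre_dist <- from_centre_sketch(start, end, seq_width)
centre_dist
```

The three results (-17, -15.5 and 8.5) agree with the from_centre column printed for those rows. Negative values indicate a match upstream of the sequence centre.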
Taking a set of sequences as an XStringSet, find all matches above the
supplied score (i.e. threshold) for a single Position Weight Matrix (PWM),
generally representing a transcription factor binding motif.
By default, matches are found using both the PWM as provided and its reverse
complement. This can be disabled by setting rc = FALSE.
The function relies heavily on matchPWM and Views for speed.
When choosing to return the best match (best_only = TRUE), only the match
with the highest score is returned for each sequence.
Should there be tied scores, either all tied matches are returned (the
default), or a single match is chosen as the first, the last, the most
central, or one picked at random.
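The tie-breaking options can be pictured with a small base R sketch. This is not the package implementation: pick_best is a hypothetical helper operating on a plain data.frame of matches for a single sequence, and it assumes rows are ordered by position so that "first" and "last" refer to the order in the sequence.

```r
## Hypothetical helper (not the package implementation): choose the best
## match from a data.frame holding the matches for ONE sequence.
pick_best <- function(df,
                      break_ties = c("all", "random", "first", "last", "central")) {
  break_ties <- match.arg(break_ties)
  ## Keep only the rows sharing the highest score
  best <- df[df$score == max(df$score), , drop = FALSE]
  if (nrow(best) == 1 || break_ties == "all") return(best)
  i <- switch(
    break_ties,
    first = 1L,                                  # earliest row
    last = nrow(best),                           # latest row
    central = which.min(abs(best$from_centre)),  # closest to the centre
    random = sample.int(nrow(best), 1)           # any one tied row
  )
  best[i, , drop = FALSE]
}

## Toy data: three matches in one sequence, two tied for the top score
toy <- data.frame(
  score = c(9.1, 9.1, 8.7),
  start = c(20, 195, 300),
  from_centre = c(-173, 2, 107)
)
pick_best(toy)                   # returns both tied rows
pick_best(toy, "first")$start    # 20
pick_best(toy, "central")$start  # 195
```

Returning every tied row keeps the result deterministic, at the cost of allowing more than one row per sequence.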
## Load the example PWM
data("ex_pwm")
esr1 <- ex_pwm$ESR1
## Load the example Peaks
data("ar_er_seq")
## Return all matches
getPwmMatches(esr1, ar_er_seq)
#> DataFrame with 62 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 3 8.644 F 176 190 -17 400
#> 2 3 9.326 R 176 190 -17 400
#> 3 9 8.732 F 132 146 -61 400
#> 4 10 8.748 R 122 136 -71 400
#> 5 12 8.680 F 141 155 -52 400
#> ... ... ... ... ... ... ... ...
#> 58 203 8.672 R 270 284 77 400
#> 59 204 9.072 R 179 193 -14 400
#> 60 213 9.128 F 261 275 68 400
#> 61 217 8.694 R 89 103 -104 400
#> 62 219 8.832 R 79 93 -114 400
#> match
#> <DNAStringSet>
#> 1 GGGACAGGATGACCC
#> 2 GGGTCATCCTGTCCC
#> 3 GGGTAACCCTGACAT
#> 4 GGGTCAGAGAGTCCT
#> 5 AGTTCATCAAGACCT
#> ... ...
#> 58 AGGCCATCTTGACAC
#> 59 AGGTTTCCCTGACCT
#> 60 AGGCCAACATGACCA
#> 61 GGGGCAACCTGAACT
#> 62 CGGTTACCCTGACCG
## Just the best match
getPwmMatches(esr1, ar_er_seq, best_only = TRUE)
#> DataFrame with 45 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 3 8.644 F 176 190 -17 400
#> 2 3 9.326 R 176 190 -17 400
#> 3 9 8.732 F 132 146 -61 400
#> 4 10 8.748 R 122 136 -71 400
#> 5 12 8.680 F 141 155 -52 400
#> ... ... ... ... ... ... ... ...
#> 41 202 8.978 F 239 253 46 400
#> 42 203 8.672 R 270 284 77 400
#> 43 204 9.072 R 179 193 -14 400
#> 44 217 8.694 R 89 103 -104 400
#> 45 219 8.832 R 79 93 -114 400
#> match
#> <DNAStringSet>
#> 1 GGGACAGGATGACCC
#> 2 GGGTCATCCTGTCCC
#> 3 GGGTAACCCTGACAT
#> 4 GGGTCAGAGAGTCCT
#> 5 AGTTCATCAAGACCT
#> ... ...
#> 41 AAGTCAACATGACCA
#> 42 AGGCCATCTTGACAC
#> 43 AGGTTTCCCTGACCT
#> 44 GGGGCAACCTGAACT
#> 45 CGGTTACCCTGACCG
## Apply multiple PWMs as a list
getPwmMatches(ex_pwm, ar_er_seq, best_only = TRUE)
#> $ESR1
#> DataFrame with 45 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 3 8.644 F 176 190 -17 400
#> 2 3 9.326 R 176 190 -17 400
#> 3 9 8.732 F 132 146 -61 400
#> 4 10 8.748 R 122 136 -71 400
#> 5 12 8.680 F 141 155 -52 400
#> ... ... ... ... ... ... ... ...
#> 41 202 8.978 F 239 253 46 400
#> 42 203 8.672 R 270 284 77 400
#> 43 204 9.072 R 179 193 -14 400
#> 44 217 8.694 R 89 103 -104 400
#> 45 219 8.832 R 79 93 -114 400
#> match
#> <DNAStringSet>
#> 1 GGGACAGGATGACCC
#> 2 GGGTCATCCTGTCCC
#> 3 GGGTAACCCTGACAT
#> 4 GGGTCAGAGAGTCCT
#> 5 AGTTCATCAAGACCT
#> ... ...
#> 41 AAGTCAACATGACCA
#> 42 AGGCCATCTTGACAC
#> 43 AGGTTTCCCTGACCT
#> 44 GGGGCAACCTGAACT
#> 45 CGGTTACCCTGACCG
#>
#> $ANDR
#> DataFrame with 53 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 5 10.10020 R 176 193 -15.5 400
#> 2 7 9.46092 R 376 393 184.5 400
#> 3 18 10.06413 F 194 211 2.5 400
#> 4 23 9.94389 R 24 41 -167.5 400
#> 5 24 10.03607 F 311 328 119.5 400
#> ... ... ... ... ... ... ... ...
#> 49 221 9.54108 R 350 367 158.5 400
#> 50 222 10.40882 R 124 141 -67.5 400
#> 51 225 10.43687 F 189 206 -2.5 400
#> 52 226 9.99399 F 248 265 56.5 400
#> 53 229 9.61323 R 19 36 -172.5 400
#> match
#> <DNAStringSet>
#> 1 TGTTCTAGATTATTTATA
#> 2 TGTGTTTTTTTTTTTCCA
#> 3 TGTCCCTGTCTGTTTATG
#> 4 TGTTTATTTCTGTTTATC
#> 5 TGTACTTTGGAGTTTACT
#> ... ...
#> 49 TGTGCTGATTTGATTTCT
#> 50 TGAGCTTGTTTGTTTGCT
#> 51 TGTTCTTTCGTGTTTGAC
#> 52 TGTGCTCTTCTCTTTGCA
#> 53 TTTTTTTTTTTTTTTGCA
#>
#> $FOXA1
#> DataFrame with 179 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 1 7.582 R 203 214 8.5 400
#> 2 2 7.248 F 161 172 -33.5 400
#> 3 4 7.748 F 72 83 -122.5 400
#> 4 6 7.502 F 155 166 -39.5 400
#> 5 6 7.650 R 187 198 -7.5 400
#> ... ... ... ... ... ... ... ...
#> 175 224 7.050 F 276 287 81.5 400
#> 176 226 7.078 F 189 200 -5.5 400
#> 177 228 7.496 R 261 272 66.5 400
#> 178 228 7.270 F 371 382 176.5 400
#> 179 229 7.264 R 208 219 13.5 400
#> match
#> <DNAStringSet>
#> 1 TGTTTATTCTGT
#> 2 TATTTACAGAGC
#> 3 TGTTTACCTTAT
#> 4 TGTTTACCCAAC
#> 5 TGTTTGCATTGC
#> ... ...
#> 175 TGTTGACTCACT
#> 176 TATTTACAGATG
#> 177 TCTTTACTTATG
#> 178 TGTTTGCAATGG
#> 179 TGTTTATCTTTG
#>
#> $ZN143
#> DataFrame with 1 row and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 18 12.814 F 314 335 124.5 400
#> match
#> <DNAStringSet>
#> 1 AGCATTCTGGGCAATGTCATTT
#>
#> $ZN281
#> DataFrame with 36 rows and 8 columns
#> seq score direction start end from_centre seq_width
#> <integer> <numeric> <factor> <integer> <integer> <numeric> <integer>
#> 1 2 9.30083 R 88 102 -105 400
#> 2 3 9.23237 F 278 292 85 400
#> 3 11 9.81743 R 286 300 93 400
#> 4 23 9.58506 F 277 291 84 400
#> 5 26 10.63278 F 95 109 -98 400
#> ... ... ... ... ... ... ... ...
#> 32 222 9.94398 F 7 21 -186 400
#> 33 223 9.85892 F 307 321 114 400
#> 34 224 9.25726 R 124 138 -69 400
#> 35 226 9.97095 F 366 380 173 400
#> 36 228 9.65768 R 284 298 91 400
#> match
#> <DNAStringSet>
#> 1 AGGTGTGGGAGGAGG
#> 2 GTGAGGGTGATGGGA
#> 3 TGACGGGGGTGGGGA
#> 4 TGGTGGGGGTGGGTG
#> 5 GAATGGGGGAGGGGC
#> ... ...
#> 32 GGGTGGAGGTGGGGG
#> 33 TGGGGGTGGAGGGGC
#> 34 AGCTGGGAGAAGGGG
#> 35 AGTAGGGGGTGGGGG
#> 36 TAATGGGGGAGGGAA
#>