Simulate a set of fixed-width sequences using optional TFBMs

Usage

simSeq(
  n,
  width,
  pfm = NULL,
  nt = c("A", "C", "G", "T"),
  prob = rep(0.25, 4),
  shape1 = 1,
  shape2 = 1,
  as = "DNAStringSet",
  ...
)

Arguments

n

The number of sequences to simulate

width

Width of sequences to simulate

pfm

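Probability Weight/Frequency Matrix (PWM/PFM) for the motif to insert. If NULL (the default), no motif is inserted and only background sequences are simulated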

nt

Nucleotides to include

prob

Sampling probabilities for each nucleotide

shape1, shape2

Shape parameters passed to rbetabinom.ab when sampling the position of the motif within each sequence

as

Class of the returned object. Defaults to DNAStringSet, but other viable options may include 'character', 'CharacterList' or any other class to which a character vector may be coerced.

...

Not used

Value

By default a DNAStringSet will be returned. If possible, the position of any randomly sampled motifs will be included in the mcols element of the returned object.

Details

Using the nucleotides and probabilities provided, a set of sequences can be simulated. By default, this will be a set of 'background' sequences, with letters effectively chosen at random.

If a PWM/PFM is supplied, the shape parameters are first passed to rbetabinom.ab to determine the random positions at which the motif will be placed, with the default parameters representing a discrete uniform distribution. Once positions for the TFBM have been selected, nucleotides will be randomly sampled using the probabilities provided in the PWM, and these motifs will be placed at the randomly sampled positions.
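
The procedure described above can be sketched in base R. This is only an illustration of the idea, not the function's actual implementation: `sim_one()` and `toy_pfm` are hypothetical names, the beta-binomial draw is emulated with `rbeta()` followed by `rbinom()` rather than by calling rbetabinom.ab, and the start position is assumed to range from 1 to `width - ncol(pfm) + 1`. With `shape1 = shape2 = 1` the emulated draw is uniform over the valid start positions.

```r
## Hypothetical helper: simulate one sequence with one embedded motif.
## NOT the simSeq() implementation, just a sketch of the described steps.
sim_one <- function(width, pfm, nt = c("A", "C", "G", "T"),
                    prob = rep(0.25, 4), shape1 = 1, shape2 = 1) {
  motif_width <- ncol(pfm)
  stopifnot(motif_width <= width)

  ## 1. Background: letters sampled independently using 'prob'
  bg <- sample(nt, width, replace = TRUE, prob = prob)

  ## 2. Position: beta-binomial draw on 0..(width - motif_width),
  ##    emulated as p ~ Beta(shape1, shape2), k ~ Binomial(size, p),
  ##    then shifted to a 1-based start (an assumption of this sketch)
  size <- width - motif_width
  p <- rbeta(1, shape1, shape2)
  pos <- rbinom(1, size, p) + 1

  ## 3. Motif: one letter per column, sampled with that column's probabilities
  motif <- apply(pfm, 2, function(col) sample(rownames(pfm), 1, prob = col))

  ## 4. Overwrite the background at the sampled position
  bg[seq(pos, length.out = motif_width)] <- motif
  list(seq = paste(bg, collapse = ""), pos = pos)
}

## Toy 4 x 5 matrix (made-up values) with columns summing to 1
toy_pfm <- matrix(
  c(0.1, 0.1, 0.7, 0.1,
    0.1, 0.1, 0.7, 0.1,
    0.1, 0.1, 0.1, 0.7,
    0.1, 0.7, 0.1, 0.1,
    0.7, 0.1, 0.1, 0.1),
  nrow = 4, dimnames = list(c("A", "C", "G", "T"), NULL)
)

set.seed(1)
sims <- lapply(seq_len(5), function(i) sim_one(20, toy_pfm))
vapply(sims, function(x) x$seq, character(1))
vapply(sims, function(x) x$pos, numeric(1))
```

Changing `shape1` and `shape2` in this sketch skews where the motif tends to land, e.g. `shape1 = 1, shape2 = 5` favours positions near the start of the sequence.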

Examples

## Randomly generate 10x50nt sequences without any TFBMs present
simSeq(10, 50)
#> DNAStringSet object of length 10:
#>      width seq
#>  [1]    50 GCTGCATACAAGCCCAAGTTGCTAATTGAACGCAGTAGGGGTTCTAGAGC
#>  [2]    50 GACTTCGATGGACGAGAGCGGGCGAAGAATAGGTTTCCAAAATGTGTCCT
#>  [3]    50 GGTGCTATAATTGGTTACCCGGAGCTTGAAGATAGCTGTGCATTCAATAA
#>  [4]    50 TGGCCCTGCGTTCGACACTATTGGCACGCGCATCCCATCGTTATCCAGTA
#>  [5]    50 GATGGTCGCATTGTGCAAACATGTATGGGAAGGACGTCCGTGGTGTCTGA
#>  [6]    50 CCGATTTGTGTCGAGATTGGCTTCGCTCAATCGTAATTAATACACTGTGC
#>  [7]    50 TAGTTAGATCCATTAGTCGTTTTAGCGCTTTATTCGAAAGTCCCTCAAAG
#>  [8]    50 TACTTGCACTACGTCACGGAGAAATAACAATGAAGGATTTTGGACTTAGA
#>  [9]    50 AGCTGCAGCCCGGGGTAAATAAGCAGTGAGCCTTTTCATCTATCATTGTC
#> [10]    50 GGGGCAGAGTTCGCTCCAGTCATGACGCAGCGACGGCCCCACACGAGTCA

## Now place a motif at random positions
data('ex_pfm')
sim_seq <- simSeq(10, width = 20, pfm = ex_pfm$ESR1)
sim_seq
#> DNAStringSet object of length 10:
#>      width seq
#>  [1]    20 TGCGCAGGTCACAATGACAT
#>  [2]    20 AGACGGGTTAGCATGCCCCT
#>  [3]    20 CTAAAGAGCACCATGACCCA
#>  [4]    20 AGGGCAGAATGCCCTTCGCT
#>  [5]    20 GGGGGTCAGACTGACCTGAT
#>  [6]    20 CTCGGGGACAGCGTGACCCT
#>  [7]    20 AGGTTGCAATGACTCCTTTT
#>  [8]    20 CAAAATGTCAGCACACCCCC
#>  [9]    20 TCTAGGTCAGCCTGACCCCT
#> [10]    20 GGGAGGGCACCCTGCCCTTC
## The position of the motif within each sequence is included in the mcols
mcols(sim_seq)
#> DataFrame with 10 rows and 1 column
#>          pos
#>    <numeric>
#> 1          6
#> 2          5
#> 3          5
#> 4          1
#> 5          3
#> 6          5
#> 7          1
#> 8          5
#> 9          4
#> 10         4
## Use this to extract the random motifs from the random sequences
i <- mcols(sim_seq)$pos + cumsum(width(sim_seq)) - width(sim_seq)
Views(unlist(sim_seq), start = i, width = 10)
#> Views on a 200-letter DNAString subject
#> subject: TGCGCAGGTCACAATGACATAGACGGGTTAGCAT...TCAGCCTGACCCCTGGGAGGGCACCCTGCCCTTC
#> views:
#>        start end width
#>    [1]     6  15    10 [AGGTCACAAT]
#>    [2]    25  34    10 [GGGTTAGCAT]
#>    [3]    45  54    10 [AGAGCACCAT]
#>    [4]    61  70    10 [AGGGCAGAAT]
#>    [5]    83  92    10 [GGGTCAGACT]
#>    [6]   105 114    10 [GGGACAGCGT]
#>    [7]   121 130    10 [AGGTTGCAAT]
#>    [8]   145 154    10 [ATGTCAGCAC]
#>    [9]   164 173    10 [AGGTCAGCCT]
#>   [10]   184 193    10 [AGGGCACCCT]
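
The offset arithmetic used for `i` above can be checked with plain character vectors. The following is a minimal base R sketch, assuming a motif width of 10 and reusing the first three sequences and positions printed above; the variable names are illustrative only. Each sequence's offset in the concatenated string is the total width of the sequences before it, i.e. `cumsum(w) - w`.

```r
## First three simulated sequences and motif positions, copied from the output above
seqs <- c("TGCGCAGGTCACAATGACAT",
          "AGACGGGTTAGCATGCCCCT",
          "CTAAAGAGCACCATGACCCA")
pos <- c(6, 5, 5)
motif_width <- 10

## Offset of each sequence within the concatenated string
w <- nchar(seqs)
offset <- cumsum(w) - w
start <- pos + offset

## Extract from the concatenated string, as Views() does on unlist(sim_seq)
joined <- paste(seqs, collapse = "")
from_joined <- substring(joined, start, start + motif_width - 1)

## The same motifs extracted directly from each sequence
per_seq <- substr(seqs, pos, pos + motif_width - 1)

identical(from_joined, per_seq)
from_joined
```

Both approaches return the same motifs; working per sequence avoids the offset calculation altogether when the sequences are held as a character vector.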