Estimate a GC Content Distribution From Sequences

Generate a GC content distribution from sequences for a given read length and fragment length

estGcDistn(x, n = 1e+06, rl = 100, fl = 200, fragSd = 30, bins = 101, ...)

# S4 method for class 'ANY'
estGcDistn(x, n = 1e+06, rl = 100, fl = 200, fragSd = 30, bins = 101, ...)

# S4 method for class 'character'
estGcDistn(x, n = 1e+06, rl = 100, fl = 200, fragSd = 30, bins = 101, ...)

# S4 method for class 'DNAStringSet'
estGcDistn(x, n = 1e+06, rl = 100, fl = 200, fragSd = 30, bins = 101, ...)

Arguments

x: A DNAStringSet or the path to a FASTA file
n: The number of reads to sample
rl: The read length to sample
fl: The mean of the fragment lengths sequenced
fragSd: The standard deviation of the fragment lengths being sequenced
bins: The number of bins to estimate
...: Not used

Value

A tibble with two columns, GC_Content and Freq, denoting the proportion of GC and the frequency of occurrence respectively

Details

The function takes the supplied object and returns the theoretical GC content distribution. Using a fixed read length essentially leads to a discrete distribution so the bins argument is used to define the number of bins returned. This defaults to 101 for 0 to 100% inclusive.

The returned values are obtained by interpolating the values obtained during sampling. This avoids returning distributions with gaps and jumps, as would occur when the read length (rl) is set to a value that is not a multiple of 100.
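
The sampling and interpolation idea described above can be sketched in base R. This is only an illustration under stated assumptions and not the code used by estGcDistn: the random reference sequence and the names ref, cumGC, obsX, obsY and sketch are hypothetical, and fragment-length sampling (fl, fragSd) is left out for brevity. With a read length of 75, the observable GC proportions are multiples of 1/75, which do not line up with 101 evenly spaced bins, so the sampled frequencies are linearly interpolated onto the bins.

```r
# Illustrative sketch only: base R, not the ngsReports implementation.
set.seed(42)

# Hypothetical toy reference made of random bases (an assumption for this sketch)
refLen <- 5000
ref <- sample(c("A", "C", "G", "T"), refLen, replace = TRUE)

# Cumulative GC counts allow the GC count of any window to be found by subtraction
cumGC <- c(0, cumsum(ref %in% c("G", "C")))

n <- 2000    # number of reads to sample
rl <- 75     # read length, deliberately not a multiple of 100
bins <- 101  # number of bins to return

# Sample read start positions and compute the GC proportion of each read
starts <- sample.int(refLen - rl + 1, n, replace = TRUE)
gc <- (cumGC[starts + rl] - cumGC[starts]) / rl

# A fixed read length gives a discrete distribution over 0, 1/rl, ..., 1
counts <- tabulate(round(gc * rl) + 1, nbins = rl + 1)
obsX <- seq(0, 1, length.out = rl + 1)
obsY <- counts / sum(counts)

# Linearly interpolate onto evenly spaced bins, then renormalise to sum to 1
binX <- seq(0, 1, length.out = bins)
binY <- approx(obsX, obsY, xout = binX, rule = 2)$y
binY <- binY / sum(binY)

sketch <- data.frame(GC_Content = binX, Freq = binY)
head(sketch)
```

The resulting data frame mirrors the two-column layout described under Value, with one row per bin regardless of the read length chosen.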

Based heavily on https://github.com/mikelove/fastqcTheoreticalGC

Examples

faDir <- system.file("extdata", package = "ngsReports")
faFile <- list.files(faDir, pattern = "fasta", full.names = TRUE)
df <- estGcDistn(faFile, n = 200)