Summarise the Overrepresented sequences found in one or more QC files
summariseOverrep(x, ...)
# S4 method for class 'FastpData'
summariseOverrep(x, step = c("Before", "After"), min_count = 0, ...)
# S4 method for class 'FastpDataList'
summariseOverrep(
x,
min_count = 0,
step = c("Before", "After"),
vals = c("count", "rate"),
fn = c("mean", "sum", "max"),
by = c("reads", "sequence"),
...
)
# S4 method for class 'FastqcDataList'
summariseOverrep(
x,
min_count = 0,
vals = c("Count", "Percentage"),
fn = c("mean", "sum", "max"),
pattern = ".*",
...
)
# S4 method for class 'FastqcData'
summariseOverrep(
x,
min_count = 0,
vals = c("Count", "Percentage"),
fn = c("mean", "sum", "max"),
pattern = ".*",
by = "Filename",
...
)
An object of a suitable class
Not used
Can be 'Before', 'After' or both to obtain data from the Before_filtering or After_filtering modules
Filter sequences with counts less than this value, both before and after filtering
Values to use for creating summaries across multiple files. For FastpDataList objects these can be "count" and/or "rate", whilst for FastqcDataList objects these values can be "Count" and/or "Percentage"
Functions to use when summarising values across multiple files
character vector of columns to summarise by. See dplyr::summarise
Regular expression to filter the Possible_Source column by
A tibble
Tibble columns will vary between Fastp*, FastqcDataList and FastqcData objects. Calling this function on list-type objects will attempt to summarise the presence each over-represented sequence across all files.
In particular, FastqcData objects will provide the requested summary statistics across all sequences within a file
This function prepares a useful summary of all over-represented sequences as reported by either fastp or FastQC
## For operations on a FastpData object
f <- system.file("extdata/fastp.json.gz", package = "ngsReports")
fp <- FastpData(f)
summariseOverrep(fp, min_count = 100)
#> # A tibble: 3 × 9
#> fqName reads sequence Before_count After_count Before_rate After_rate
#> <chr> <chr> <chr> <int> <int> <dbl> <dbl>
#> 1 S010_201703200… read1 AAAAAAA… 863 166 0.0000381 0.00000771
#> 2 S010_201703200… read1 TTTTTTT… 859 124 0.0000379 0.00000576
#> 3 S010_201703200… read2 TTTTTTT… 4504 2050 0.0000987 0.0000476
#> # ℹ 2 more variables: count_change <int>, prop_change <dbl>
## Applying the function to a FastqcDataList
packageDir <- system.file("extdata", package = "ngsReports")
fl <- list.files(packageDir, pattern = "fastqc.zip", full.names = TRUE)
fdl <- FastqcDataList(fl)
summariseOverrep(fdl)
#> # A tibble: 490 × 9
#> Sequence Possible_Source n_samples Count_mean Count_sum Count_max
#> <chr> <chr> <int> <dbl> <int> <int>
#> 1 CTGCAATTGGCGGTTCAGC… Illumina Paire… 1 811 811 811
#> 2 CTACAATTGGCGGTTCAGC… Illumina Paire… 1 784 784 784
#> 3 CTTCAATTGGCGGTTCAGC… Illumina Paire… 1 752 752 752
#> 4 CTCCAATTGGCGGTTCAGC… Illumina Paire… 1 744 744 744
#> 5 TCTGGCATTGCGGTTCAGC… Illumina Paire… 1 268 268 268
#> 6 ATCTTGTATTGGATGTGTC… No Hit 1 203 203 203
#> 7 ATCTTGTATTGGTCAGCTA… No Hit 1 192 192 192
#> 8 TGAATTGTACTAACATAAC… No Hit 1 188 188 188
#> 9 CTTCAATTGAGGAATTGCT… No Hit 1 170 170 170
#> 10 CTCCAATTGTTGGAGAAGT… No Hit 1 162 162 162
#> # ℹ 480 more rows
#> # ℹ 3 more variables: Percentage_mean <dbl>, Percentage_sum <dbl>,
#> # Percentage_max <dbl>
# An alternative viewpoint can be obtained using
fdl |> lapply(summariseOverrep) |> dplyr::bind_rows()
#> # A tibble: 6 × 8
#> Filename n_sequences Count_mean Count_sum Count_max Percentage_mean
#> <chr> <int> <dbl> <int> <int> <dbl>
#> 1 ATTG_R1.fastq 182 72.8 13252 811 0.292
#> 2 ATTG_R2.fastq 163 57.5 9367 230 0.230
#> 3 CCGC_R1.fastq 32 155. 4952 1153 0.695
#> 4 CCGC_R2.fastq 21 103. 2156 509 0.461
#> 5 GACC_R1.fastq 49 40.4 1981 208 0.393
#> 6 GACC_R2.fastq 43 43.1 1855 211 0.419
#> # ℹ 2 more variables: Percentage_sum <dbl>, Percentage_max <dbl>