Summarise Overrepresented Sequences — summariseOverrep • ngsReports

Summarise the Overrepresented sequences found in one or more QC files

summariseOverrep(x, ...)

# S4 method for class 'FastpData'
summariseOverrep(x, step = c("Before", "After"), min_count = 0, ...)

# S4 method for class 'FastpDataList'
summariseOverrep(
  x,
  min_count = 0,
  step = c("Before", "After"),
  vals = c("count", "rate"),
  fn = c("mean", "sum", "max"),
  by = c("reads", "sequence"),
  ...
)

# S4 method for class 'FastqcDataList'
summariseOverrep(
  x,
  min_count = 0,
  vals = c("Count", "Percentage"),
  fn = c("mean", "sum", "max"),
  pattern = ".*",
  ...
)

# S4 method for class 'FastqcData'
summariseOverrep(
  x,
  min_count = 0,
  vals = c("Count", "Percentage"),
  fn = c("mean", "sum", "max"),
  pattern = ".*",
  by = "Filename",
  ...
)

Arguments

x: An object of a suitable class
...: Not used
step: Can be 'Before', 'After' or both to obtain data from the Before_filtering or After_filtering modules
min_count: Filter sequences with counts less than this value, both before and after filtering
vals: Values to use for creating summaries across multiple files. For FastpDataList objects these can be "count" and/or "rate", whilst for FastqcDataList objects these values can be "Count" and/or "Percentage"
fn: Functions to use when summarising values across multiple files
by: character vector of columns to summarise by. See dplyr::summarise
pattern: Regular expression to filter the Possible_Source column by

Value

A tibble

Tibble columns will vary between Fastp*, FastqcDataList and FastqcData objects. Calling this function on list-type objects will attempt to summarise the presence each over-represented sequence across all files.

In particular, FastqcData objects will provide the requested summary statistics across all sequences within a file

Details

This function prepares a useful summary of all over-represented sequences as reported by either fastp or FastQC

Examples

## For operations on a FastpData object
f <- system.file("extdata/fastp.json.gz", package = "ngsReports")
fp <- FastpData(f)
summariseOverrep(fp, min_count = 100)
#> # A tibble: 3 × 9
#>   fqName          reads sequence Before_count After_count Before_rate After_rate
#>   <chr>           <chr> <chr>           <int>       <int>       <dbl>      <dbl>
#> 1 S010_201703200… read1 AAAAAAA…          863         166   0.0000381 0.00000771
#> 2 S010_201703200… read1 TTTTTTT…          859         124   0.0000379 0.00000576
#> 3 S010_201703200… read2 TTTTTTT…         4504        2050   0.0000987 0.0000476 
#> # ℹ 2 more variables: count_change <int>, prop_change <dbl>

## Applying the function to a FastqcDataList
packageDir <- system.file("extdata", package = "ngsReports")
fl <- list.files(packageDir, pattern = "fastqc.zip", full.names = TRUE)
fdl <- FastqcDataList(fl)
summariseOverrep(fdl)
#> # A tibble: 490 × 9
#>    Sequence             Possible_Source n_samples Count_mean Count_sum Count_max
#>    <chr>                <chr>               <int>      <dbl>     <int>     <int>
#>  1 CTGCAATTGGCGGTTCAGC… Illumina Paire…         1        811       811       811
#>  2 CTACAATTGGCGGTTCAGC… Illumina Paire…         1        784       784       784
#>  3 CTTCAATTGGCGGTTCAGC… Illumina Paire…         1        752       752       752
#>  4 CTCCAATTGGCGGTTCAGC… Illumina Paire…         1        744       744       744
#>  5 TCTGGCATTGCGGTTCAGC… Illumina Paire…         1        268       268       268
#>  6 ATCTTGTATTGGATGTGTC… No Hit                  1        203       203       203
#>  7 ATCTTGTATTGGTCAGCTA… No Hit                  1        192       192       192
#>  8 TGAATTGTACTAACATAAC… No Hit                  1        188       188       188
#>  9 CTTCAATTGAGGAATTGCT… No Hit                  1        170       170       170
#> 10 CTCCAATTGTTGGAGAAGT… No Hit                  1        162       162       162
#> # ℹ 480 more rows
#> # ℹ 3 more variables: Percentage_mean <dbl>, Percentage_sum <dbl>,
#> #   Percentage_max <dbl>

# An alternative viewpoint can be obtained using
fdl |> lapply(summariseOverrep) |> dplyr::bind_rows()
#> # A tibble: 6 × 8
#>   Filename      n_sequences Count_mean Count_sum Count_max Percentage_mean
#>   <chr>               <int>      <dbl>     <int>     <int>           <dbl>
#> 1 ATTG_R1.fastq         182       72.8     13252       811           0.292
#> 2 ATTG_R2.fastq         163       57.5      9367       230           0.230
#> 3 CCGC_R1.fastq          32      155.       4952      1153           0.695
#> 4 CCGC_R2.fastq          21      103.       2156       509           0.461
#> 5 GACC_R1.fastq          49       40.4      1981       208           0.393
#> 6 GACC_R2.fastq          43       43.1      1855       211           0.419
#> # ℹ 2 more variables: Percentage_sum <dbl>, Percentage_max <dbl>