Summarise the Overrepresented sequences found in one or more QC files

summariseOverrep(x, ...)

# S4 method for class 'FastpData'
summariseOverrep(x, step = c("Before", "After"), min_count = 0, ...)

# S4 method for class 'FastpDataList'
summariseOverrep(
  x,
  min_count = 0,
  step = c("Before", "After"),
  vals = c("count", "rate"),
  fn = c("mean", "sum", "max"),
  by = c("reads", "sequence"),
  ...
)

# S4 method for class 'FastqcDataList'
summariseOverrep(
  x,
  min_count = 0,
  vals = c("Count", "Percentage"),
  fn = c("mean", "sum", "max"),
  pattern = ".*",
  ...
)

# S4 method for class 'FastqcData'
summariseOverrep(
  x,
  min_count = 0,
  vals = c("Count", "Percentage"),
  fn = c("mean", "sum", "max"),
  pattern = ".*",
  by = "Filename",
  ...
)

Arguments

x

An object of a suitable class

...

Not used

step

Can be 'Before', 'After' or both to obtain data from the Before_filtering or After_filtering modules

min_count

Filter sequences with counts less than this value, both before and after filtering

vals

Values to use for creating summaries across multiple files. For FastpDataList objects these can be "count" and/or "rate", whilst for FastqcDataList objects these values can be "Count" and/or "Percentage"

fn

Functions to use when summarising values across multiple files

by

character vector of columns to summarise by. See dplyr::summarise

pattern

Regular expression to filter the Possible_Source column by

Value

A tibble

Tibble columns will vary between Fastp*, FastqcDataList and FastqcData objects. Calling this function on list-type objects will attempt to summarise the presence each over-represented sequence across all files.

In particular, FastqcData objects will provide the requested summary statistics across all sequences within a file

Details

This function prepares a useful summary of all over-represented sequences as reported by either fastp or FastQC

Examples

## For operations on a FastpData object
f <- system.file("extdata/fastp.json.gz", package = "ngsReports")
fp <- FastpData(f)
summariseOverrep(fp, min_count = 100)
#> # A tibble: 3 × 9
#>   fqName          reads sequence Before_count After_count Before_rate After_rate
#>   <chr>           <chr> <chr>           <int>       <int>       <dbl>      <dbl>
#> 1 S010_201703200… read1 AAAAAAA…          863         166   0.0000381 0.00000771
#> 2 S010_201703200… read1 TTTTTTT…          859         124   0.0000379 0.00000576
#> 3 S010_201703200… read2 TTTTTTT…         4504        2050   0.0000987 0.0000476 
#> # ℹ 2 more variables: count_change <int>, prop_change <dbl>

## Applying the function to a FastqcDataList
packageDir <- system.file("extdata", package = "ngsReports")
fl <- list.files(packageDir, pattern = "fastqc.zip", full.names = TRUE)
fdl <- FastqcDataList(fl)
summariseOverrep(fdl)
#> # A tibble: 490 × 9
#>    Sequence             Possible_Source n_samples Count_mean Count_sum Count_max
#>    <chr>                <chr>               <int>      <dbl>     <int>     <int>
#>  1 CTGCAATTGGCGGTTCAGC… Illumina Paire…         1        811       811       811
#>  2 CTACAATTGGCGGTTCAGC… Illumina Paire…         1        784       784       784
#>  3 CTTCAATTGGCGGTTCAGC… Illumina Paire…         1        752       752       752
#>  4 CTCCAATTGGCGGTTCAGC… Illumina Paire…         1        744       744       744
#>  5 TCTGGCATTGCGGTTCAGC… Illumina Paire…         1        268       268       268
#>  6 ATCTTGTATTGGATGTGTC… No Hit                  1        203       203       203
#>  7 ATCTTGTATTGGTCAGCTA… No Hit                  1        192       192       192
#>  8 TGAATTGTACTAACATAAC… No Hit                  1        188       188       188
#>  9 CTTCAATTGAGGAATTGCT… No Hit                  1        170       170       170
#> 10 CTCCAATTGTTGGAGAAGT… No Hit                  1        162       162       162
#> # ℹ 480 more rows
#> # ℹ 3 more variables: Percentage_mean <dbl>, Percentage_sum <dbl>,
#> #   Percentage_max <dbl>

# An alternative viewpoint can be obtained using
fdl |> lapply(summariseOverrep) |> dplyr::bind_rows()
#> # A tibble: 6 × 8
#>   Filename      n_sequences Count_mean Count_sum Count_max Percentage_mean
#>   <chr>               <int>      <dbl>     <int>     <int>           <dbl>
#> 1 ATTG_R1.fastq         182       72.8     13252       811           0.292
#> 2 ATTG_R2.fastq         163       57.5      9367       230           0.230
#> 3 CCGC_R1.fastq          32      155.       4952      1153           0.695
#> 4 CCGC_R2.fastq          21      103.       2156       509           0.461
#> 5 GACC_R1.fastq          49       40.4      1981       208           0.393
#> 6 GACC_R2.fastq          43       43.1      1855       211           0.419
#> # ℹ 2 more variables: Percentage_sum <dbl>, Percentage_max <dbl>