Apply two filters to counts generated using sliding windows

dualFilter(
  x,
  bg = NULL,
  ref,
  q = 0.5,
  logCPM = TRUE,
  keep.totals = TRUE,
  bin.size = NULL,
  prior.count = 2,
  BPPARAM = bpparam()
)

Arguments

x

RangedSummarizedExperiment containing sample counts

bg

RangedSummarizedExperiment containing background/input counts, or alternate method for selecting samples from within x, such as a logical, numeric or character vector

ref

GRanges object containing ranges where signal is expected

q

The upper percentile of the reference ranges expected to be returned when tuning the filtering criteria

logCPM

logical(1) Add a logCPM assay to the returned data

keep.totals

logical(1) Keep the original library sizes or replace using only the retained windows

bin.size

Bin sizes when calling filterWindowsControl. If not specified will default to the largest of 2000bp or 10x the window size

prior.count

Passed to filterWindowsControl and filterWindowsProportion

BPPARAM

Settings for running in parallel

Value

A RangedSummarizedExperiment which is a filtered subset of the original object. If requested the assay "logCPM" will be added (TRUE by default)

Details

This function will take sliding (or tiling) windows for it's input as a RangedSummarizedExperiment object. The dual strategy of applying filterWindowsControl and filterWindowsProportion will then be applied. A set of reference ranges for which signal is expected is used to refine the filtering criteria.

Cutoff values are found for both signal relative to input and overall signal, such that the 100*q% of the (sliding) windows which overlap a reference range will be returned, along with any others which match the dual filtering criteria. In general, higher values of q will return more windows as those with weak signal and a marginal overlap with a reference range will be returned. Lower values will ensure that fewer windows, generally with the strongest signal, are retained. Cutoff values for both criteria are added to the metadata element of the returned object.

If setting bg = NULL the filterWindowsControl step will be ignored and only the filterWindowsProportion will be used. This should only be performed if no Input sample is available.

Please note that the any .bam files referred to in the supplied objects must be accessible to this function. It will not run on a separate machine or file structure to that which the original sliding windows were prepared. Please see the example/vignette for runnable code.

Examples

# \donttest{
## Taken from the differential_binding vignette
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#>  dplyr     1.1.4      readr     2.1.5
#>  forcats   1.0.0      stringr   1.5.1
#>  lubridate 1.9.3      tidyr     1.3.1
#>  purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#>  lubridate::%within%() masks IRanges::%within%()
#>  ggplot2::Position()   masks BiocGenerics::Position(), base::Position()
#>  dplyr::collapse()     masks IRanges::collapse()
#>  dplyr::combine()      masks Biobase::combine(), BiocGenerics::combine()
#>  dplyr::count()        masks matrixStats::count()
#>  dplyr::desc()         masks IRanges::desc()
#>  tidyr::expand()       masks S4Vectors::expand()
#>  dplyr::filter()       masks stats::filter()
#>  dplyr::first()        masks S4Vectors::first()
#>  dplyr::lag()          masks stats::lag()
#>  purrr::reduce()       masks GenomicRanges::reduce(), IRanges::reduce()
#>  dplyr::rename()       masks S4Vectors::rename()
#>  lubridate::second()   masks S4Vectors::second()
#>  lubridate::second<-() masks S4Vectors::second<-()
#>  dplyr::slice()        masks IRanges::slice()
#>  Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Rsamtools)
#> Loading required package: Biostrings
#> Loading required package: XVector
#> 
#> Attaching package: 'XVector'
#> The following object is masked from 'package:purrr':
#> 
#>     compact
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
library(csaw)
library(BiocParallel)
library(rtracklayer)
## For this function we need a set of counts using sliding windows and the
## original BamFiles from which they were taken
## First we'll set up the bam file list
bfl <- system.file(
    "extdata", "bam", c("ex1.bam", "ex2.bam", "input.bam"), package = "extraChIPs"
    ) %>%
    BamFileList() %>%
    setNames(c("ex1", "ex2", "input"))

## Then define the readParam settings for csaw::readParam()
rp <- readParam(
    pe = "none",
    dedup = TRUE,
    restrict = "chr10"
)

## Now we can form our sliding window object with the counts.
wincounts <- windowCounts(
    bam.files = bfl,
    spacing = 60,
    width = 180,
    ext = 200,
    filter = 1,
    param = rp
)
## As this is a subset of reads, add the initial library sizes for accuracy
## Note that this step is not normally required
wincounts$totals <- c(964076L, 989543L, 1172179L)

## We should also update the metadata for our counts
wincounts$sample <- colnames(wincounts)
wincounts$treat <- as.factor(c("ctrl", "treat", NA))
colData(wincounts)
#> DataFrame with 3 rows and 6 columns
#>                    bam.files    totals       ext      rlen      sample    treat
#>                  <character> <integer> <integer> <integer> <character> <factor>
#> ex1   /__w/_temp/Library/e..    964076       200        73         ex1    ctrl 
#> ex2   /__w/_temp/Library/e..    989543       200        74         ex2    treat
#> input /__w/_temp/Library/e..   1172179       200        74       input    NA   

## The function dualFilter requires a set of peaks which will guide the
## filtering step. This indicate where genuine signal is likely to be found
## and will perform the filtering based on a) signal above the input, and
## b) The overall signal level, using the guide set of peaks to inform the
## cutoff values for inclusion
peaks <- import.bed(
    system.file("extdata", "peaks.bed.gz", package = "extraChIPs")
)
filtcounts <- dualFilter(
    x = wincounts, bg = "input", ref = peaks,
    q = 0.8 # Better to use q = 0.5 on real data
)
filtcounts
#> class: RangedSummarizedExperiment 
#> dim: 101 2 
#> metadata(7): spacing width ... final.ext cuts
#> assays(2): counts logCPM
#> rownames: NULL
#> rowData names(1): overlaps_ref
#> colnames(2): ex1 ex2
#> colData names(6): bam.files totals ... sample treat

# }