Check and clean up a set of variants for downstream compatibility

Usage

cleanVariants(var, ...)

# S4 method for class 'GRanges'
cleanVariants(var, ol_vars = "fail", ref_col = "REF", alt_col = "ALT", ...)

# S4 method for class 'VcfFile'
cleanVariants(
  var,
  ol_vars = "fail",
  which,
  ref_col = "REF",
  alt_col = "ALT",
  ...
)

Arguments

var

GRanges object containing the variants, or a VariantAnnotation::VcfFile

...

Not used

ol_vars

How to handle any overlapping variants. Can take values in c("fail", "none", "first", "last", "longest", "shortest"). The default, "fail", stops with an error. The other options drop all overlapping variants ("none"), select by genomic position ("first", "last"), or select by the scale of change to the genome ("longest", "shortest").

ref_col, alt_col

Column names corresponding to the reference and alternate alleles

which

Passed to VariantAnnotation::ScanVcfParam if working with a VcfFile

Value

A GRanges object with any incompatible variants removed; otherwise an error is raised. The mcols will contain the columns REF and ALT, unless otherwise specified, as character vectors.

Details

This function checks that a set of variants has the structure required by all downstream functions in transmogR. The primary change to the data structure is that both the REF and ALT columns are set as simple character vectors.

Given the complicated calls that variant callers can often produce, additional checks ensure that:

  • There are no overlapping variants

  • SNPs are all single nucleotides and not longer strings

  • SNPs are bi-allelic

  • Insertions contain a single nucleotide in the REF column

  • Deletions contain a single nucleotide in the ALT column

  • No missing values are present

  • All REF/ALT nucleotides conform to IUPAC codes
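
The per-variant checks in the list above can be sketched in base R. This is a minimal illustration only, not the implementation used by transmogR: the helper name check_alleles is hypothetical, and the permitted alphabet is assumed to be the upper-case IUPAC nucleotide codes ACGTMRWSYKVHDBN. Overlaps between variants are not considered here.

```r
# Hypothetical helper for illustration only; not part of transmogR
check_alleles <- function(ref, alt) {
  # Assumed alphabet: upper-case IUPAC nucleotide codes
  iupac <- "^[ACGTMRWSYKVHDBN]+$"
  # No missing values in either column
  not_missing <- !is.na(ref) & !is.na(alt)
  # A multi-allelic ALT such as "A,T" fails here because of the comma
  valid_codes <- grepl(iupac, ref) & grepl(iupac, alt)
  # SNPs, insertions and deletions all have a single nucleotide in REF or ALT
  single_base <- nchar(ref) == 1 | nchar(alt) == 1
  not_missing & valid_codes & single_base
}

ref <- c("GCC", "C", "AT", NA, "A")
alt <- c("G", "CTAT", "GC", "A", "A,T")
ok <- check_alleles(ref, alt)
ok
```

In this toy input only the first two variants (a deletion and an insertion) pass; the remaining three fail the single-nucleotide, missing-value and allowed-code checks respectively.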

Examples

library(GenomicRanges)
library(transmogR)
# Any conflicting variants will be removed
var <- GRanges(c("chr10:114468420-114468422", "chr10:114468422"))
var$REF <- c("GCC", "C")
var$ALT <- c("G", "CTAT")
var
#> GRanges object with 2 ranges and 2 metadata columns:
#>       seqnames              ranges strand |         REF         ALT
#>          <Rle>           <IRanges>  <Rle> | <character> <character>
#>   [1]    chr10 114468420-114468422      * |         GCC           G
#>   [2]    chr10           114468422      * |           C        CTAT
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## Taken from the 1000GP: the first variant would delete the C at 114468422,
## whilst the second variant begins an insertion at this position.
## These clearly conflict. The default value for ol_vars is to fail
## with an error (ol_vars = "fail"). However, both can be removed by setting
## ol_vars = "none". A warning will always be produced.
cleanVariants(var, ol_vars = "none")
#> Warning: 2 pairs of overlapping loci found and cannot be incorporated into a modified reference.
#> All overlapping variants will be removed
#> GRanges object with 0 ranges and 2 metadata columns:
#>    seqnames    ranges strand |         REF         ALT
#>       <Rle> <IRanges>  <Rle> | <character> <character>
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## Or the longest can be retained, along with multiple other options
cleanVariants(var, ol_vars = "longest")
#> Warning: 2 pairs of overlapping loci found and cannot be incorporated into a modified reference.
#> The longest overlapping locus by genomic change will be retained
#> GRanges object with 1 range and 2 metadata columns:
#>       seqnames    ranges strand |         REF         ALT
#>          <Rle> <IRanges>  <Rle> | <character> <character>
#>   [1]    chr10 114468422      * |           C        CTAT
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
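
The idea of flagging overlapping variants, as in the examples above, can be sketched in base R without any Bioconductor classes. This is a rough illustration under stated assumptions, not how cleanVariants works internally: all variants are assumed to sit on one chromosome, the third variant is made up, and the object names are hypothetical.

```r
# Toy coordinates on a single chromosome; the third variant is made up
start <- c(114468420, 114468422, 114468500)
end   <- c(114468422, 114468422, 114468500)

# Ranges i and j overlap when start[i] <= end[j] and start[j] <= end[i]
starts_before_end <- outer(start, end, "<=")
ol <- starts_before_end & t(starts_before_end)
diag(ol) <- FALSE  # ignore each variant overlapping itself

# TRUE for every variant that overlaps at least one other variant
has_overlap <- rowSums(ol) > 0

# Dropping every overlapping variant, in the spirit of ol_vars = "none"
kept <- which(!has_overlap)
```

Here has_overlap is TRUE for the first two variants and FALSE for the third, so only the third variant would be kept. For real data, range-aware tools such as GenomicRanges::findOverlaps are the more natural choice.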