Map Genomic Ranges to genes using defined regulatory features

mapByFeature(
  gr,
  genes,
  prom,
  enh,
  gi,
  cols = c("gene_id", "gene_name", "symbol"),
  gr2prom = 0,
  gr2enh = 0,
  gr2gi = 0,
  gr2gene = 1e+05,
  prom2gene = 0,
  enh2gene = 1e+05,
  gi2gene = 0,
  ...
)

Arguments

gr

GRanges object with query ranges to be mapped to genes

genes

GRanges object containing genes (or any other nominal feature) to be assigned

prom

GRanges object defining promoters

enh

GRanges object defining enhancers

gi

GInteractions object defining interactions. Mappings from interactions to genes should be performed as a separate prior step.

cols

Column names to be assigned as mcols in the output. At a minimum, these columns must be present in genes. If all requested columns are found in any of prom, enh or gi, these pre-existing mappings will be used in preference. Any columns not found in the reference objects being used will be ignored.

gr2prom

The maximum permissible distance between a query range and any ranges defined as promoters

gr2enh

The maximum permissible distance between a query range and any ranges defined as enhancers

gr2gi

The maximum permissible distance between a query range and any ranges defined as GInteraction anchors

gr2gene

The maximum permissible distance between a query range and genes (for ranges not otherwise mapped)

prom2gene

The maximum permissible distance between a range provided in prom and a gene

enh2gene

The maximum permissible distance between a range provided in enh and a gene

gi2gene

The maximum permissible distance between a GInteractions anchor (provided in gi) and a gene

...

Passed to findOverlaps and overlapsAny internally

Value

A GRanges object with added mcols as specified

Details

This function uses feature-level information and long-range interactions to improve the mapping of regions to genes. If regulatory features are provided, ranges are mapped to genes using those features as a framework. The following sequential strategy is used:

  1. Ranges overlapping a promoter are assigned to the gene associated with that promoter

  2. Ranges overlapping an enhancer are assigned to all genes within a specified distance

  3. Ranges overlapping a long-range interaction are assigned to all genes connected by the interaction

  4. Ranges with no gene assignment from the previous steps are assigned to all overlapping genes or the nearest gene within a specified distance

If information is missing for one of these steps, the algorithm will simply proceed to the next step. If no promoter, enhancer or interaction data is provided, all ranges will simply be mapped using step 4. Ranges can be mapped by any or all of the first three steps, but step 4 is mutually exclusive with the first three steps.

Distances between each set of features and the query range can be individually specified by modifying the gr2prom, gr2enh, gr2gi or gr2gene parameters. Distances between features and genes can also be set using the parameters prom2gene, enh2gene and gi2gene.

Additionally, if previously defined mappings are included with any of the prom, enh or gi objects, these will be used in preference to any obtained from the genes object.
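The fallback described in step 4 (all overlapping genes, otherwise the nearest gene within a distance) can be illustrated with a minimal sketch built directly on GenomicRanges. This is not the package's implementation: map_step4() is a hypothetical helper written only for this illustration, the three query positions are made up, and strand is ignored as a simplifying assumption.

```r
library(GenomicRanges)

## Toy genes, matching those used in the Examples section
genes <- GRanges(c("chr1:2-10:*", "chr1:25-30:-", "chr1:31-40:+"))
genes$gene_id <- paste0("gene", seq_along(genes))

## Made-up query ranges: one inside gene1, one near gene3, one far from all
gr <- GRanges(paste0("chr1:", c(5, 45, 500)))

## Hypothetical helper (NOT part of the package): mimics the step 4 logic
map_step4 <- function(gr, genes, gr2gene = 25) {
  out <- rep(list(character(0)), length(gr))
  ## First assign every gene that directly overlaps a query range
  ol <- findOverlaps(gr, genes, ignore.strand = TRUE)
  ol_ids <- split(genes$gene_id[subjectHits(ol)], queryHits(ol))
  out[as.integer(names(ol_ids))] <- ol_ids
  ## For ranges still unmapped, take the nearest gene if it is close enough
  unmapped <- which(lengths(out) == 0)
  if (length(unmapped) > 0) {
    d <- distanceToNearest(gr[unmapped], genes, ignore.strand = TRUE)
    keep <- mcols(d)$distance <= gr2gene
    idx <- unmapped[queryHits(d)[keep]]
    out[idx] <- as.list(genes$gene_id[subjectHits(d)[keep]])
  }
  CharacterList(out)
}

res <- map_step4(gr, genes, gr2gene = 25)
res
```

Under these assumptions, a range with no overlapping gene and no gene within gr2gene is left with an empty assignment, which is why the result is held as a CharacterList rather than a plain character vector.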

Examples

## Define some genes
genes <- GRanges(c("chr1:2-10:*", "chr1:25-30:-", "chr1:31-40:+"))
genes$gene_id <- paste0("gene", seq_along(genes))
genes
#> GRanges object with 3 ranges and 1 metadata column:
#>       seqnames    ranges strand |     gene_id
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]     chr1      2-10      * |       gene1
#>   [2]     chr1     25-30      - |       gene2
#>   [3]     chr1     31-40      + |       gene3
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## Add a promoter for each gene
prom <- promoters(genes, upstream = 1, downstream = 1)
prom
#> GRanges object with 3 ranges and 1 metadata column:
#>       seqnames    ranges strand |     gene_id
#>          <Rle> <IRanges>  <Rle> | <character>
#>   [1]     chr1       1-2      * |       gene1
#>   [2]     chr1     30-31      - |       gene2
#>   [3]     chr1     30-31      + |       gene3
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## Some ranges to map
gr <- GRanges(paste0("chr1:", seq(0, 60, by = 15)))
gr
#> GRanges object with 5 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1         0      *
#>   [2]     chr1        15      *
#>   [3]     chr1        30      *
#>   [4]     chr1        45      *
#>   [5]     chr1        60      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Map each range to any overlapping genes, or otherwise to the nearest
## gene within 25bp
mapByFeature(gr, genes, gr2gene = 25)
#> GRanges object with 5 ranges and 1 metadata column:
#>       seqnames    ranges strand |         gene_id
#>          <Rle> <IRanges>  <Rle> | <CharacterList>
#>   [1]     chr1         0      * |           gene1
#>   [2]     chr1        15      * |           gene1
#>   [3]     chr1        30      * |           gene2
#>   [4]     chr1        45      * |           gene3
#>   [5]     chr1        60      * |           gene3
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Now use promoters to be more accurate in the gene assignment
## Given that the first range overlaps the promoter of gene1, this is a
## more targeted approach. Similarly for the third range
mapByFeature(gr, genes, prom, gr2gene = 25)
#> GRanges object with 5 ranges and 1 metadata column:
#>       seqnames    ranges strand |         gene_id
#>          <Rle> <IRanges>  <Rle> | <CharacterList>
#>   [1]     chr1         0      * |           gene1
#>   [2]     chr1        15      * |           gene1
#>   [3]     chr1        30      * |     gene2,gene3
#>   [4]     chr1        45      * |           gene3
#>   [5]     chr1        60      * |           gene3
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
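
Enhancers can be supplied in the same way as promoters. The following is a sketch only, assuming a made-up enhancer at chr1:43-47 and an arbitrary enh2gene of 10bp; neither value comes from the original examples. Following the strategy in Details, ranges overlapping the enhancer would be assigned to all genes within enh2gene of it. The objects are redefined so the block runs on its own.

```r
library(extraChIPs)
library(GenomicRanges)

## Same genes, promoters and query ranges as above
genes <- GRanges(c("chr1:2-10:*", "chr1:25-30:-", "chr1:31-40:+"))
genes$gene_id <- paste0("gene", seq_along(genes))
prom <- promoters(genes, upstream = 1, downstream = 1)
gr <- GRanges(paste0("chr1:", seq(0, 60, by = 15)))

## A hypothetical enhancer, invented purely for illustration
enh <- GRanges("chr1:43-47")

## Map using promoters and the enhancer, limiting enhancer-to-gene distance
mapped <- mapByFeature(
  gr, genes, prom = prom, enh = enh, gr2gene = 25, enh2gene = 10
)
mapped
```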