Use gene, transcript and exon annotations to define genomic regions

defineRegions(
  genes,
  transcripts,
  exons,
  promoter = c(2500, 500),
  upstream = 5000,
  intron = TRUE,
  proximal = 10000,
  simplify = FALSE,
  cols = c("gene_id", "gene_name", "transcript_id", "transcript_name"),
  ...
)

Arguments

genes, transcripts, exons

GRanges objects defining each level of annotation

promoter

Numeric vector defining upstream and/or downstream distances for a promoter. The first value represents the upstream range. Passing a single value will define a symmetrical promoter.

upstream

The distance from a TSS defining an upstream promoter

intron

logical(1). Separate gene bodies into introns and exons. If intron = FALSE, gene bodies will not be split and will be returned under a single "Gene Body" annotation

proximal

Distance from a gene to be considered a proximal intergenic region. If set to 0, intergenic regions will simply be considered as uniformly intergenic

simplify

Passed internally to reduceMC and setdiffMC

cols

Column names to be retained from the supplied annotations

...

Not used

Value

A GRangesList

Details

Using GRanges annotated as genes, transcripts and exons this function will define ranges uniquely assigned to a region type using a hierarchical process. By default, these region types will be (in order) 1) Promoters, 2) Upstream Promoters, 3) Exons, 4) Introns, 5) Proximal Intergenic and 6) Distal Intergenic.

Setting intron = FALSE will replace introns and exons with a generic "Gene Body" annotation. Setting proximal = 0 will assign all intergenic regions (not previously annotated as promoters or upstream promoters) to a single "Intergenic" category.

Notably, once a region has been defined, it is excluded from all subsequent candidate regions.
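
The exclusion rule can be illustrated with a minimal base R sketch that treats a single 50kb chromosome as a vector of integer positions. This is not the package's implementation: the helper `window()`, the object names, and the assumption that a promoter spans TSS - upstream to TSS + downstream - 1 on the forward strand are all illustrative. Each region type claims only positions that no earlier type has claimed.

```r
## Illustrative sketch only: integer positions stand in for GRanges
chrom <- seq_len(50000)
tss <- c(20001, 22001)  # two transcript start sites, treated as + strand
gene <- 20001:30000
exon_pos <- c(20001:21000, 22001:23000, 29001:30000)

## Hypothetical helper: all positions within a window around each TSS
window <- function(tss, up, down) {
  unique(unlist(lapply(tss, function(x) seq(x - up, x + down - 1))))
}

## Candidate positions for each region type, listed in priority order
candidates <- list(
  promoter = window(tss, 2500, 500),
  upstream_promoter = window(tss, 5000, 0),
  exon = exon_pos,
  intron = gene,
  proximal_intergenic = seq(min(gene) - 10000, max(gene) + 10000),
  distal_intergenic = chrom
)

## Assign each position to the first region type that claims it
region <- rep(NA_character_, length(chrom))
for (type in names(candidates)) {
  pos <- candidates[[type]]
  pos <- pos[pos >= 1 & pos <= length(chrom)]  # stay on the chromosome
  unclaimed <- pos[is.na(region[pos])]         # skip anything already assigned
  region[unclaimed] <- type
}

## Collapse consecutive positions of the same type into start/end segments
runs <- rle(region)
ends <- cumsum(runs$lengths)
segments <- data.frame(
  region = runs$values, start = ends - runs$lengths + 1, end = ends
)
segments
```

Because later types only receive unclaimed positions, the exon at 20001-21000 is absorbed into the promoter, and the intron is whatever remains of the gene once promoters and exons have been removed.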

Any columns matching the names provided in cols will be returned, and it is assumed that the gene/transcript/exon ranges will contain informative columns in the mcols() element.

Examples


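## Assumes the extraChIPs package, which provides defineRegions() and
## reduceMC(), is installed
library(extraChIPs)

## Define two exons for two transcripts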
sq <- Seqinfo(seqnames = "chr1", seqlengths = 50000)
e <- c("chr1:20001-21000", "chr1:29001-29950", "chr1:22001-23000", "chr1:29001-30000")
e <- GRanges(e, seqinfo = sq)
mcols(e) <- DataFrame(
  gene_id = "Gene1", transcript_id = paste0("Trans", c(1, 1, 2, 2))
)

## Define the transcript ranges
t <- unlist(endoapply(split(e, e$transcript_id), range))
t$gene_id <- "Gene1"
t$transcript_id <- names(t)
names(t) <- NULL

## Summarise to gene level
g <- reduceMC(t)
g$transcript_id <- NA_character_

## Now annotate the regions
regions <- defineRegions(genes = g, transcripts = t, exons = e)
sort(unlist(regions))
#> GRanges object with 9 ranges and 3 metadata columns:
#>                       seqnames      ranges strand |                 region
#>                          <Rle>   <IRanges>  <Rle> |            <character>
#>     distal_intergenic     chr1     1-10000      * |     Intergenic (>10kb)
#>   proximal_intergenic     chr1 10001-15000      * |     Intergenic (<10kb)
#>     upstream_promoter     chr1 15001-17500      * | Upstream Promoter (-..
#>              promoter     chr1 17501-22500      * |  Promoter (-2500/+500)
#>                  exon     chr1 22501-23000      * |                   Exon
#>                intron     chr1 23001-29000      * |                 Intron
#>                  exon     chr1 29001-30000      * |                   Exon
#>   proximal_intergenic     chr1 30001-40000      * |     Intergenic (<10kb)
#>     distal_intergenic     chr1 40001-50000      * |     Intergenic (>10kb)
#>                               gene_id   transcript_id
#>                       <CharacterList> <CharacterList>
#>     distal_intergenic            <NA>            <NA>
#>   proximal_intergenic           Gene1            <NA>
#>     upstream_promoter     Gene1,Gene1   Trans1,Trans2
#>              promoter     Gene1,Gene1   Trans1,Trans2
#>                  exon           Gene1          Trans2
#>                intron           Gene1            <NA>
#>                  exon     Gene1,Gene1   Trans1,Trans2
#>   proximal_intergenic           Gene1            <NA>
#>     distal_intergenic            <NA>            <NA>
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome

## Alternatively, collapse gene body and intergenic ranges
regions <- defineRegions(
  genes = g, transcripts = t, exons = e, intron = FALSE, proximal = 0
)
sort(unlist(regions))
#> GRanges object with 5 ranges and 3 metadata columns:
#>                     seqnames      ranges strand |                 region
#>                        <Rle>   <IRanges>  <Rle> |            <character>
#>          intergenic     chr1     1-15000      * |             Intergenic
#>   upstream_promoter     chr1 15001-17500      * | Upstream Promoter (-..
#>            promoter     chr1 17501-22500      * |  Promoter (-2500/+500)
#>           gene_body     chr1 22501-30000      * |              Gene Body
#>          intergenic     chr1 30001-50000      * |             Intergenic
#>                             gene_id   transcript_id
#>                     <CharacterList> <CharacterList>
#>          intergenic            <NA>            <NA>
#>   upstream_promoter     Gene1,Gene1   Trans1,Trans2
#>            promoter     Gene1,Gene1   Trans1,Trans2
#>           gene_body           Gene1            <NA>
#>          intergenic            <NA>            <NA>
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome