Define Genomic Regions Based on Gene Annotations

Use gene, transcript and exon annotations to define genomic regions

defineRegions(
  genes,
  transcripts,
  exons,
  promoter = c(2500, 500),
  upstream = 5000,
  intron = TRUE,
  proximal = 10000,
  simplify = FALSE,
  cols = c("gene_id", "gene_name", "transcript_id", "transcript_name"),
  ...
)

Arguments

genes, transcripts, exons: GRanges objects defining each level of annotation
promoter: Numeric vector defining the upstream and/or downstream distances for a promoter. Passing a single value will define a symmetrical promoter. If two values are passed, the first represents the upstream range
upstream: The distance upstream of a TSS used to define an upstream promoter
intron: logical(1). Separate gene bodies into introns and exons. If intron = FALSE, gene bodies will not be subdivided and will simply be annotated as gene bodies
proximal: Distance from a gene to be considered a proximal intergenic region. If set to 0, intergenic regions will simply be considered as uniformly intergenic
simplify: Passed internally to reduceMC and setdiffMC
cols: Column names to be retained from the supplied annotations
...: Not used
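
As a rough illustration of how the promoter distances map onto coordinates, the base R sketch below computes the window around a single TSS. The helper promoter_window() is hypothetical and not part of this function's interface. It assumes a plus-strand TSS and that the window ends at TSS + downstream - 1, which is consistent with the ranges printed in the Examples.

```r
# Hypothetical helper for illustration only (not part of the documented API).
# Assumes a plus-strand TSS and an end coordinate of TSS + downstream - 1.
promoter_window <- function(tss, promoter = c(2500, 500)) {
  # A single value is recycled, giving a symmetrical window
  promoter <- rep_len(promoter, 2)
  c(start = tss - promoter[[1]], end = tss + promoter[[2]] - 1)
}

# The two TSS positions used in the Examples
promoter_window(20001)
promoter_window(22001)
```

Under these assumptions the two windows are 17501-20500 and 19501-22500, which merge into the single promoter range 17501-22500 shown in the Examples.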

Value

A GRangesList, with one element for each region type

Details

Using GRanges annotated as genes, transcripts and exons, this function will define ranges uniquely assigned to a region type using a hierarchical process. By default, these region types will be (in order) 1) Promoters, 2) Upstream Promoters, 3) Exons, 4) Introns, 5) Proximal Intergenic and 6) Distal Intergenic.

Setting intron = FALSE will replace introns and exons with a generic "Gene Body" annotation. Setting proximal = 0 will assign all intergenic regions (not previously annotated as promoters or upstream promoters) to a single "Intergenic" category.

Notably, once a region has been defined, it is excluded from all subsequent candidate regions.
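
The exclusion rule can be sketched in base R by labelling every position on a toy chromosome and letting earlier levels claim positions first. This is only an illustration of the idea, not the function's internal implementation (which works on GRanges). The candidate intervals are assumed values chosen to mirror the Examples, with introns and exons collapsed into a single gene body.

```r
# Toy 50 kb chromosome; candidate intervals are illustrative assumptions,
# listed in the order in which they are allowed to claim positions.
chrom_len <- 50000
candidates <- list(
  promoter          = c(17501, 22500),
  upstream_promoter = c(15001, 20000),
  gene_body         = c(20001, 30000),
  proximal          = c(10001, 40000)
)

label <- rep("distal", chrom_len)   # positions never claimed remain distal
claimed <- rep(FALSE, chrom_len)
for (nm in names(candidates)) {
  pos <- seq(candidates[[nm]][1], candidates[[nm]][2])
  pos <- pos[!claimed[pos]]         # drop positions taken by an earlier level
  label[pos] <- nm
  claimed[pos] <- TRUE
}

# Collapse runs of identical labels back into start/end ranges
runs <- rle(label)
run_end <- cumsum(runs$lengths)
out <- data.frame(
  region = runs$values,
  start  = run_end - runs$lengths + 1,
  end    = run_end
)
out
```

Because the promoter is processed first, the upstream promoter candidate is trimmed to 15001-17500 and the gene body to 22501-30000, so every position ends up with exactly one label.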

Any columns matching the names provided in cols will be returned, and it is assumed that the gene/transcript/exon ranges will contain informative columns in the mcols() element.

Examples


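## Assumed setup: GenomicRanges supplies GRanges(), Seqinfo() and mcols(), and
## extraChIPs is assumed to be the package providing defineRegions() and reduceMC()
library(GenomicRanges)
library(extraChIPs)

## Define two exons for each of two transcripts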
sq <- Seqinfo(seqnames = "chr1", seqlengths = 50000)
e <- c("chr1:20001-21000", "chr1:29001-29950", "chr1:22001-23000", "chr1:29001-30000")
e <- GRanges(e, seqinfo = sq)
mcols(e) <- DataFrame(
  gene_id = "Gene1", transcript_id = paste0("Trans", c(1, 1, 2, 2))
)

## Define the transcript ranges
t <- unlist(endoapply(split(e, e$transcript_id), range))
t$gene_id <- "Gene1"
t$transcript_id <- names(t)
names(t) <- NULL

## Summarise to gene level
g <- reduceMC(t)
g$transcript_id <- NA_character_

## Now annotate the regions
regions <- defineRegions(genes = g, transcripts = t, exons = e)
sort(unlist(regions))
#> GRanges object with 9 ranges and 3 metadata columns:
#>                       seqnames      ranges strand |                 region
#>                          <Rle>   <IRanges>  <Rle> |            <character>
#>     distal_intergenic     chr1     1-10000      * |     Intergenic (>10kb)
#>   proximal_intergenic     chr1 10001-15000      * |     Intergenic (<10kb)
#>     upstream_promoter     chr1 15001-17500      * | Upstream Promoter (-..
#>              promoter     chr1 17501-22500      * |  Promoter (-2500/+500)
#>                  exon     chr1 22501-23000      * |                   Exon
#>                intron     chr1 23001-29000      * |                 Intron
#>                  exon     chr1 29001-30000      * |                   Exon
#>   proximal_intergenic     chr1 30001-40000      * |     Intergenic (<10kb)
#>     distal_intergenic     chr1 40001-50000      * |     Intergenic (>10kb)
#>                               gene_id   transcript_id
#>                       <CharacterList> <CharacterList>
#>     distal_intergenic            <NA>            <NA>
#>   proximal_intergenic           Gene1            <NA>
#>     upstream_promoter     Gene1,Gene1   Trans1,Trans2
#>              promoter     Gene1,Gene1   Trans1,Trans2
#>                  exon           Gene1          Trans2
#>                intron           Gene1            <NA>
#>                  exon     Gene1,Gene1   Trans1,Trans2
#>   proximal_intergenic           Gene1            <NA>
#>     distal_intergenic            <NA>            <NA>
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome

## Alternatively, collapse gene body and intergenic ranges
regions <- defineRegions(
  genes = g, transcripts = t, exons = e, intron = FALSE, proximal = 0
)
sort(unlist(regions))
#> GRanges object with 5 ranges and 3 metadata columns:
#>                     seqnames      ranges strand |                 region
#>                        <Rle>   <IRanges>  <Rle> |            <character>
#>          intergenic     chr1     1-15000      * |             Intergenic
#>   upstream_promoter     chr1 15001-17500      * | Upstream Promoter (-..
#>            promoter     chr1 17501-22500      * |  Promoter (-2500/+500)
#>           gene_body     chr1 22501-30000      * |              Gene Body
#>          intergenic     chr1 30001-50000      * |             Intergenic
#>                             gene_id   transcript_id
#>                     <CharacterList> <CharacterList>
#>          intergenic            <NA>            <NA>
#>   upstream_promoter     Gene1,Gene1   Trans1,Trans2
#>            promoter     Gene1,Gene1   Trans1,Trans2
#>           gene_body           Gene1            <NA>
#>          intergenic            <NA>            <NA>
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome
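
As a quick sanity check on the claim that ranges are uniquely assigned, the base R sketch below re-enters the coordinates from the second printed result by hand and confirms that they tile the chromosome without gaps or overlaps. The data frame name collapsed is arbitrary, and nothing here calls the package itself.

```r
# Coordinates copied by hand from the second printed result
collapsed <- data.frame(
  region = c("intergenic", "upstream_promoter", "promoter",
             "gene_body", "intergenic"),
  start  = c(1, 15001, 17501, 22501, 30001),
  end    = c(15000, 17500, 22500, 30000, 50000)
)
collapsed$width <- collapsed$end - collapsed$start + 1

# Each range should start immediately after the previous one ends
contiguous <- all(collapsed$start[-1] == collapsed$end[-nrow(collapsed)] + 1)
contiguous

# Widths should add up to the 50000 bp declared in the Seqinfo
total_width <- sum(collapsed$width)
total_width

# Proportion of the chromosome assigned to each region type
prop <- tapply(collapsed$width, collapsed$region, sum) / total_width
prop
```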