Use gene, transcript and exon annotations to define genomic regions
GRanges objects defining each level of annotation
Numeric vector defining upstream and/or downstream distances for a promoter. Passing a single value will define a symmetrical promoter. If two values are passed, the first represents the upstream range.
The distance from a TSS defining an upstream promoter
logical(1) Separate gene bodies into introns and exons. If intron = FALSE, gene bodies will not be subdivided and will be annotated only as gene bodies.
Distance from a gene to be considered a proximal intergenic region. If set to 0, no proximal/distal distinction is made and all intergenic regions are assigned to a single intergenic category.
Passed internally to reduceMC and setdiffMC
Column names to be retained from the supplied annotations
Not used
A GRangesList
Using GRanges annotated as genes, transcripts and exons this function will define ranges uniquely assigned to a region type using a hierarchical process. By default, these region types will be (in order) 1) Promoters, 2) Upstream Promoters, 3) Exons, 4) Introns, 5) Proximal Intergenic and 6) Distal Intergenic.
Setting intron = FALSE will replace introns and exons with a generic "Gene Body" annotation.
Setting proximal = 0 will assign all intergenic regions (not previously annotated as promoters or upstream promoters) to a single "Intergenic" category.
Notably, once a region has been defined, it is excluded from all subsequent candidate regions.
Any columns matching the names provided in cols will be returned, and it is assumed that the gene/transcript/exon ranges will contain informative columns in the mcols() element.
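The hierarchical process described above can be illustrated with a minimal sketch using only GenomicRanges. This is not the internal implementation of this function: it uses plain reduce() and setdiff() (which drop mcols) in place of reduceMC and setdiffMC, it does not split the gene body into exons and introns, and the distances (2500/500 for promoters, 5000 for upstream promoters) are assumptions chosen to reproduce the ranges printed in the examples on this page. The object names tr, gene, prom, up, gbody and inter are hypothetical.

```r
library(GenomicRanges)

## Toy inputs matching the examples on this page: one 50kb chromosome
## and two transcripts from a single gene
sq <- Seqinfo(seqnames = "chr1", seqlengths = 50000)
tr <- GRanges(c("chr1:20001-29950", "chr1:22001-30000"), seqinfo = sq)
gene <- range(tr)

## Step 1: promoters around each TSS (assumed distances: -2500/+500).
## Ranges on the "*" strand are treated as "+" by promoters()
prom <- reduce(promoters(tr, upstream = 2500, downstream = 500))

## Step 2: upstream promoters (assumed distance: 5000bp from each TSS),
## excluding anything already assigned as a promoter
up <- reduce(promoters(tr, upstream = 5000, downstream = 0))
up <- setdiff(up, prom)

## Step 3: the gene body, excluding everything assigned so far
gbody <- setdiff(gene, c(prom, up))

## Step 4: whatever remains on the chromosome is intergenic
claimed <- reduce(c(prom, up, gbody))
inter <- setdiff(GRanges("chr1:1-50000", seqinfo = sq), claimed)
```

Because each step removes the ranges already claimed by earlier steps, every position ends up in exactly one category, which is the behaviour described above.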
## Define two exons for two transcripts
sq <- Seqinfo(seqnames = "chr1", seqlengths = 50000)
e <- c("chr1:20001-21000", "chr1:29001-29950", "chr1:22001-23000", "chr1:29001-30000")
e <- GRanges(e, seqinfo = sq)
mcols(e) <- DataFrame(
  gene_id = "Gene1", transcript_id = paste0("Trans", c(1, 1, 2, 2))
)
## Define the transcript ranges
t <- unlist(endoapply(split(e, e$transcript_id), range))
t$gene_id <- "Gene1"
t$transcript_id <- names(t)
names(t) <- NULL
## Summarise to gene level
g <- reduceMC(t)
g$transcript_id <- NA_character_
## Now annotate the regions
regions <- defineRegions(genes = g, transcripts = t, exons = e)
sort(unlist(regions))
#> GRanges object with 9 ranges and 3 metadata columns:
#> seqnames ranges strand | region
#> <Rle> <IRanges> <Rle> | <character>
#> distal_intergenic chr1 1-10000 * | Intergenic (>10kb)
#> proximal_intergenic chr1 10001-15000 * | Intergenic (<10kb)
#> upstream_promoter chr1 15001-17500 * | Upstream Promoter (-..
#> promoter chr1 17501-22500 * | Promoter (-2500/+500)
#> exon chr1 22501-23000 * | Exon
#> intron chr1 23001-29000 * | Intron
#> exon chr1 29001-30000 * | Exon
#> proximal_intergenic chr1 30001-40000 * | Intergenic (<10kb)
#> distal_intergenic chr1 40001-50000 * | Intergenic (>10kb)
#> gene_id transcript_id
#> <CharacterList> <CharacterList>
#> distal_intergenic <NA> <NA>
#> proximal_intergenic Gene1 <NA>
#> upstream_promoter Gene1,Gene1 Trans1,Trans2
#> promoter Gene1,Gene1 Trans1,Trans2
#> exon Gene1 Trans2
#> intron Gene1 <NA>
#> exon Gene1,Gene1 Trans1,Trans2
#> proximal_intergenic Gene1 <NA>
#> distal_intergenic <NA> <NA>
#> -------
#> seqinfo: 1 sequence from an unspecified genome
## Alternatively, collapse gene body and intergenic ranges
regions <- defineRegions(
  genes = g, transcripts = t, exons = e, intron = FALSE, proximal = 0
)
sort(unlist(regions))
#> GRanges object with 5 ranges and 3 metadata columns:
#> seqnames ranges strand | region
#> <Rle> <IRanges> <Rle> | <character>
#> intergenic chr1 1-15000 * | Intergenic
#> upstream_promoter chr1 15001-17500 * | Upstream Promoter (-..
#> promoter chr1 17501-22500 * | Promoter (-2500/+500)
#> gene_body chr1 22501-30000 * | Gene Body
#> intergenic chr1 30001-50000 * | Intergenic
#> gene_id transcript_id
#> <CharacterList> <CharacterList>
#> intergenic <NA> <NA>
#> upstream_promoter Gene1,Gene1 Trans1,Trans2
#> promoter Gene1,Gene1 Trans1,Trans2
#> gene_body Gene1 <NA>
#> intergenic <NA> <NA>
#> -------
#> seqinfo: 1 sequence from an unspecified genome
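As a quick sanity check on output like the above, the returned ranges can be tested for the property described earlier: no two regions overlap and together they cover the whole chromosome. The sketch below is self-contained and rebuilds the five ranges by hand from the printed result rather than calling defineRegions. The object name printed and the type column are hypothetical, with values taken from the row names shown above.

```r
library(GenomicRanges)

## Ranges copied by hand from the second printed result above
sq <- Seqinfo(seqnames = "chr1", seqlengths = 50000)
printed <- GRanges(
  c(
    "chr1:1-15000", "chr1:15001-17500", "chr1:17501-22500",
    "chr1:22501-30000", "chr1:30001-50000"
  ),
  seqinfo = sq
)
printed$type <- c(
  "intergenic", "upstream_promoter", "promoter", "gene_body", "intergenic"
)

## TRUE when no two ranges overlap
is_disjoint <- isDisjoint(printed)

## Equals the chromosome length when there are no gaps between disjoint ranges
total_width <- sum(width(printed))

## Total number of bases assigned to each region type
width_by_type <- tapply(width(printed), printed$type, sum)
```

The same three checks should apply to sort(unlist(regions)) directly, using the names of the unlisted object in place of the hand-written type column.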