R, GRanges objects and intersecting them

I discovered unpredicted behavior when intersecting GRanges objects; this is how to avoid the same problem I found.  Maybe my intuition wasn’t good, but this represents the a bad type of bug; a silent one.  If I had not spotted it, I would have lost several nonsynonymous KRAS mutations in my dataset.

If you want to represent a long list of genomic coordinates in R, you probably want to use a GRanges object.  It’s a very space and time efficient way of storing the data, even if it can be a little difficult to get accustomed to.  It’s ideal for representing bed files, aligned sequence data, SNV and indel coordinates, annotations and so forth.

In my case, I was intersecting two GRanges objects; I wanted to see which of my SNVs were within the boundaries of an exome capture kit.  In the example below, we use three exome capture regions (one from chromosome 1, exon 2 of KRAS and one from chromosome Y) and three consecutive SNVs i.e. SNVs that occur at immediately adjacent positions.  I could have just used exon 2 of KRAS, but I wanted to demonstrate the code can handle multiple locations on different chromosomes at once.

The SNVs could be in the same sample or different samples, it doesn’t matter; what matters is that their coordinates are contiguous.  The contiguous SNVs are collapsed into a single window (which is undesirable), then the regions are expanded back to single base-pair sites.  You of course can’t do this with indels of longer than 1bp.

We could have avoided this bug entirely if we’d worked in reverse and used setdiff() instead of intersect() to identify SNVs that don’t fall within the exome.

# Load packages
library(GenomicRanges)

## Create GRanges object of three exons
grexome=GRanges(seqnames=Rle(c(1,12,"Y")),
ranges=IRanges(c(13403,25398206,27010592),
end=c(13639,25398320,27010664)))

## Create GRanges object of three consecutive SNVs and
## two others on different chromosomes
grsnvs=GRanges(seqnames=Rle(c(1,rep(12,3),"Y")),
ranges=IRanges(c(13500,25398284,25398285,25398286,27010600),
end=c(13500,25398284,25398285,25398286,27010600)))

## Intersect the SNVs with the exome.
## You DO NOT get five hits back, but three.
## The 3 SNVs on chr12 are reported as a single, contiguous region
grhits=intersect(grsnvs,grexome)

## We expand the region back to the original, single-base sites
## and paste the appropriate chromosome to it, separated by a colon.
## Some duplication occurs, so we remove this using unique()
allpositions=as.data.frame(mapply(":",start(grhits),end(grhits)))
hits=as.vector(apply(allpositions,1,function(x){
paste0(as.vector(seqnames(grhits)),":",x)
}))
hits=sort(unique(hits))

## If you want a GRanges object again, just split the
## output back into its components and rebuild.
## Final line demonstrates that our results are identical to the input
chromosomes=sapply(strsplit(hits,":"),"[",1)
positions=as.numeric(sapply(strsplit(hits,":"),"[",2))
grsnvhits=GRanges(seqnames=Rle(chromosomes),ranges=IRanges(positions,end=positions))
identical(grsnvs,grsnvhits)

 

Leave a Reply

Your email address will not be published. Required fields are marked *