Legoplots in R (3D barplots in R)

I previously mentioned these, but I thought about it some more and went ahead and wrote an R implementation.  With very little work you could rejig this code to display all kinds of 3D bar plots; I’ve commented the code thoroughly so it should be easy to follow.  Immediately below is an example plot and below that is a gallery comparing an original plot by the Broad Institute with two versions produced by my code.  I wouldn’t want to slavishly copy, so I’ve added an optional transparency effect that is both useful and attractive.

My code with transparency
All 96 different mutation/context combinations, with added transparency

 

 

Arguments

  • z : a named vector of length 96 representing each of the 6 possible mutations in 16 different sequence contexts.  Note that the names are hard-coded and can be found in the code below.  Values represent the height of each column
  • alpha : transparency of column sides.  Top is always opaque.  Default is 1 (opaque)
  • scalexy : value to scale the area of all columns.  Small values lead to skinny columns, larger values lead to fatter columns.  Default is 10
  • scalez : value to scale the height of all columns; see notes below.  Default is 1
  • gap : a value determining the space between each column.  Default is 0.2

Things to note

  • It was essential to include the lit=FALSE argument when drawing surfaces, as the default lighting makes everything look very shiny and pretty ugly.  I also had to change the field of view (FOV) to 0, as the default value gives a slightly wall-eyed view.
  • The scalexy argument scales both the “fatness” of the columns and the gaps between them.
  • Scaling z using the scalez argument is useful if your data is very flat and you want to emphasize it.  It will have knock-on effects on the scale that appears at the side, so be careful.  If you’re going to transform your data, it’s probably best to do so before putting it into this function.
  • The alpha transparency I added to the column sides is very useful for viewing columns that are obstructed by other columns and has the added benefit of looking pretty cool.
  • I haven’t included the wireframe axis background, I leave that as an exercise for the reader, or maybe I’ll add it later.

Code is available below and example data is included if you check out my GitHub here.

## This script creates a "legoplot" similar to those produced by the Broad Institute
## The plot shows the relative abundance of each of the 6 possible mutations in the
## 16 sequence contexts

## Load packages
library(rgl)

#### START OF FUNCTIONS

## Functions modified from the "demo(hist3d)" examples in the rgl package:
# library(rgl)
# demo(hist3d)
## Note; would it have killed the original author to comment their code?

## Draws a single "column" or "stack".
## X and Y coordinates determine the area of the column
## The Z coordinate determines the height of the column
## We include "lit=FALSE" arguments to remove the nasty shiny surfaces caused by lighting
stackplot.3d<-function(x,y,z,alpha=1,topcol="#078E53",sidecol="#aaaaaa"){

## These lines allow the active rgl device to be updated with multiple changes
## This is necessary to draw the sides and ends of the column separately
save on.exit(par3d(save))

## Determine the coordinates of each surface of the column and its edges
x1=c(rep(c(x[1],x[2],x[2],x[1]),3),rep(x[1],4),rep(x[2],4))
z1=c(rep(0,4),rep(c(0,0,z,z),4))
y1=c(y[1],y[1],y[2],y[2],rep(y[1],4),rep(y[2],4),rep(c(y[1],y[2],y[2],y[1]),2))
x2=c(rep(c(x[1],x[1],x[2],x[2]),2),rep(c(x[1],x[2],rep(x[1],3),rep(x[2],3)),2))
z2=c(rep(c(0,z),4),rep(0,8),rep(z,8) )
y2=c(rep(y[1],4),rep(y[2],4),rep(c(rep(y[1],3),rep(y[2],3),y[1],y[2]),2) )

## These lines create the sides of the column and its coloured top surface
rgl.quads(x1,z1,y1,col=rep(sidecol,each=4),alpha=alpha,lit=FALSE)
rgl.quads(c(x[1],x[2],x[2],x[1]),rep(z,4),c(y[1],y[1],y[2],y[2]),
col=rep(topcol,each=4),alpha=1,lit=FALSE)
## This line adds black edges to the column
rgl.lines(x2,z2,y2,col="#000000",lit=FALSE)
}
# Example:
# stackplot.3d(c(0,1),c(0,1),3,alpha=0.6)

## Calls stackplot.3d repeatedly to create a barplot
## z is the heights of the columns and must be an appropriately named vector
context3d<-function(z,alpha=1,scalexy=10,scalez=1,gap=0.2){
## These lines allow the active rgl device to be updated with multiple changes
## This is necessary to add each column sequentially
save on.exit(par3d(save))

## Recreate Broad order
types=c("C.G.G.C","T.A.A.T","C.A.G.T","T.G.A.C","C.T.G.A","T.C.A.G")
contexts=c("TxT","CxT","AxT","GxT","TxC","CxC","AxC","GxC",
"TxA","CxA","AxA","GxA","TxG","CxG","AxG","GxG")
typeorder=c()
for(type in types){
typeorder=c(typeorder,paste(type,contexts,sep="_"))
}
z=z[typeorder]

## Reorder data into 6 regions
set1=c(1:4,17:20,5:8,21:24,9:12,25:28,13:16,29:32)
set2=set1+32
set3=set1+64
neworder=c(set1,set2,set3)

## Define dimensions of the plot
dimensions=c(12,8)

## Scale column area and the gap between columns
y=seq(1,dimensions[1])*scalexy
x=seq(1,dimensions[2])*scalexy
gap=gap*scalexy

## Scale z coordinate
z=z*scalez

## Set up colour palette
broadcolors=c("#805D3F","#72549A","#5EAFB2","#3F4F9D","#F2EC3C","#74B655")
colors=as.vector(sapply(broadcolors,rep,16))

## Plot each of the columns
for(i in 1:dimensions[1]){
for(j in 1:dimensions[2]){
it=(i-1)*dimensions[2]+j # Variable to work out which column to plot; counts from 1:96
stackplot.3d(c(gap+x[j],x[j]+scalexy),
c(-gap-y[i],-y[i]-scalexy),
z[neworder[it]],
alpha=alpha,
topcol=colors[neworder[it]],
sidecol=colors[neworder[it]])
}
}
## Set the viewpoint and add axes and labels
rgl.viewpoint(theta=50,phi=40,fov=0)
axes3d("y-+",labels=TRUE)
}
# Example:
# context3d(counts)

#### END OF FUNCTIONS

## Read in example data and cast to an appropriate vector
rawdata=read.table("snvspermegabase.txt",header=TRUE)
counts=as.numeric(rawdata)
names(counts)=colnames(rawdata)

## Example plots
context3d(counts)
context3d(counts,alpha=0.4)

## Save your images to files if you wish
rgl.snapshot(filename="example.png")

R, GRanges objects and intersecting them

I discovered unpredicted behavior when intersecting GRanges objects; this is how to avoid the same problem I found.  Maybe my intuition wasn’t good, but this represents the a bad type of bug; a silent one.  If I had not spotted it, I would have lost several nonsynonymous KRAS mutations in my dataset.

If you want to represent a long list of genomic coordinates in R, you probably want to use a GRanges object.  It’s a very space and time efficient way of storing the data, even if it can be a little difficult to get accustomed to.  It’s ideal for representing bed files, aligned sequence data, SNV and indel coordinates, annotations and so forth.

In my case, I was intersecting two GRanges objects; I wanted to see which of my SNVs were within the boundaries of an exome capture kit.  In the example below, we use three exome capture regions (one from chromosome 1, exon 2 of KRAS and one from chromosome Y) and three consecutive SNVs i.e. SNVs that occur at immediately adjacent positions.  I could have just used exon 2 of KRAS, but I wanted to demonstrate the code can handle multiple locations on different chromosomes at once.

The SNVs could be in the same sample or different samples, it doesn’t matter; what matters is that their coordinates are contiguous.  The contiguous SNVs are collapsed into a single window (which is undesirable), then the regions are expanded back to single base-pair sites.  You of course can’t do this with indels of longer than 1bp.

We could have avoided this bug entirely if we’d worked in reverse and used setdiff() instead of intersect() to identify SNVs that don’t fall within the exome.

# Load packages
library(GenomicRanges)

## Create GRanges object of three exons
grexome=GRanges(seqnames=Rle(c(1,12,"Y")),
ranges=IRanges(c(13403,25398206,27010592),
end=c(13639,25398320,27010664)))

## Create GRanges object of three consecutive SNVs and
## two others on different chromosomes
grsnvs=GRanges(seqnames=Rle(c(1,rep(12,3),"Y")),
ranges=IRanges(c(13500,25398284,25398285,25398286,27010600),
end=c(13500,25398284,25398285,25398286,27010600)))

## Intersect the SNVs with the exome.
## You DO NOT get five hits back, but three.
## The 3 SNVs on chr12 are reported as a single, contiguous region
grhits=intersect(grsnvs,grexome)

## We expand the region back to the original, single-base sites
## and paste the appropriate chromosome to it, separated by a colon.
## Some duplication occurs, so we remove this using unique()
allpositions=as.data.frame(mapply(":",start(grhits),end(grhits)))
hits=as.vector(apply(allpositions,1,function(x){
paste0(as.vector(seqnames(grhits)),":",x)
}))
hits=sort(unique(hits))

## If you want a GRanges object again, just split the
## output back into its components and rebuild.
## Final line demonstrates that our results are identical to the input
chromosomes=sapply(strsplit(hits,":"),"[",1)
positions=as.numeric(sapply(strsplit(hits,":"),"[",2))
grsnvhits=GRanges(seqnames=Rle(chromosomes),ranges=IRanges(positions,end=positions))
identical(grsnvs,grsnvhits)

 

Understanding BAM files; part 2

Now that you have the basics after part 1, let’s look at something more complicated; the second column of the main body of BAM files, the bitwise flag.

This isn’t simple.  Both the samtools manual and even the SAM specification ostensibly explain what the flag is, but do so in a really opaque way, a sort of “if you have to ask, you’ll never know” situation.  The bitwise flag looks like an integer, but it’s not.  You might see “99” or “147” and think it’s a score of some kind, but it’s actually a binary encoding of a hexadecimal value.  The data stored in this flag is crucial, including things like if the read is a duplicate or doesn’t pass quality controls.

The primary effect that this has is that you can kiss goodbye to doing anything complex  with BAM files using standard command-line text parsing tools.  If you want to really understand what is encoded in your BAM files, you’re going to have to get serious and become familiar with the samtools API, or some interface to it.  More on that in the next part.

If you want a specific flag explained, you can use this handy calculator.  But where are those values coming from?

With the SAM bitwise flag, there are 12 properties, which are TRUE or FALSE, 1 or 0.  This gives you a string of 12 digits, which can either be 1 or 0.  This 12-digit binary encoding of a hexadecimal number is converted to a decimal number, as it takes up less space.  Below is a summary; the first value is the binary hexadecimal, the second is the decimal and the text is what it represents:

  1. 000000000001 : 1 : read paired
  2. 000000000010 : 2 : read mapped in proper pair
  3. 000000000100 : 4 : read unmapped
  4. 000000001000 : 8 : mate unmapped
  5. 000000010000 : 16 : read reverse strand
  6. 000000100000 : 32 : mate reverse strand
  7. 000001000000 : 64 : first in pair
  8. 000010000000 : 128 : second in pair
  9. 000100000000 : 256 : not primary alignment
  10. 001000000000 : 512 : read fails platform/vendor quality checks
  11. 010000000000 : 1024 : read is PCR or optical duplicate
  12. 100000000000 : 2048 : supplementary alignment

To arrive at the final value, we just add the TRUE values together; you can see how easy this is when the binary hexadecimal values are stacked vertically.  Any FALSE values are ignored, as they are zero.

For example; a read that is paired (1), in a proper pair (2), with a mate on the reverse strand (32) and is the first in a pair (64) has a flag of 99 because:

1+2+32+64=99

The corresponding read in the pair would be a read that is paired (1), in a proper pair (2), is on the reverse strand (16) and is the second in a pair (128) has a flag of 147 because:

1+2+16+128=147

You should see a lot of pairs of reads with 99 and 147, and also 83 and 163, which is the same but with the reads in the opposite orientation.  I would advise against trying to parse SAM flags as decimal values and adopt a nice programmatic way of accessing them, which I’ll write about next time.

Understanding BAM files; part 1

SAM and BAM files are considerably more complex than they first look.  The SAM format specification hides the complexity well and it is easy to deceive yourself into doing the wrong thing.  That document is worth reading, but I will try to simplify it further here.  SAM files (Sequence Alignment Map) contain short DNA sequences aligned to a reference genome.  BAM files are simply the binary version, which means that they are compressed and about 1/3rd the size.  A SAM file is just a text file and can be read by anything; a BAM file can only be accessed by using particular software; e.g. samtools.

There are two parts to a SAM file; the header and the body.  The header contains information about the file, usually the following fields;

  1.  @HD line tells you the version number of SAM specification and whether or not the file is sorted.  Sorting by genomic coordinates is the most common type
  2. @SQ lines tell you which reference sequences have been used to align the sequence again.  There will be one line for every contig (e.g. a chromosome), which the sequence name and length (e.g. SN:1 LN:249250621)
  3. @RG line.  The Read Group line contains tags detailing the origins of the input data
  4. @PG lines tell you programs and commands used to create the file

SAM headers are not immutable, so don’t put too must trust in them.  You can see the header of a BAM file by using this samtools command:

samtools view -H bamfile.bam

The body of the file is where the aligned data is stored.  There is one row for every read.  Therefore, if you’re looking at paired end data (e.g. Illumina sequencing data) there will be two rows for every pair; one for the forward read and one for the reverse.

There are at least eleven columns in each row.  There may also be further tags appended at the end of these to denote extra data, like mapping scores, RG tags and so forth.

  1.  QNAME: the name of the query sequence e.g. a read name like HWI-ST886:247:D2G2MACXX:3:2102:4127:4699
  2. FLAG: a bitwise flag.  You’ll want to read Understanding BAM files; part 2 for an explanation
  3. RNAME: the name of the reference contig the sequence is aligned to e.g. 1 (chromosome 1)
  4. POS: the position on the reference contig that the alignment starts at e.g. 123456 (in this case, the 123,456th base of chromosome 1)
  5. MAPQ: the mapping quality of the read; maximum value is usually 60.  This number is Phred scaled, meaning that they’re logarithmically scaled.  For example:
    1. MAPQ of 10 means a 1/10 chance that the mapping is wrong
    2. MAPQ of 20 means a 1/100 chance that the mapping is wron
    3. MAPQ of 60 means a 1/1,000,000 chance that the mapping is wrong
  6. CIGAR: this string tells us how to match the query sequence to the reference sequence.  In the simplest case, say the read is 100 bases along and matches perfectly; CIGAR would be 100M.  However, you can encode mismatches, insertions and deletions and it quickly becomes very complicated
  7. RNEXT: the name of the reference contig that the other read in the pair aligns to.  If identical to this read, an equals sign (“=”) is used
  8. PNEXT: the position on the reference contig that the other read in the pair aligns to.  This depends on the size of the DNA fragments in the sequencing library and will usually be a few hundred base pairs away (see TLEN, below)
  9. TLEN: the length of the template fragment.  This is the distance from the leftmost base of the first read to the rightmost base of the second read of a pair.  This value is signed (+/-), so you can infer the orientation of the reads, too
  10. SEQ: the DNA sequence of the query sequence.  This will be identical to the sequence in the FASTQ file that was aligned to the reference genome
  11. QUAL: the corresponding base quality scores of the SEQ.  This will be identical to the scores in the original FASTQ file.  Note that it’s encoded as ASCII+33, so all the letters and punctuation symbols actually correspond to regular integers

Next time, we’ll look at the bitwise flag in more detail, as this is a major source of confusion.

MuTect – a pleasant surprise

I was sceptical (not a typo, I’m British).  Who wouldn’t be?  I’d invested literally hundreds of hours into my sequencing pipeline which depended almost entirely around GATK and even recently re-tooled to accommodate the sweeping changes brought by GATK 2.0 and it’s great, truly.  However….

I had been enviously eyeing up MuTect for quite some time, as it was always built from the ground-up to serve one of the purposes I am currently knee-deep in; finding mutations in tumour-normal pairs.  To be clear, I mean comparing sequencing data from a tumour samples with a normal sample from the same patient.  On the other hand, I gather GATK has its roots in GWAS studies and has always been more population-focussed.

The problem we’re all wrestling with currently – which, by it’s very nature, is never going away – is intraclonal heterogeneity, the idea that cancers are not monocultures of billions of copies of the same cell, but are highly heterogeneous mixes of potentially thousands (millions?) of individual clones, all of which are descended from (probably) one originator cell.  These clones compete with one another for resources in their niche and are effectively locked in an arms-race with one another, as well as with the host body.  Cancer patients are not facing a single entity, but a legion of cells that reinvent, mutate and transfigure themselves in the best Darwinian tradition.

As it stands, sequencing data isn’t a great read-out, as it’s produced from DNA pooled from thousands of tumour cells.  Even if you use a low-starting-amount protocol and had 50 ng of tumour DNA with no amplification steps, that’s still:

50 ng / 6 pg (DNA content of a single diploid cell) = >8000 cells

I guess if these cells are from a monocellular and therefore easier to purify population of cells (as is the case for myeloma – you can isolate tumour cells by selecting for CD138+ cells), that’s not terrible news; but what if your sample is a solid tumour?  Who knows what kind of structural heterogeneity you need to contend with?

Still, your sample is “bulk DNA” and is a sampling of potentially thousands of separate genomes compressed into a single readout.  I’m sure I’ll post more on the subject at a later date, but for now let’s say that it’s a definite issue.

In a homogeneous population of cells, SNVs (single nucleotide variants) will be present in all cells, but this will not be the case in a heterogeneous population of cancer cells.  MuTect’s real power lies is its ability to detect these subclonal mutations.  The detection is limited mostly by the depth of the sequencing, with quickly diminishing returns – supplementary table 1 in the MuTect paper does a good job of illustrating this, particularly if you turn on conditional formatting in Excel to turn the table of values into a heatmap.  Everybody loves conditional formatting!

MuTect’s output is not without issues, but I’ll discuss further filtering of the output next time.

Finally, it’s worth noting that YES, MuTect does output in VCF format… it’s just undocumented.  If you ask very nicely in their forums, the authors will help you.

GPUs and Bioinformatics

The ICR (note fabulous new branding) organised a half-day  seminar on “GPU Acceleration in Bioinformatics”.  NVIDIA were kind enough to sponsor the event, which meant flying speakers in from as far as Russia and Hong Kong.  Thanks, NVIDIA!

All computers contain at least one processor, the CPU (central processing unit), but many also contain a specialised secondary processor which is geared toward doing the calculations necessary to produce graphics quickly, the GPU (graphics processing unit).  Although these may be embedded on the motherboard of your PC (or even on the same die as the CPU in some cases) for serious amounts of graphical power you’ll need a separate device, a so-called graphics card.

The first question to address is why would you want to use a GPU at all?  In desktop scenarios, this is easy because many users have a GPU in their machine doing almost nothing most of the time.  What if you could make use of all this power?  You’ll call it “hardware acceleration”.  Microsoft recently posted a very informative article outlining how they’ve baked this kind of GPU utilisation right into Windows 8 to speed up everything from the obvious (3D game performance) to the seemingly mundane (text rendering).  Even so, the the majority of the power of a GPU is still largely unused in a desktop machine when not gaming.

However, the situation in computational biology is not so simple, as most of my compute is performed on Linux clusters, nowhere near a traditional desktop computer, with no GUI and therefore no apparent need for a GPU.  At first glance, you might suspect that bioinformaticians just wanted an excuse to get their funders to furnish them with all-singing-all-dancing gaming rigs, but this isn’t the case at all.  GPUs are exceptionally good at performing huge numbers of similar operations in parallel and it just so happens that many of the problems we have in biology (a great example being sequence alignment to a reference genome) are exactly these sort of embarrassingly parallelisable problems.

Having established the potential need for GPUs, you need to choose a language; CUDA (NVIDIA specific) or openCL (an open standard).  I’m not a computer-scientist, but I am a gamer and I’ve noted that NVIDIA cards tend to be better (for my purposes) than ATi cards, though I’ve only switched to NVIDIA in the last few years.  This was partly because I noticed that non-gaming applications I use  such as Handbrake (a video transcoder) would be much faster using a CUDA-enabled card.  You might count that as anecdotal evidence that CUDA is more widely supported and used when even consumers have heard of it.  OpenCL offers hardware-agnostic portability and academics will often vouch for open solutions but as I have a preference for NVIDIA cards, certain apps I like run in CUDA already and anything written in OpenCL would run on an NVIDIA card but not vice-versa… the case for NVIDIA seems clear.

The talks themselves were very good, with highlights including:

SOAP3-dp.  A new version of the SOAP aligner has just been released.  It looks quite exciting as it now runs on GPUs, running “easy” alignments via two-way Burrows-Wheeler transforms (i.e. it works like BWA) on the GPU and leaving harder reads to the CPU , so much like running Stampy in hybrid mode with BWA, which has been my preferred approach.  I guess I need to run some comparisons.  On the other hand, I guess we need to get our GPU server up and running, first.  Soon.

Coincidentally, the very next day I received an automated update from Gerton Lunter that a new version of Stampy had been released.  It’s different in two main ways; firstly, it’s now multithreaded so should be substantially faster.  Secondly, it no longer runs concurrently with BWA; rather, you align first using BWA and then allow Stampy to realign any poorly aligned reads.  You could run Stampy on any bam file, so maybe we might end up using a SOAP3-dp/Stampy hybrid.  Who can say?  I recently had a disagreement with someone who was using Novoalign, which I gather is much like running Stampy/BWA, but costs actual money.  Proprietary hardware and even operating systems?  Fine.  Proprietary aligners?  I think not, at least not without some serious empirical evidence.

Unipro UGENE.  All the way from Novosibirsk in Russia came Yuriy Vaskin to show off the UGENE software.  I’d used it briefly before and have been generally impressed by the sheer number of features that they’ve packed into it – everything from (multiple) multiple alignment tools, a full workflow manager (think Taverna Workbench) and even the world’s only Windows-based implementation of the BWA aligner.  I’m very biased toward the command-line and scripting while at work so I’m not sure there’s a place for a GUI-based program in my workflow, but I can certainly see the utility for such software, especially for groups without a dedicated bioinformatician.  My interest is not particularly using UGENE as a general workbench, but as an enhanced genome browser.  I currently use IGV to view our bam files; users can access data locked away on our Linux servers from their Windows (or Mac, or Linux if we had any of those) desktops as I followed the instructions here (http://www.broadinstitute.org/igv/DataServer); very handy indeed.  Users are shielded from a crash-course in SSHing into a Linux server and the data is shielded from the users, who are only human (like me) and bound to accidentally delete or corrupt it eventually.  However, sometimes IGV feels both a little ugly and sometimes a bit bare-bones.  I guess a short attention-span and the urge for shiny novelty are common issues in the 21st century.  In the age of buttery-smooth apps running at 60fps on our smartphones, Java-swing desktop apps do look a bit ugly, but it would be foolish to judge software on “shinyness” over utility, particularly when IGV can be run natively on any platform with a Java Virtual Machine.

I was disappointed to discover that on my (fairly wimpy) Intel Core2 Duo workstation with 8GB of RAM UGENE took several hours to index a 14GB BAM file, when I can view the same file near-instantly using IGV and a standard bai index file.  Despite this there’s plenty of reasons to recommend UGENE; it’s not a great genome browser, but I feel it’s important to suffix that statement with the word “yet”.

Finally, there was some debate over why you should invest so much money in the enterprise-standard NVIDIA Tesla cards when you could feasibly use their much cheaper and consumer focussed (i.e. gaming) GeForce cards.  There are several good reasons.  While consumer cards are indeed cheaper, this is because they are produced in enormous quantities while attempting to squeeze the maximum profit out of them, meaning that they’re clocked at higher speeds and built from less reliable parts and may even have reduced feature/instruction sets.  Even playing your favourite game for 30+ hours a week with all the dials turned to maximum does not compare to the stress a Tesla card might face when consistently hammered with jobs 24/7/365 in a server.  It’s the same story with all computer hardware;  server hard drives are more expensive because they’re more robust and you are of course also paying for support which you just don’t receive or even want as a consumer.  It’s also worth noting that consumer cards have a different form factor and might not fit into a server in the first place or may block the airflow across the system in a catastrophic way.  You might build your own compute farm from consumer cards if you were, say, running some sort of Bitcoin mining operation at home, but you’ll really need Tesla cards if you’re doing science or engineering.  I suppose you could do some testing and local development using consumer cards, but when in full production you just need to spend the money.  You do want an accurate, repeatable answer, don’t you?

Thanks to the speakers for making me think, to NVIDIA for sponsoring the event and Igor Kozin from the ICR’s Scientific Computing Team for organising it.  Maybe Some rough results when I start toying with GPU-accelerated aligners will show up on this blog at some point?

GATK 2.X is here

If you’re in the business of variant discovery (and I am, amongst other things), you’ll know that things are changing and changing fast. A few years ago if you’d asked me what a SNV (Single Nucleotide Variant) was, I might have thought that you’d misspelled SNP (Single Nucleotide Polymorphism); in other words, I wouldn’t have known because the term wasn’t common parlance. Now the situation seems to have reversed, where I can confidently tell you that a SNV is any single nucleotide change in a sequence relative to the reference sequence (usually we’re talking GRCh37/hg19) and you’re probably referring to one of the six possible point mutations such as A>T. I say that only six are possible, because in double-stranded DNA, A>T and T>A are equivalent, depending on which strand you’re looking at. I prefer not to refer to single-base insertions or deletions (indels) as SNVs, but others do, so there’s still ambiguity even in the new nomenclature. Indels are another story for another day.

A former version of myself might have told you that SNPs were simple polymorphisms between individuals, but now I’m not so sure. Where does the boundary lie between what is a polymorphism distributed across the population and what is a variant, specific to an individual? The boundaries seem artificial and arbitrary. If we accept a SNP to be inherited and SNVs to be de novo events gained in an individual, do SNVs become SNPs in the next generation? Besides, if a mutation occurs de novo in multiple individuals, it is still present in the population, so is it a SNP or a mutant?

When you consider that dbSNP no longer contains just regular polymorphisms, but all kinds of variation, including indels and deleterious mutations it’s better, I say, to discard the term SNP altogether and start referring to them as SNVs or simply variants.

I was also fairly distrustful of SNP arrays owing to endless (and continuing) bad experiences with expression arrays, but the fact that DNA sequencing and SNP arrays tend to have very high concordance with one another has helped make me a believer, both in SNP arrays and next generation sequencing. If you’re doing sequencing and looking for such variants, you’re probably using the Broad Institute’s GATK (Genome Analysis ToolKit) to identify or “call” them.

The GATK (I say “gatt-kay” but others seem to prefer to spell out the letters) is a pretty nifty bit of kit. The Broad are well funded and very, very good at what they do. This is what you get when an interdisciplinary team of biologists, mathematicians and software engineers work together. GATK is a Java framework that does many of the things you want to do with your aligned BAM files and it’s designed for compute clusters, with all kinds of multicore trickery and scatter/gather (or map/reduce, if you prefer) goodness. It’s even got a lovely Scala wrapper called Queue to build and run entire analysis pipelines, though sadly it’s aimed primarily at Platform LSF whereas we use a MOAB/TORQUE job scheduler here. I’m sure we’ll get them playing nicely one day, but right now I’ve got a Perl script to run things for me. All in all, GATK is beautifully engineered and currently peerless, although I’m very much looking forward to seeing what the Sanger Centre release in the next 12-24 months.

For me, GATK’s primary use is SNV calling, which it’s very good at, if perhaps a little aggressive; nothing some post-calling filtering can’t fix, though. There have been regular incremental releases over the past year and then there were a few silent months, which made me suspicious. The Broad certainly weren’t going to abandon it, so I wasn’t terribly surprised to see a 2.0 release a few weeks ago. It’s been galloping forward and the version numbers are increasing rapidly.

Positively, it’s still openly accessible to all, but curiously now has a mixed open/closed source model. I suspect that this is monetize the program (to commercial entities, I’m not suggesting that the Broad would attempt to charge academic users), or more probably to keep unwanted third parties from monetizing it themselves by selling analysis; such are the risks of open-source software. I’m no open-source zealot and I have room in my life for both open and closed software; perhaps I’ll write about this another time.

My only issues so far are that when I receive a “[gsa-announce] GATK version X.Y-Z released” email from their mailing list, I can no longer do the following to rapidly update my installation:

git pull
ant clean
ant

More correctly I can still do this, but what I pull down is the GATKlite, which is the open-source arm of GATK. At least I can still read the git comments to see what changes with each version. For the full version I’m forced to use a web interface to manually download binaries that I then push to my servers, which is inconvenient. Still, this may change in future, as everything is in flux, not least because the group appear to be in the grip of shifting everything from their old Wiki to their new Vanilla website. I’m a bit sorry to see their getsatisfaction website go, as I rather liked it and whenever I asked questions I was promptly answer. Pretty good support for free.

There doesn’t appear to be a published roadmap of features for the future, but right now I’m excited about the following two features (and you should be as well):

1.) BQSR v2. Improved (faster, more complex, more accurate, far better for indels) base quality score recalibration

2.) The HaplotypeCaller. This is the successor to the UnifiedGenotyper which I’ve been using with very satisfactory results to call SNVs. The major improvement is that it now includes de novo assembly to help call indels.  I’m talking real, de Bruijn graph-based assembly, just like Velvet. This means that we can hopefully say goodbye to seemingly endless lists of false-positives and hello to real indels!

It’ll be a little while before I shift completely to GATK2, but right now I’m evaluating it and comparing calls produced using the final 1.6 series of GATK and the future is looking very bright indeed. Superior tools lead to superior results.