GATK 2.X is here

If you’re in the business of variant discovery (and I am, amongst other things), you’ll know that things are changing and changing fast. A few years ago if you’d asked me what a SNV (Single Nucleotide Variant) was, I might have thought that you’d misspelled SNP (Single Nucleotide Polymorphism); in other words, I wouldn’t have known because the term wasn’t common parlance. Now the situation seems to have reversed, where I can confidently tell you that a SNV is any single nucleotide change in a sequence relative to the reference sequence (usually we’re talking GRCh37/hg19) and you’re probably referring to one of the six possible point mutations such as A>T. I say that only six are possible, because in double-stranded DNA, A>T and T>A are equivalent, depending on which strand you’re looking at. I prefer not to refer to single-base insertions or deletions (indels) as SNVs, but others do, so there’s still ambiguity even in the new nomenclature. Indels are another story for another day.
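To make the strand argument concrete, here is a minimal sketch that collapses the twelve possible single-base substitutions into six strand-equivalent classes. Reporting every change from the strand where the reference base is a pyrimidine (C or T) is just one common convention that I've assumed here; the function name `canonical_substitution` is made up for illustration.

```python
from itertools import permutations

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def canonical_substitution(ref, alt):
    """Describe a substitution from the strand where the reference base is C or T.

    Assumption: the pyrimidine-reference convention; either strand is equally valid.
    """
    if ref in ("A", "G"):
        # Purine reference: describe the same event on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"

# Every ordered pair of distinct bases: 4 * 3 = 12 substitutions.
all_substitutions = [f"{r}>{a}" for r, a in permutations("ACGT", 2)]

# Collapsing by strand leaves six classes.
classes = sorted({canonical_substitution(r, a) for r, a in permutations("ACGT", 2)})

print(len(all_substitutions), "substitutions collapse to", len(classes), "classes:", classes)
```

Under this convention, A>T and T>A both come out as T>A, which is the equivalence described above.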

A former version of myself might have told you that SNPs were simple polymorphisms between individuals, but now I’m not so sure. Where does the boundary lie between what is a polymorphism distributed across the population and what is a variant, specific to an individual? The boundaries seem artificial and arbitrary. If we accept a SNP to be inherited and SNVs to be de novo events gained in an individual, do SNVs become SNPs in the next generation? Besides, if a mutation occurs de novo in multiple individuals, it is still present in the population, so is it a SNP or a mutant?

When you consider that dbSNP no longer contains just regular polymorphisms, but all kinds of variation, including indels and deleterious mutations, it’s better, I say, to discard the term SNP altogether and start referring to them as SNVs or simply variants.

I was also fairly distrustful of SNP arrays owing to endless (and continuing) bad experiences with expression arrays, but the fact that DNA sequencing and SNP arrays tend to have very high concordance with one another has helped make me a believer, both in SNP arrays and next generation sequencing. If you’re doing sequencing and looking for such variants, you’re probably using the Broad Institute’s GATK (Genome Analysis ToolKit) to identify or “call” them.

The GATK (I say “gatt-kay” but others seem to prefer to spell out the letters) is a pretty nifty bit of kit. The Broad are well funded and very, very good at what they do. This is what you get when an interdisciplinary team of biologists, mathematicians and software engineers work together. GATK is a Java framework that does many of the things you want to do with your aligned BAM files and it’s designed for compute clusters, with all kinds of multicore trickery and scatter/gather (or map/reduce, if you prefer) goodness. It’s even got a lovely Scala wrapper called Queue to build and run entire analysis pipelines, though sadly it’s aimed primarily at Platform LSF whereas we use a MOAB/TORQUE job scheduler here. I’m sure we’ll get them playing nicely one day, but right now I’ve got a Perl script to run things for me. All in all, GATK is beautifully engineered and currently peerless, although I’m very much looking forward to seeing what the Sanger Centre release in the next 12-24 months.

For me, GATK’s primary use is SNV calling, which it’s very good at, if perhaps a little aggressive; nothing some post-calling filtering can’t fix, though. There have been regular incremental releases over the past year and then there were a few silent months, which made me suspicious. The Broad certainly weren’t going to abandon it, so I wasn’t terribly surprised to see a 2.0 release a few weeks ago. It’s been galloping forward and the version numbers are increasing rapidly.

Positively, it’s still openly accessible to all, but curiously now has a mixed open/closed source model. I suspect that this is to monetize the program (for commercial entities; I’m not suggesting that the Broad would attempt to charge academic users), or more probably to keep unwanted third parties from monetizing it themselves by selling analysis; such are the risks of open-source software. I’m no open-source zealot and I have room in my life for both open and closed software; perhaps I’ll write about this another time.

My only issue so far is that when I receive a “[gsa-announce] GATK version X.Y-Z released” email from their mailing list, I can no longer do the following to rapidly update my installation:

git pull
ant clean

More correctly, I can still do this, but what I pull down is GATKlite, the open-source arm of GATK. At least I can still read the git commit messages to see what changes with each version. For the full version I’m forced to use a web interface to manually download binaries that I then push to my servers, which is inconvenient. Still, this may change in future, as everything is in flux, not least because the group appear to be in the midst of shifting everything from their old Wiki to their new Vanilla website. I’m a bit sorry to see their getsatisfaction website go, as I rather liked it and whenever I asked questions I was promptly answered. Pretty good support for free.

There doesn’t appear to be a published roadmap of features for the future, but right now I’m excited about the following two features (and you should be as well):

1.) BQSR v2. Improved (faster, more complex, more accurate, far better for indels) base quality score recalibration.

2.) The HaplotypeCaller. This is the successor to the UnifiedGenotyper, which I’ve been using with very satisfactory results to call SNVs. The major improvement is that it now includes de novo assembly to help call indels. I’m talking real, de Bruijn graph-based assembly, just like Velvet. This means that we can hopefully say goodbye to seemingly endless lists of false positives and hello to real indels!
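If de Bruijn graph assembly is unfamiliar, the toy sketch below shows the general idea and why it suits indels. It is emphatically not how the HaplotypeCaller is implemented; the reads, the value of k and the function names are all made up for illustration. Reads are chopped into k-mers, overlapping k-mers are joined into a graph, and a deletion shows up as a "bubble" where two paths diverge and rejoin.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def enumerate_paths(graph, start, end, max_len=50):
    """Spell out the sequence of every path from start to end (depth-first).

    max_len is a crude guard against cycles; real assemblers are far more careful.
    """
    found = []
    stack = [(start, start)]
    while stack:
        node, seq = stack.pop()
        if node == end:
            found.append(seq)
            continue
        if len(seq) >= max_len:
            continue
        for nxt in graph.get(node, ()):
            stack.append((nxt, seq + nxt[-1]))
    return sorted(found)

# Hypothetical reads: two support ACGTTGCAAG, two support a 1 bp deletion (ACGTGCAAG).
reads = ["ACGTTGC", "TTGCAAG", "ACGTGCA", "GTGCAAG"]
graph = de_bruijn_graph(reads, k=4)
haplotypes = enumerate_paths(graph, start="ACG", end="AAG")
print(haplotypes)
```

Each path through the bubble is a candidate haplotype, so the indel is recovered from the assembled sequence itself rather than from the gaps in individual read alignments.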

It’ll be a little while before I shift completely to GATK2, but right now I’m evaluating it by comparing its calls with those produced using the final 1.6 series of GATK, and the future is looking very bright indeed. Superior tools lead to superior results.
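For anyone wanting to run the same sort of comparison, a rough sketch of one way to do it follows. It assumes the calls are in VCF, keys each call on the first five columns' chromosome, position, reference and alternate allele, and ignores genotypes, filters and indel normalisation entirely. The records and the function names `load_calls` and `compare_calls` are invented for the example.

```python
def load_calls(vcf_lines):
    """Collect (chrom, pos, ref, alt) tuples from VCF text, one per alternate allele."""
    calls = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            calls.add((chrom, int(pos), ref, allele))
    return calls

def compare_calls(old, new):
    """Summarise the overlap between two call sets."""
    shared = old & new
    union = old | new
    return {
        "shared": len(shared),
        "only_old": len(old - new),
        "only_new": len(new - old),
        # Fraction of all distinct calls made by both versions.
        "concordance": len(shared) / len(union) if union else 1.0,
    }

# Hypothetical records standing in for the output of the two versions.
old_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tT",
    "1\t200\t.\tG\tC",
    "2\t50\t.\tC\tCT",
]
new_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tT",
    "1\t200\t.\tG\tA,C",
    "2\t75\t.\tTA\tT",
]

summary = compare_calls(load_calls(old_vcf), load_calls(new_vcf))
print(summary)
```

The calls unique to each version are the interesting ones: those are the sites worth inspecting by eye before deciding which caller to trust.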