GATK 2.X is here

If you’re in the business of variant discovery (and I am, amongst other things), you’ll know that things are changing and changing fast. A few years ago if you’d asked me what a SNV (Single Nucleotide Variant) was, I might have thought that you’d misspelled SNP (Single Nucleotide Polymorphism); in other words, I wouldn’t have known because the term wasn’t common parlance. Now the situation seems to have reversed, where I can confidently tell you that a SNV is any single nucleotide change in a sequence relative to the reference sequence (usually we’re talking GRCh37/hg19) and you’re probably referring to one of the six possible point mutations such as A>T. I say that only six are possible, because in double-stranded DNA, A>T and T>A are equivalent, depending on which strand you’re looking at. I prefer not to refer to single-base insertions or deletions (indels) as SNVs, but others do, so there’s still ambiguity even in the new nomenclature. Indels are another story for another day.

A former version of myself might have told you that SNPs were simple polymorphisms between individuals, but now I’m not so sure. Where does the boundary lie between a polymorphism, distributed across the population, and a variant, specific to an individual? The boundaries seem artificial and arbitrary. If we accept SNPs to be inherited and SNVs to be de novo events arising in an individual, do SNVs become SNPs in the next generation? Besides, if a mutation occurs de novo in multiple individuals, it is still present in the population, so is it a SNP or a mutation?

When you consider that dbSNP no longer contains just regular polymorphisms, but all kinds of variation, including indels and deleterious mutations, it’s better, I say, to discard the term SNP altogether and start referring to them all as SNVs, or simply variants.

I was also fairly distrustful of SNP arrays owing to endless (and continuing) bad experiences with expression arrays, but the fact that DNA sequencing and SNP arrays tend to have very high concordance with one another has helped make me a believer, both in SNP arrays and next generation sequencing. If you’re doing sequencing and looking for such variants, you’re probably using the Broad Institute’s GATK (Genome Analysis ToolKit) to identify or “call” them.

The GATK (I say “gatt-kay” but others seem to prefer to spell out the letters) is a pretty nifty bit of kit. The Broad are well funded and very, very good at what they do. This is what you get when an interdisciplinary team of biologists, mathematicians and software engineers work together. GATK is a Java framework that does many of the things you want to do with your aligned BAM files and it’s designed for compute clusters, with all kinds of multicore trickery and scatter/gather (or map/reduce, if you prefer) goodness. It’s even got a lovely Scala wrapper called Queue to build and run entire analysis pipelines, though sadly it’s aimed primarily at Platform LSF whereas we use a MOAB/TORQUE job scheduler here. I’m sure we’ll get them playing nicely one day, but right now I’ve got a Perl script to run things for me. All in all, GATK is beautifully engineered and currently peerless, although I’m very much looking forward to seeing what the Sanger Centre release in the next 12-24 months.
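For the curious, the stopgap amounts to wrapping qsub calls in Perl. What follows is a minimal sketch rather than my actual script: the qsub flags are standard TORQUE, but the sample names, resources and GATK arguments are all hypothetical stand-ins:

#!/usr/bin/perl
use strict;
use warnings;

## Hypothetical pipeline runner: submit one GATK job per sample
## to the MOAB/TORQUE scheduler via qsub
my @samples = qw(sampleA sampleB sampleC);
foreach my $sample (@samples) {
    my $cmd = "java -Xmx4g -jar GenomeAnalysisTK.jar "
            . "-T UnifiedGenotyper -R human_g1k_v37.fasta "
            . "-I $sample.bam -o $sample.vcf";
    ## qsub will happily read a job script from STDIN
    system("echo '$cmd' | qsub -N gatk_$sample -l nodes=1:ppn=4,mem=4gb") == 0
        or die "qsub submission failed for $sample: $?";
}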

For me, GATK’s primary use is SNV calling, which it’s very good at, if perhaps a little aggressive; nothing some post-calling filtering can’t fix, though. There have been regular incremental releases over the past year and then there were a few silent months, which made me suspicious. The Broad certainly weren’t going to abandon it, so I wasn’t terribly surprised to see a 2.0 release a few weeks ago. It’s been galloping forward and the version numbers are increasing rapidly.

Positively, it’s still openly accessible to all, but curiously it now has a mixed open/closed source model. I suspect that this is to monetize the program (to commercial entities; I’m not suggesting that the Broad would attempt to charge academic users), or more probably to keep unwanted third parties from monetizing it themselves by selling analysis; such are the risks of open-source software. I’m no open-source zealot and I have room in my life for both open and closed software; perhaps I’ll write about this another time.

My only issues so far are that when I receive a “[gsa-announce] GATK version X.Y-Z released” email from their mailing list, I can no longer do the following to rapidly update my installation:

git pull
ant clean
ant

More correctly, I can still do this, but what I pull down is GATKlite, the open-source arm of GATK. At least I can still read the git commit messages to see what changes with each version. For the full version I’m forced to use a web interface to manually download binaries that I then push to my servers, which is inconvenient. Still, this may change in future, as everything is in flux, not least because the group appear to be in the midst of shifting everything from their old Wiki to their new Vanilla website. I’m a bit sorry to see their getsatisfaction website go, as I rather liked it and whenever I asked questions I was promptly answered. Pretty good support for free.

There doesn’t appear to be a published roadmap for the future, but right now I’m excited about the following two features (and you should be as well):

1.) BQSR v2. Improved (faster, more complex, more accurate, far better for indels) base quality score recalibration

2.) The HaplotypeCaller. This is the successor to the UnifiedGenotyper, which I’ve been using with very satisfactory results to call SNVs. The major improvement is that it now includes de novo assembly to help call indels. I’m talking real, de Bruijn graph-based assembly, just like Velvet. This means that we can hopefully say goodbye to seemingly endless lists of false positives and hello to real indels! (There’s a sketch of invoking both features below.)
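
In case you want to try them yourself, here’s a rough sketch of both invocations in my Sherl idiom. The walker names (BaseRecalibrator, PrintReads, HaplotypeCaller) follow the GATK 2.x documentation, but the file names, memory settings and options are made up, so check them against the current release:

#!/usr/bin/perl
use strict;
use warnings;

my $gatk = "java -Xmx8g -jar GenomeAnalysisTK.jar";
my $ref  = "human_g1k_v37.fasta";

## BQSRv2, step 1: model the systematic error in the base quality scores
system("$gatk -T BaseRecalibrator -R $ref -I sample.bam "
     . "-knownSites dbsnp.vcf -o recal.grp") == 0
    or die "BaseRecalibrator failed: $?";
## BQSRv2, step 2: write out a BAM with recalibrated qualities
system("$gatk -T PrintReads -R $ref -I sample.bam "
     . "-BQSR recal.grp -o sample.recal.bam") == 0
    or die "PrintReads failed: $?";

## HaplotypeCaller: locally reassembles each region, then calls SNVs and indels
system("$gatk -T HaplotypeCaller -R $ref -I sample.recal.bam "
     . "-o sample.vcf") == 0
    or die "HaplotypeCaller failed: $?";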

It’ll be a little while before I shift completely to GATK2, but right now I’m evaluating it, comparing its calls against those produced by the final 1.6 series of GATK, and the future is looking very bright indeed. Superior tools lead to superior results.

Perl + shell = Sherl

Backticks. Technically speaking, what I’m advocating is probably bad form and frowned upon by real computer scientists, but I love system calls from within Perl scripts. Why would I write several lines of Perl to query what’s in a directory or write a temporary file when I could just use bash shell scripting commands wrapped in backticks?

## Use sed from within Perl to edit a file; note that Perl interpolates
## $somethingelse into the command before the shell ever sees it
`sed -i 's/something/$somethingelse/g' somefile.txt`;
## Read a list of chromosomes directly into an array, one per line
my @chromos = `awk '{print \$1}' /home/cwardell/b37/human_g1k_v37.fasta.fai`;
chomp(@chromos); ## strip the trailing newline from each element

If we consider the ends to be justified by the means (let ends = “getting as much work done in the most time-efficient way possible” and means = “a bit of hackery”), I see no real problem with it. This is why a great deal of my code ends up containing backticks. In case you’re not familiar with backticks, you can find them in the top-left of your keyboard, nestled between the Esc, Tab and 1 keys. Officially a backtick is termed the “grave accent” and is not to be confused with the apostrophe, which will do nothing for your code except throw errors, as your enclosed command will be interpreted as a plain string rather than executed.

Bearing in mind that Perl is already an unforgivably unattractive language and you’re doing it no great aesthetic harm by throwing in some shell scripting, I therefore name this ugly mongrel mix of two ugly languages “Sherl”, and I’ll keep using it because I sometimes value function over form, at least until I’m visualising data. I’m also trusting that your code is so comment-heavy that you’ll understand the what, how and why of your Sherl code. You do obsessively comment your code, right?

What I Use

Things change quickly, particularly in bioinformatics, so I expect to write more than one of these posts in the future. Here’s a snapshot of what I’m currently using in my day-to-day work:

Hardware:
PC: generic Dell desktop with Windows 7. Getting rid of Windows XP was a massive boost to my productivity
Primary server: Richter – a Dell PowerEdge R610 with 2x Intel Xeon X5650 processors and 48GB of RAM, running Ubuntu Linux 12.04
HPC: Dalek, Judoon and Ood. Respectively, two near-identical SGI Altix 8400 clusters (32 nodes each, every node identical to Richter but with twice the RAM) and an SGI Altix UV1000 with 1TB of shared RAM

I have to admit, I’m not short on compute, but space for data and analysis is much harder to come by.

Software:
Desktop – Microsoft Office for general computing.
Servers – In terms of real work, I do my best to seek out the best open-source solutions for my problems, or write my own (see below). Pretty much all bioinformatics comprises “edge case” situations, where the edge in question is either a very niche application or a problem that has never been solved before. The pieces of software that spring to mind in terms of sequencing would be the aligners I depend on (Stampy + BWA in hybrid mode, sketched below), middleware like Picard and Samtools, and the Broad Institute’s phenomenal Genome Analysis ToolKit (GATK). More on this another day.
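
Hybrid mode, incidentally, is less mysterious than it sounds: BWA takes a fast first pass and Stampy rescues the reads BWA struggles with. A hedged sketch of the recipe, following Stampy’s documented flags but with invented file names:

#!/usr/bin/perl
use strict;
use warnings;

## One-off: build Stampy's genome index (.stidx) and hash (.sthash) files
unless (-e "hg19.stidx") {
    system("stampy.py -G hg19 hg19.fa") == 0 or die "stampy -G failed: $?";
}
unless (-e "hg19.sthash") {
    system("stampy.py -g hg19 -H hg19") == 0 or die "stampy -H failed: $?";
}

## Map in hybrid mode: BWA premaps for speed, Stampy remaps the
## divergent reads that BWA handles poorly
my $map = 'stampy.py -g hg19 -h hg19 --bwaoptions="-q10 hg19.fa" '
        . '-M reads_1.fastq reads_2.fastq > aligned.sam';
system($map) == 0 or die "Stampy/BWA hybrid mapping failed: $?";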

Code:
Writing code is essential to every bioinformatician, even if you’re just writing wrappers and glue to stick pre-existing programs and code together. I’m fickle, but currently keen on:

Perl – good for scripting and anything moderately complex; it’s a good workhorse, even if it’s unforgivably ugly. I’d rather write in Python, but I’m not as strong in it and don’t yet have a compelling reason to switch.
Bash – we all love to shell script. The ease and elegance of piping the input and output of programs into one another on a Unix command line is phenomenally useful and powerful.
SQL – if you’ve got a lot of data, sooner or later you’ll need to whack it in a database and, realistically, you have no real choice other than MySQL. A friend of mine pointed out that a single well-written MySQL query can be more satisfying than hundreds of lines of far more complex code. On the other hand, all the kludge that comes with MySQL is as unwelcome as it is unwieldy and ugly (particularly the Perl DBI. Yuck).
R – I’m pushing more and more of my work through R. I think this is because I love figures and working visually (i.e. interpreting data sets by creating plots and other images) and R has such powerful visualisation tools built right in (base graphics) or just a package download away (ggplot2). R is simultaneously beautifully simple and heinously broken, but you must love things because of their flaws, not in spite of them. The three primary pieces of advice I can offer new starters would be (in order of importance):

  1. Don’t give up. It’s hard, but nothing worth doing is easy
  2. Use the excellent RStudio IDE
  3. Read the R Inferno

If you’re not someone who used the 32-bit Windows version on a puny machine (no context-sensitive highlighting, no tab-completion, no easy list of variable types and dimensions) for some years and constantly ran into endless “cannot allocate vector of size…” error messages when you ran out of memory, you probably can’t appreciate the joy of working with the 64-bit version on a server with dozens of GB of RAM.