GPUs and Bioinformatics

The ICR (note the fabulous new branding) organised a half-day seminar on “GPU Acceleration in Bioinformatics”.  NVIDIA were kind enough to sponsor the event, which meant flying speakers in from as far afield as Russia and Hong Kong.  Thanks, NVIDIA!

All computers contain at least one processor, the CPU (central processing unit), but many also contain a specialised secondary processor geared toward doing the calculations necessary to produce graphics quickly: the GPU (graphics processing unit).  Although these may be embedded on the motherboard of your PC (or even on the same die as the CPU in some cases), for serious amounts of graphical power you’ll need a separate device, a so-called graphics card.

The first question to address is why you would want to use a GPU at all.  In desktop scenarios, this is easy to answer, because many users have a GPU in their machine doing almost nothing most of the time.  What if you could make use of all this power?  This is what’s called “hardware acceleration”.  Microsoft recently posted a very informative article outlining how they’ve baked this kind of GPU utilisation right into Windows 8 to speed up everything from the obvious (3D game performance) to the seemingly mundane (text rendering).  Even so, the majority of the power of a GPU is still largely unused in a desktop machine when not gaming.

However, the situation in computational biology is not so simple: most of my compute is performed on Linux clusters, nowhere near a traditional desktop computer, with no GUI and therefore no apparent need for a GPU.  At first glance, you might suspect that bioinformaticians just wanted an excuse to get their funders to furnish them with all-singing, all-dancing gaming rigs, but this isn’t the case at all.  GPUs are exceptionally good at performing huge numbers of similar operations in parallel, and it just so happens that many of the problems we have in biology (a great example being sequence alignment to a reference genome) are exactly this sort of embarrassingly parallelisable problem.
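To make the “embarrassingly parallel” point concrete: if each read aligns independently of every other read, the work splits cleanly across however many workers you have, whether that’s CPU cores or thousands of GPU cores.  A toy sketch in Python (the scoring function is a made-up stand-in, not a real aligner):

```python
from concurrent.futures import ProcessPoolExecutor

def align(read):
    """Stand-in for aligning one read; a real aligner does vastly more work."""
    return (read, read.count("G") + read.count("C"))  # toy "score": GC count

reads = ["GATTACA", "GGCCGG", "ATATAT", "CGCGCG"]

if __name__ == "__main__":
    # Because no read depends on any other, the map parallelises trivially;
    # this independence is exactly what a GPU's thousands of cores exploit.
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(align, reads)))
```

Swap the toy function for real alignment and the pool for a GPU kernel and you have the basic shape of the tools discussed below.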

Having established the potential need for GPUs, you need to choose a language: CUDA (NVIDIA-specific) or OpenCL (an open standard).  I’m not a computer scientist, but I am a gamer, and I’ve noted that NVIDIA cards tend to be better (for my purposes) than ATi cards, though I’ve only switched to NVIDIA in the last few years.  This was partly because I noticed that non-gaming applications I use, such as Handbrake (a video transcoder), would be much faster with a CUDA-enabled card.  You might count that as anecdotal evidence that CUDA is more widely supported and used, when even consumers have heard of it.  OpenCL offers hardware-agnostic portability, and academics will often vouch for open solutions, but given that I prefer NVIDIA cards, that certain applications I like already run in CUDA, and that anything written in OpenCL will run on an NVIDIA card but not vice versa, the case for NVIDIA seems clear.

The talks themselves were very good, with highlights including:

SOAP3-dp.  A new version of the SOAP aligner has just been released.  It looks quite exciting, as it now runs on GPUs, performing “easy” alignments via two-way Burrows-Wheeler transforms (i.e. it works like BWA) on the GPU and leaving harder reads to the CPU, much like running Stampy in hybrid mode with BWA, which has been my preferred approach.  I guess I need to run some comparisons.  Then again, we need to get our GPU server up and running first.  Soon.

Coincidentally, the very next day I received an automated update from Gerton Lunter announcing that a new version of Stampy had been released.  It differs in two main ways: firstly, it’s now multithreaded, so it should be substantially faster; secondly, it no longer runs concurrently with BWA.  Instead, you align first using BWA and then let Stampy realign any poorly aligned reads.  Since you can run Stampy on any BAM file this way, maybe we’ll end up using a SOAP3-dp/Stampy hybrid.  Who can say?  I recently had a disagreement with someone who was using Novoalign, which I gather works much like running Stampy with BWA, but costs actual money.  Proprietary hardware and even operating systems?  Fine.  Proprietary aligners?  I think not, at least not without some serious empirical evidence.

Unipro UGENE.  All the way from Novosibirsk in Russia came Yuriy Vaskin to show off the UGENE software.  I’d used it briefly before and have been generally impressed by the sheer number of features they’ve packed into it: everything from (multiple) multiple alignment tools to a full workflow manager (think Taverna Workbench) and even the world’s only Windows-based implementation of the BWA aligner.  I’m very biased toward the command line and scripting while at work, so I’m not sure there’s a place for a GUI-based program in my workflow, but I can certainly see the utility of such software, especially for groups without a dedicated bioinformatician.  My interest is not so much in using UGENE as a general workbench as in using it as an enhanced genome browser.  I currently use IGV to view our BAM files; users can access data locked away on our Linux servers from their Windows (or Mac, or Linux, if we had any of those) desktops, as I followed the instructions here; very handy indeed.  Users are shielded from a crash course in SSHing into a Linux server, and the data is shielded from the users, who are only human (like me) and bound to accidentally delete or corrupt it eventually.  However, IGV sometimes feels both a little ugly and a bit bare-bones.  I guess a short attention span and the urge for shiny novelty are common afflictions in the 21st century.  In the age of buttery-smooth apps running at 60fps on our smartphones, Java Swing desktop apps do look a bit dated, but it would be foolish to judge software on “shininess” over utility, particularly when IGV can be run natively on any platform with a Java Virtual Machine.

I was disappointed to discover that, on my (fairly wimpy) Intel Core 2 Duo workstation with 8GB of RAM, UGENE took several hours to index a 14GB BAM file, when I can view the same file near-instantly in IGV using a standard .bai index.  Despite this, there are plenty of reasons to recommend UGENE; it’s not a great genome browser, but I feel it’s important to suffix that statement with the word “yet”.

Finally, there was some debate over why you should invest so much money in the enterprise-standard NVIDIA Tesla cards when you could feasibly use their much cheaper, consumer-focussed (i.e. gaming) GeForce cards.  There are several good reasons not to.  While consumer cards are indeed cheaper, this is because they are produced in enormous quantities while squeezing the maximum profit out of them, meaning that they’re clocked at higher speeds, built from less reliable parts and may even have reduced feature/instruction sets.  Even playing your favourite game for 30+ hours a week with all the dials turned to maximum does not compare to the stress a Tesla card faces when consistently hammered with jobs 24/7/365 in a server.  It’s the same story with all computer hardware: server hard drives are more expensive because they’re more robust, and you are of course also paying for support, which you just don’t receive (or even want) as a consumer.  It’s also worth noting that consumer cards have a different form factor and might not fit into a server in the first place, or may block the airflow across the system in a catastrophic way.  You might build your own compute farm from consumer cards if you were, say, running some sort of Bitcoin mining operation at home, but you’ll really need Tesla cards if you’re doing science or engineering.  I suppose you could do some testing and local development using consumer cards, but in full production you just need to spend the money.  You do want an accurate, repeatable answer, don’t you?

Thanks to the speakers for making me think, to NVIDIA for sponsoring the event, and to Igor Kozin from the ICR’s Scientific Computing Team for organising it.  Maybe some rough results will show up on this blog once I start toying with GPU-accelerated aligners.

GATK 2.X is here

If you’re in the business of variant discovery (and I am, amongst other things), you’ll know that things are changing and changing fast. A few years ago if you’d asked me what a SNV (Single Nucleotide Variant) was, I might have thought that you’d misspelled SNP (Single Nucleotide Polymorphism); in other words, I wouldn’t have known because the term wasn’t common parlance. Now the situation seems to have reversed, where I can confidently tell you that a SNV is any single nucleotide change in a sequence relative to the reference sequence (usually we’re talking GRCh37/hg19) and you’re probably referring to one of the six possible point mutations such as A>T. I say that only six are possible, because in double-stranded DNA, A>T and T>A are equivalent, depending on which strand you’re looking at. I prefer not to refer to single-base insertions or deletions (indels) as SNVs, but others do, so there’s still ambiguity even in the new nomenclature. Indels are another story for another day.
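The strand-equivalence argument is easy to demonstrate in code.  Below is a sketch in Python of the usual normalisation: any change with a purine (A or G) reference base is re-expressed via the complementary strand, which collapses the twelve possible base changes into six classes, conventionally written with a pyrimidine reference.

```python
# Map each base to its complement on the opposite strand
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def snv_class(ref, alt):
    """Collapse a point mutation to its strand-equivalent class, written
    (by the usual convention) with a pyrimidine (C or T) reference base."""
    if ref in "AG":  # purine reference: read the event off the other strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"

# A>T and T>A are the same event viewed from opposite strands:
print(snv_class("A", "T"), snv_class("T", "A"))  # T>A T>A
```

Run over all twelve ref/alt pairs, this yields exactly six distinct classes: C>A, C>G, C>T, T>A, T>C and T>G.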

A former version of myself might have told you that SNPs were simple polymorphisms between individuals, but now I’m not so sure. Where does the boundary lie between a polymorphism distributed across the population and a variant specific to an individual? The boundaries seem artificial and arbitrary. If we accept a SNP to be inherited and SNVs to be de novo events gained in an individual, do SNVs become SNPs in the next generation? Besides, if a mutation occurs de novo in multiple individuals, it is still present in the population, so is it a SNP or a mutation?

When you consider that dbSNP no longer contains just regular polymorphisms, but all kinds of variation, including indels and deleterious mutations, it’s better, I say, to discard the term SNP altogether and start referring to them all as SNVs, or simply variants.

I was also fairly distrustful of SNP arrays owing to endless (and continuing) bad experiences with expression arrays, but the fact that DNA sequencing and SNP arrays tend to have very high concordance with one another has helped make me a believer, both in SNP arrays and next generation sequencing. If you’re doing sequencing and looking for such variants, you’re probably using the Broad Institute’s GATK (Genome Analysis ToolKit) to identify or “call” them.

The GATK (I say “gatt-kay” but others seem to prefer to spell out the letters) is a pretty nifty bit of kit. The Broad are well funded and very, very good at what they do. This is what you get when an interdisciplinary team of biologists, mathematicians and software engineers work together. GATK is a Java framework that does many of the things you want to do with your aligned BAM files and it’s designed for compute clusters, with all kinds of multicore trickery and scatter/gather (or map/reduce, if you prefer) goodness. It’s even got a lovely Scala wrapper called Queue to build and run entire analysis pipelines, though sadly it’s aimed primarily at Platform LSF whereas we use a MOAB/TORQUE job scheduler here. I’m sure we’ll get them playing nicely one day, but right now I’ve got a Perl script to run things for me. All in all, GATK is beautifully engineered and currently peerless, although I’m very much looking forward to seeing what the Sanger Centre release in the next 12-24 months.

For me, GATK’s primary use is SNV calling, which it’s very good at, if perhaps a little aggressive; nothing some post-calling filtering can’t fix, though. There have been regular incremental releases over the past year and then there were a few silent months, which made me suspicious. The Broad certainly weren’t going to abandon it, so I wasn’t terribly surprised to see a 2.0 release a few weeks ago. It’s been galloping forward and the version numbers are increasing rapidly.

Positively, it’s still openly accessible to all, but curiously it now has a mixed open/closed source model. I suspect that this is to monetize the program (for commercial entities; I’m not suggesting that the Broad would attempt to charge academic users), or more probably to keep unwanted third parties from monetizing it themselves by selling analysis; such are the risks of open-source software. I’m no open-source zealot and I have room in my life for both open and closed software; perhaps I’ll write about this another time.

My only issues so far are that when I receive a “[gsa-announce] GATK version X.Y-Z released” email from their mailing list, I can no longer do the following to rapidly update my installation:

git pull    # fetch the latest source
ant clean   # clear out the old build artifacts
ant         # rebuild

More correctly, I can still do this, but what I pull down is GATKlite, the open-source arm of GATK. At least I can still read the git comments to see what changes with each version. For the full version I’m forced to use a web interface to manually download binaries that I then push to my servers, which is inconvenient. Still, this may change in future, as everything is in flux, not least because the group appear to be in the grip of shifting everything from their old wiki to their new Vanilla website. I’m a bit sorry to see their GetSatisfaction website go, as I rather liked it, and whenever I asked questions I was promptly answered. Pretty good support for free.

There doesn’t appear to be a published roadmap of features for the future, but right now I’m excited about the following two features (and you should be as well):

1.) BQSR v2. Improved (faster, more complex, more accurate, far better for indels) base quality score recalibration

2.) The HaplotypeCaller. This is the successor to the UnifiedGenotyper, which I’ve been using with very satisfactory results to call SNVs. The major improvement is that it now includes de novo assembly to help call indels.  I’m talking real, de Bruijn graph-based assembly, just like Velvet. This means that we can hopefully say goodbye to seemingly endless lists of false positives and hello to real indels!
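The HaplotypeCaller’s implementation is of course far more sophisticated, but the core data structure is simple enough to sketch: chop reads into k-mers and link each k-mer’s prefix to its suffix.  The toy reads below are invented for illustration.

```python
from collections import defaultdict

def de_bruijn(reads, k=4):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge from prefix to suffix
    return graph

# Overlapping reads from a made-up locus; an assembler would now walk
# paths through the graph to reconstruct the underlying haplotype(s).
graph = de_bruijn(["GATTACAT", "TTACATTG", "ACATTGCA"])
```

Even these three toy reads create a branch at the node ATT (ATT→TTA from the first read, ATT→TTG from the second); deciding which paths through such branches represent real haplotypes rather than sequencing errors is where the hard work lies.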

It’ll be a little while before I shift completely to GATK2, but right now I’m evaluating it and comparing calls produced using the final 1.6 series of GATK and the future is looking very bright indeed. Superior tools lead to superior results.

Perl + shell = Sherl

Backticks. Technically speaking, what I’m advocating is probably bad form and frowned upon by real computer scientists, but I love system calls from within Perl scripts. Why would I write several lines of Perl to query what’s in a directory or write a temporary file when I could just use bash shell scripting commands wrapped in backticks?

## Use sed from within Perl to edit a file
## (Perl interpolates $somethingelse into the command before the shell runs it)
`sed -i 's/something/$somethingelse/g' somefile.txt`;
## Read a list of chromosome names directly into an array; \$1 is escaped so
## that awk (not Perl) interprets it, and chomp strips the trailing newlines
chomp(my @chromos = `awk '{print \$1}' /home/cwardell/b37/human_g1k_v37.fasta.fai`);

If we consider the ends to be justified by the means (let ends = “getting as much work done in the most time-efficient way possible” and means = “a bit of hackery”), I see no real problem with it. This is why a great deal of my code ends up littered with backticks. In case you’re not familiar with backticks, you can find the key in the top-left of your keyboard, nestled between the Esc, Tab and 1 keys. Officially, a backtick is termed the “grave accent” and is not to be confused with the apostrophe, which will do nothing for your code except throw errors, as your enclosed command will be interpreted as a string.
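For comparison, the second backtick one-liner in Python (which I keep meaning to switch to) needs the subprocess module.  The .fai content below is a made-up miniature, and I’ve swapped awk for cut, since all we want is the first column of a tab-separated file.

```python
import os
import subprocess
import tempfile

# Write a tiny stand-in for a .fasta.fai index (tab-separated name/length)
with tempfile.NamedTemporaryFile("w", suffix=".fai", delete=False) as f:
    f.write("1\t249250621\n2\t243199373\nX\t155270560\n")
    fai = f.name

# The backtick one-liner becomes an explicit call; cut's default delimiter
# is a tab, so -f1 pulls out just the chromosome names
out = subprocess.run(["cut", "-f1", fai],
                     capture_output=True, text=True, check=True)
chromos = out.stdout.split()
print(chromos)  # ['1', '2', 'X']

os.remove(fai)
```

It’s more verbose than the backticks, but check=True gives you error handling the one-liner quietly skips.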

Bearing in mind that Perl is already an unforgivably unattractive language, you do it no great aesthetic harm by throwing in some shell scripting. I therefore name this ugly mongrel mix of two ugly languages “Sherl”, and I’ll keep using it because I sometimes value function over form, at least until I’m visualising data. I’m also trusting that your code is so comment-heavy that you’ll understand the what, how and why of your Sherl code.  You do obsessively comment your code, right?

What I Use

Things change quickly, particularly in bioinformatics, so I expect to write more than one of these posts in the future. As a snapshot of what I’m currently using in my day-to-day work:

PC: generic Dell desktop with Windows 7. Getting rid of Windows XP was a massive boost to my productivity
Primary server: Richter – a Dell PowerEdge R610 with 2x Intel Xeon X5650 processors and 48GB of RAM, running Ubuntu Linux 12.04
HPC: Dalek, Judoon and Ood. Respectively, two near identical SGI Altix 8400 clusters (32 nodes each, each identical to Richter with twice as much RAM) and an SGI Altix UV1000 with 1TB of shared RAM

I have to admit, I’m not short on compute, but space for data and analysis is much harder to come by.

Desktop – Microsoft Office for general computing.
Servers – In terms of real work, I do my best to seek out the best open-source solutions for my problems, or write my own solutions (see below). Pretty much all bioinformatics comprises “edge case” situations, where the edge in question is either a very niche application or a problem that has never been solved before. The pieces of software that spring to mind in terms of sequencing would be the aligners I depend on (Stampy + BWA in hybrid mode), middleware like Picard and Samtools and the Broad Institute’s phenomenal Genome Analysis Tool Kit (GATK). More on this another day.

Writing code is essential to every bioinformatician, even if you’re just writing wrappers and glue to stick pre-existing programs and code together. I’m fickle, but currently keen on:

Perl – good for scripting and anything moderately complex; it’s a good workhorse, even if it’s unforgivably ugly. I’d rather write in Python, but I’m not as strong in it and don’t yet have a compelling reason to switch.
Bash – we all love to shell script. The ease and elegance of piping the input and output of programs into one another on a Unix command line is phenomenally useful and powerful.
SQL – if you’ve got a lot of data, sooner or later you’ll need to whack it in a database, and realistically you have no other choice than a MySQL one. A friend of mine pointed out that a single well-written MySQL query can be more satisfying than hundreds of lines of far more complex code. On the other hand, all the kludge that comes with MySQL is as unwelcome as it is unwieldy and ugly (particularly the Perl DBI. Yuck).
R – I’m pushing more and more of my work through R. I think this is because I love figures and working visually (i.e. interpreting data sets by creating plots and other images) and R has such powerful visualisation tools built right in (base graphics) or just a package download away (ggplot2). R is simultaneously beautifully simple and heinously broken, but you must love things because of their flaws, not in spite of them. The three primary pieces of advice I can offer new starters would be (in order of importance):

  1. Don’t give up. It’s hard, but nothing worth doing is easy
  2. Use the excellent RStudio IDE
  3. Read the R Inferno

If you never spent years using the 32-bit Windows version on a puny machine (no context-sensitive highlighting, no tab completion, no easy list of variable types and dimensions), constantly running into “cannot allocate vector of size…” errors when you ran out of memory, you probably can’t appreciate the joy of working with the 64-bit version on a server with dozens of GB of RAM.
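As a footnote to the MySQL point above, here’s the sort of thing my friend meant, sketched with Python’s built-in sqlite3 so it runs without a server (the table and values are invented).  Counting variants per chromosome in ordinary code means a loop, a hash and a sort; in SQL it’s one GROUP BY.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)")
con.executemany("INSERT INTO variants VALUES (?, ?, ?, ?)",
                [("1", 100, "A", "T"), ("1", 200, "C", "G"), ("2", 150, "G", "A")])

# One well-written query in place of a loop, a hash and a sort
rows = con.execute(
    "SELECT chrom, COUNT(*) FROM variants GROUP BY chrom ORDER BY chrom"
).fetchall()
print(rows)  # [('1', 2), ('2', 1)]
```

The same statement runs unchanged against MySQL; only the connection boilerplate (and the kludge) differs.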