GPUs and Bioinformatics

The ICR (note the fabulous new branding) organised a half-day seminar on “GPU Acceleration in Bioinformatics”.  NVIDIA were kind enough to sponsor the event, which meant flying speakers in from as far afield as Russia and Hong Kong.  Thanks, NVIDIA!

All computers contain at least one processor, the CPU (central processing unit), but many also contain a specialised secondary processor geared toward performing the calculations needed to produce graphics quickly: the GPU (graphics processing unit).  These may be embedded on the motherboard of your PC (or even on the same die as the CPU in some cases), but for serious amounts of graphical power you’ll need a separate device, a so-called graphics card.

The first question to address is why you would want to use a GPU at all.  In desktop scenarios the answer is easy, because many users have a GPU in their machine doing almost nothing most of the time.  What if you could make use of all that power?  This is what’s meant by “hardware acceleration”.  Microsoft recently posted a very informative article outlining how they’ve baked this kind of GPU utilisation right into Windows 8 to speed up everything from the obvious (3D game performance) to the seemingly mundane (text rendering).  Even so, the majority of a GPU’s power still sits unused in a desktop machine when you’re not gaming.

However, the situation in computational biology is not so simple: most of my compute happens on Linux clusters, nowhere near a traditional desktop computer, with no GUI and therefore no apparent need for a GPU.  At first glance you might suspect that bioinformaticians just wanted an excuse to get their funders to furnish them with all-singing-all-dancing gaming rigs, but this isn’t the case at all.  GPUs are exceptionally good at performing huge numbers of similar operations in parallel, and it just so happens that many of the problems we have in biology (a great example being sequence alignment to a reference genome) are exactly this sort of embarrassingly parallelisable problem.
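To make that concrete, here’s a minimal shell sketch of why alignment parallelises so embarrassingly: each read aligns independently of every other, so you can carve up the input and map the pieces concurrently (the file names and chunk size are made up for illustration, and I’m assuming bwa and a pre-built index are to hand).  A GPU takes the same idea to its extreme, running thousands of such independent operations at once.

    # Each read is independent, so simply split the FASTQ (4 lines per read)
    # and align the chunks concurrently.
    split -l 4000000 reads.fastq chunk_

    for chunk in chunk_*; do
        bwa aln ref.fa "$chunk" > "$chunk".sai &    # one background job per chunk
    done
    wait    # block until every chunk is aligned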

Having established the potential need for GPUs, you need to choose a language: CUDA (NVIDIA-specific) or OpenCL (an open standard).  I’m not a computer scientist, but I am a gamer, and I’ve noted that NVIDIA cards tend to be better (for my purposes) than ATI cards, though I’ve only switched to NVIDIA in the last few years.  This was partly because I noticed that non-gaming applications I use, such as Handbrake (a video transcoder), would be much faster with a CUDA-enabled card.  You might count that as anecdotal evidence that CUDA is more widely supported and used, when even consumers have heard of it.  OpenCL offers hardware-agnostic portability, and academics will often vouch for open solutions, but given that I prefer NVIDIA cards, that several applications I like already use CUDA, and that anything written in OpenCL will run on an NVIDIA card anyway (while the reverse is not true)… the case for NVIDIA seems clear.

The talks themselves were very good, with highlights including:

SOAP3-dp.  A new version of the SOAP aligner has just been released.  It looks quite exciting, as it now runs on GPUs, handling “easy” alignments via two-way Burrows-Wheeler transforms (i.e. it works like BWA) on the GPU and leaving harder reads to the CPU, much like running Stampy in hybrid mode with BWA, which has been my preferred approach.  I guess I need to run some comparisons.  Then again, we need to get our GPU server up and running first.  Soon.

Coincidentally, the very next day I received an automated update from Gerton Lunter announcing that a new version of Stampy had been released.  It differs in two main ways: firstly, it’s now multithreaded, so it should be substantially faster; secondly, it no longer runs concurrently with BWA.  Rather, you align everything with BWA first and then let Stampy realign any poorly aligned reads.  You can run Stampy on any BAM file, so maybe we’ll end up using a SOAP3-dp/Stampy hybrid.  Who can say?  I recently had a disagreement with someone who was using Novoalign, which I gather works much like the Stampy/BWA combination, but costs actual money.  Proprietary hardware and even operating systems?  Fine.  Proprietary aligners?  I think not, at least not without some serious empirical evidence.
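For a rough sketch of what that two-step workflow looks like on the command line (file names are invented, and the Stampy flags are as I remember them from the documentation, so check stampy.py --help before copying):

    # Step 1: a standard BWA single-end alignment, converted to BAM.
    bwa aln -t 8 ref.fa reads.fastq > reads.sai
    bwa samse ref.fa reads.sai reads.fastq | samtools view -bS - > bwa.bam

    # Step 2: Stampy remaps only the reads BWA placed poorly;
    # --bamkeepgoodreads leaves BWA's confident alignments untouched.
    stampy.py -g ref -h ref -t 8 --bamkeepgoodreads -M bwa.bam > realigned.sam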

Unipro UGENE.  All the way from Novosibirsk in Russia came Yuriy Vaskin to show off the UGENE software.  I’d used it briefly before and have been generally impressed by the sheer number of features they’ve packed into it – everything from (multiple) multiple alignment tools to a full workflow manager (think Taverna Workbench) and even the world’s only Windows-based implementation of the BWA aligner.  I’m heavily biased toward the command line and scripting at work, so I’m not sure there’s a place for a GUI-based program in my workflow, but I can certainly see the utility of such software, especially for groups without a dedicated bioinformatician.

My interest is not so much in UGENE as a general workbench, but as an enhanced genome browser.  I currently use IGV to view our BAM files; by following the instructions at http://www.broadinstitute.org/igv/DataServer, users can access data locked away on our Linux servers from their Windows (or Mac, or Linux, if we had any of those) desktops – very handy indeed.  Users are shielded from a crash course in SSHing into a Linux server, and the data is shielded from the users, who are only human (like me) and bound to accidentally delete or corrupt it eventually.  However, IGV sometimes feels a little ugly and a bit bare-bones.  I guess a short attention span and the urge for shiny novelty are common afflictions in the 21st century.  In the age of buttery-smooth apps running at 60fps on our smartphones, Java Swing desktop apps do look a bit dated, but it would be foolish to judge software on “shininess” over utility, particularly when IGV runs natively on any platform with a Java Virtual Machine.

I was disappointed to discover that on my (fairly wimpy) Intel Core 2 Duo workstation with 8GB of RAM, UGENE took several hours to index a 14GB BAM file, when I can view the same file near-instantly in IGV using a standard .bai index.  Despite this, there are plenty of reasons to recommend UGENE; it’s not a great genome browser, but I feel it’s important to suffix that statement with the word “yet”.

Finally, there was some debate over why you should invest so much money in the enterprise-grade NVIDIA Tesla cards when you could feasibly use their much cheaper, consumer-focussed (i.e. gaming) GeForce cards.  There are several good reasons.  Consumer cards are cheaper because they’re produced in enormous quantities with the maximum profit squeezed out of them: they’re clocked at higher speeds, built from less reliable parts, and may even have reduced feature/instruction sets.  Even playing your favourite game for 30+ hours a week with all the dials turned to maximum doesn’t compare to the stress a Tesla card faces when consistently hammered with jobs 24/7/365 in a server.  It’s the same story with all computer hardware: server hard drives are more expensive because they’re more robust, and you’re also paying for support that you just don’t receive (or even want) as a consumer.  It’s also worth noting that consumer cards have a different form factor and might not fit into a server in the first place, or may block the airflow across the system in a catastrophic way.  You might build your own compute farm from consumer cards if you were, say, running some sort of Bitcoin mining operation at home, but you’ll really need Tesla cards if you’re doing science or engineering.  I suppose you could do some testing and local development using consumer cards, but in full production you just need to spend the money.  You do want an accurate, repeatable answer, don’t you?

Thanks to the speakers for making me think, to NVIDIA for sponsoring the event, and to Igor Kozin from the ICR’s Scientific Computing Team for organising it.  Maybe some rough results will show up on this blog once I start toying with GPU-accelerated aligners.

What I Use

Things change quickly, particularly in bioinformatics, so I expect to write more than one of these posts in the future. As a snapshot of what I’m currently using in my day-to-day work:

Hardware:
PC: a generic Dell desktop with Windows 7. Getting rid of Windows XP was a massive boost to my productivity.
Primary server: Richter – a Dell PowerEdge R610 with 2x Intel Xeon X5650 processors and 48GB of RAM, running Ubuntu Linux 12.04
HPC: Dalek, Judoon and Ood – respectively, two near-identical SGI Altix 8400 clusters (32 nodes each, every node identical to Richter but with twice the RAM) and an SGI Altix UV1000 with 1TB of shared RAM

I have to admit, I’m not short on compute, but space for data and analysis is much harder to come by.

Software:
Desktop – Microsoft Office for general computing.
Servers – In terms of real work, I do my best to seek out the best open-source solutions for my problems, or write my own (see below). Pretty much all bioinformatics consists of “edge case” situations, where the edge in question is either a very niche application or a problem that has never been solved before. The pieces of software that spring to mind for sequencing work are the aligners I depend on (Stampy + BWA in hybrid mode), middleware like Picard and Samtools, and the Broad Institute’s phenomenal Genome Analysis Toolkit (GATK). More on this another day.
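For a flavour of how those pieces slot together, here’s a hedged sketch of a typical post-alignment chain from around this era (the jar locations and file names are illustrative, not prescriptive):

    # Sort the raw alignments (old-style samtools sort takes an output prefix).
    samtools sort sample.bam sample.sorted

    # Picard flags PCR/optical duplicates so they don't distort variant calls.
    java -jar MarkDuplicates.jar INPUT=sample.sorted.bam \
        OUTPUT=sample.dedup.bam METRICS_FILE=dup_metrics.txt

    # Index the BAM, then call variants with the GATK's UnifiedGenotyper.
    samtools index sample.dedup.bam
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -R ref.fa -I sample.dedup.bam -o sample.vcf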

Code:
Writing code is essential to every bioinformatician, even if you’re just writing wrappers and glue to stick pre-existing programs and code together. I’m fickle, but currently keen on:

Perl – good for scripting and anything moderately complex; it’s a solid workhorse, even if it’s unforgivably ugly. I’d rather write in Python, but I’m not as strong in it and don’t yet have a compelling reason to switch.
Bash – we all love to shell script. The ease and elegance of piping the input and output of programs into one another on a Unix command line is phenomenally useful and powerful (see the sketch at the end of this post).
SQL – if you’ve got a lot of data, sooner or later you’ll need to whack it in a database, and realistically that database will be MySQL. A friend of mine pointed out that a single well-written MySQL query can be more satisfying than hundreds of lines of far more complex code (there’s a small example below). On the other hand, all the kludge that comes with MySQL is as unwelcome as it is unwieldy and ugly (particularly the Perl DBI. Yuck).
R – I’m pushing more and more of my work through R. I think this is because I love figures and working visually (i.e. interpreting data sets by creating plots and other images), and R has powerful visualisation tools built right in (base graphics) or just a package download away (ggplot2). R is simultaneously beautifully simple and heinously broken, but you must love things because of their flaws, not in spite of them. The three primary pieces of advice I can offer new starters would be (in order of importance):

  1. Don’t give up. It’s hard, but nothing worth doing is easy
  2. Use the excellent RStudio IDE
  3. Read the R Inferno

If you’re not someone who spent years using the 32-bit Windows version on a puny machine (no context-sensitive highlighting, no tab completion, no easy list of variable types and dimensions), constantly running into “cannot allocate vector of size…” errors whenever it ran out of memory, you probably can’t appreciate the joy of working with the 64-bit version on a server with dozens of GB of RAM.
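As promised above, a taste of the kind of pipe that makes shell scripting so pleasant: a quick per-chromosome count of mapped reads straight out of a BAM, with no temporary files (sample.bam is, of course, a made-up name):

    # -F 4 excludes unmapped reads; field 3 of a SAM record is the reference name.
    samtools view -F 4 sample.bam | cut -f 3 | sort | uniq -c | sort -rn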
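And the promised MySQL example, run here through the mysql client from the shell. The schema (a variants table) is invented purely for illustration, but it’s the sort of one-liner that replaces pages of ad hoc scripting:

    # Count variants per chromosome in one query (table and columns are hypothetical).
    mysql -u reader -p variant_db -e \
        "SELECT chromosome, COUNT(*) AS n FROM variants GROUP BY chromosome ORDER BY n DESC;"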