Legoplots in R (3D barplots in R)

I previously mentioned these, but I thought about it some more and went ahead and wrote an R implementation.  With very little work you could rejig this code to display all kinds of 3D bar plots; I’ve commented the code thoroughly so it should be easy to follow.  Immediately below is an example plot and below that is a gallery comparing an original plot by the Broad Institute with two versions produced by my code.  I wouldn’t want to slavishly copy, so I’ve added an optional transparency effect that is both useful and attractive.

My code with transparency
All 96 different mutation/context combinations, with added transparency


Arguments

  • z : a named vector of length 96 representing each of the 6 possible mutations in 16 different sequence contexts.  Note that the names are hard-coded and can be found in the code below; a short example of building such a vector follows this list.  Values represent the height of each column
  • alpha : transparency of column sides.  Top is always opaque.  Default is 1 (opaque)
  • scalexy : value to scale the area of all columns.  Small values lead to skinny columns, larger values lead to fatter columns.  Default is 10
  • scalez : value to scale the height of all columns; see notes below.  Default is 1
  • gap : a value determining the space between each column.  Default is 0.2
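
As a quick illustration of the expected input, here is a minimal sketch with random dummy heights; the 96 names mirror the hard-coded type/context combinations used inside context3d() below:

## Build a named 96-element vector suitable for passing to context3d()
types=c("C.G.G.C","T.A.A.T","C.A.G.T","T.G.A.C","C.T.G.A","T.C.A.G")
contexts=c("TxT","CxT","AxT","GxT","TxC","CxC","AxC","GxC",
"TxA","CxA","AxA","GxA","TxG","CxG","AxG","GxG")
z=setNames(runif(96),as.vector(sapply(types,paste,contexts,sep="_")))
context3d(z,alpha=0.5)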

Things to note

  • It was essential to include the lit=FALSE argument when drawing surfaces, as the default lighting makes everything look very shiny and pretty ugly.  I also had to change the field of view (FOV) to 0, as the default value gives a slightly wall-eyed view.
  • The scalexy argument scales both the “fatness” of the columns and the gaps between them.
  • Scaling z using the scalez argument is useful if your data is very flat and you want to emphasize it.  It will have knock-on effects on the scale that appears at the side, so be careful.  If you’re going to transform your data, it’s probably best to do so before putting it into this function; see the one-liner after this list
  • The alpha transparency I added to the column sides is very useful for viewing columns that are obstructed by other columns and has the added benefit of looking pretty cool.
  • I haven’t included the wireframe axis background, I leave that as an exercise for the reader, or maybe I’ll add it later.
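
For example, instead of reaching for scalez you could transform the data first (a one-line sketch, assuming the counts vector read in at the bottom of the script):

## Emphasise flat data by transforming it before plotting rather than scaling the axis
context3d(log2(counts+1),alpha=0.4)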

Code is available below and example data is included if you check out my GitHub here.

## This script creates a "legoplot" similar to those produced by the Broad Institute
## The plot shows the relative abundance of each of the 6 possible mutations in the
## 16 sequence contexts

## Load packages
library(rgl)

#### START OF FUNCTIONS

## Functions modified from the "demo(hist3d)" examples in the rgl package:
# library(rgl)
# demo(hist3d)
## Note; would it have killed the original author to comment their code?

## Draws a single "column" or "stack".
## X and Y coordinates determine the area of the column
## The Z coordinate determines the height of the column
## We include "lit=FALSE" arguments to remove the nasty shiny surfaces caused by lighting
stackplot.3d<-function(x,y,z,alpha=1,topcol="#078E53",sidecol="#aaaaaa"){

## These lines allow the active rgl device to be updated with multiple changes
## This is necessary to draw the sides and ends of the column separately
save=par3d(skipRedraw=TRUE)
on.exit(par3d(save))

## Determine the coordinates of each surface of the column and its edges
x1=c(rep(c(x[1],x[2],x[2],x[1]),3),rep(x[1],4),rep(x[2],4))
z1=c(rep(0,4),rep(c(0,0,z,z),4))
y1=c(y[1],y[1],y[2],y[2],rep(y[1],4),rep(y[2],4),rep(c(y[1],y[2],y[2],y[1]),2))
x2=c(rep(c(x[1],x[1],x[2],x[2]),2),rep(c(x[1],x[2],rep(x[1],3),rep(x[2],3)),2))
z2=c(rep(c(0,z),4),rep(0,8),rep(z,8) )
y2=c(rep(y[1],4),rep(y[2],4),rep(c(rep(y[1],3),rep(y[2],3),y[1],y[2]),2) )

## These lines create the sides of the column and its coloured top surface
rgl.quads(x1,z1,y1,col=rep(sidecol,each=4),alpha=alpha,lit=FALSE)
rgl.quads(c(x[1],x[2],x[2],x[1]),rep(z,4),c(y[1],y[1],y[2],y[2]),
col=rep(topcol,each=4),alpha=1,lit=FALSE)
## This line adds black edges to the column
rgl.lines(x2,z2,y2,col="#000000",lit=FALSE)
}
# Example:
# stackplot.3d(c(0,1),c(0,1),3,alpha=0.6)

## Calls stackplot.3d repeatedly to create a barplot
## z is the heights of the columns and must be an appropriately named vector
context3d<-function(z,alpha=1,scalexy=10,scalez=1,gap=0.2){
## These lines allow the active rgl device to be updated with multiple changes
## This is necessary to add each column sequentially
save=par3d(skipRedraw=TRUE)
on.exit(par3d(save))

## Recreate Broad order
types=c("C.G.G.C","T.A.A.T","C.A.G.T","T.G.A.C","C.T.G.A","T.C.A.G")
contexts=c("TxT","CxT","AxT","GxT","TxC","CxC","AxC","GxC",
"TxA","CxA","AxA","GxA","TxG","CxG","AxG","GxG")
typeorder=c()
for(type in types){
typeorder=c(typeorder,paste(type,contexts,sep="_"))
}
z=z[typeorder]

## Reorder data into 6 regions
set1=c(1:4,17:20,5:8,21:24,9:12,25:28,13:16,29:32)
set2=set1+32
set3=set1+64
neworder=c(set1,set2,set3)

## Define dimensions of the plot
dimensions=c(12,8)

## Scale column area and the gap between columns
y=seq(1,dimensions[1])*scalexy
x=seq(1,dimensions[2])*scalexy
gap=gap*scalexy

## Scale z coordinate
z=z*scalez

## Set up colour palette
broadcolors=c("#805D3F","#72549A","#5EAFB2","#3F4F9D","#F2EC3C","#74B655")
colors=as.vector(sapply(broadcolors,rep,16))

## Plot each of the columns
for(i in 1:dimensions[1]){
for(j in 1:dimensions[2]){
it=(i-1)*dimensions[2]+j # Variable to work out which column to plot; counts from 1:96
stackplot.3d(c(gap+x[j],x[j]+scalexy),
c(-gap-y[i],-y[i]-scalexy),
z[neworder[it]],
alpha=alpha,
topcol=colors[neworder[it]],
sidecol=colors[neworder[it]])
}
}
## Set the viewpoint and add axes and labels
rgl.viewpoint(theta=50,phi=40,fov=0)
axes3d("y-+",labels=TRUE)
}
# Example:
# context3d(counts)

#### END OF FUNCTIONS

## Read in example data and cast to an appropriate vector
rawdata=read.table("snvspermegabase.txt",header=TRUE)
counts=as.numeric(rawdata)
names(counts)=colnames(rawdata)

## Example plots
context3d(counts)
context3d(counts,alpha=0.4)

## Save your images to files if you wish
rgl.snapshot(filename="example.png")

Stop the madness – no more pie charts

There has been a trend in the last few years to put interesting-looking but non-informative figures in papers; the pie chart is the worst recurrent offender.  I have no idea how they keep getting included, as they’re famously misleading and awful.  Like anyone, I want my work to look as much like the cockpit of a mecha or Iron Man’s HUD as possible, but I know that if it did, it wouldn’t be as clear or concise as it should be.

ku-xlarge
So cool

A recent example is this otherwise very interesting and good paper:

Subclonal diversification of primary breast cancer revealed by multiregion sequencing by Yates et al, 2015

Figure 2, reproduced below, is confusing at best.  These pie-chart-like plots were made famous by Florence Nightingale 150 years ago and have been called “coxcomb plots”.  Wikipedia claims that that’s a mistake and they are really called “polar area diagrams” or “courbes circulaires”.

nm.3886-F2

I admit that they look pretty cool and are fine if you’re building some sort of hip infographic about consumer tastes or Facebook trends.  However, I think that they’re inappropriate and misleading because:

  1. They look unusual.  The reader expends a lot of energy working out what they’re looking at instead of processing the relationships in the data
  2. These data contain only one variable that changes, but the plots used can encode two variables (arc length and radius can vary, discussed below).  The plot is therefore unnecessarily complex
  3. In pie charts, the arc length is proportional to the value represented.  In these plots, the arc length is identical for each slice of pie… but you might not know this and you may infer that there is less of a certain segment.  This is appropriate for the type of data that Florence Nightingale was plotting (deaths in each month of the year), but not for arbitrary divisions
  4. In these plots, the radius of each segment (i.e. how far it extends from the centre) is informative.  You’re supposed to read off an almost-invisible grey scale of concentric rings, but it’s not easy.  Also, the visual effect is non-linear because the area of a circle is πr^2, which means that a small increase in radius produces a disproportionately large increase in area
  5. It’s really hard to compare between plots; visual comparison is difficult and reading the scale to get at the numbers is even harder
  6. Multiple data types are represented in a single plot.  I’m not sure mixing somatic mutation variant allele frequencies and copy number log ratios is very effective

Some potential solutions

Criticizing without offering a solution isn’t very helpful, so I offer two solutions:

Use barplots or boxplots.  Yes, they’re boring, but they’re also familiar and easy to read.  I suppose you could even put a scale at either edge and mix data types without too much complaint (e.g. somatic VAFs and copy number logRs on the same plot).

A type of 3D barplot.  I generally think 3D is a bad thing when displayed on a 2D plane, particularly if it’s non-interactive; a 3D shape on a 2D display is fine if you can rotate and zoom it freely.  “Lego plots” have been popularized by the Broad Institute in their sequencing papers, usually to show the sequence context of mutations (see below; taken from Exome and whole-genome sequencing of esophageal adenocarcinoma identifies recurrent driver events and mutational complexity by Dulak et al, 2013).  The irony of the pie charts isn’t lost on me.

ng_2591-F1

They’re visually appealing and a relatively concise way of showing what would otherwise be a barplot of 96 values; it would get tricky if the data weren’t carefully arranged so that no column obscures another, though.

If 3D barplots are now acceptable, why don’t we make a 3D barplot that encodes two variables?  The height would represent one variable and the depth another.

I’ve created a small data set and some R code to illustrate these points below:

Alternative plots

The data set represents 4 mutations (labeled A to D) in a set of 100 samples.  Each sample has a variant allele frequency (VAF) for each mutation, between 0 (not present) and 1 (every sequencing read contains it).

# A is frequent (80/100 samples) and usually subclonal (VAF ~0.1)
# B is infrequent (20/100 samples) and usually clonal (VAF ~0.8)
# C is frequent (90/100 samples) and usually clonal (VAF ~0.9)
# D is infrequent (15/100 samples) and usually subclonal (VAF ~0.15)

  • Coxcomb / polar area plot.  Arc length represents proportion of samples, radius (wedge length) represents median VAF.  Can be difficult to interpret.

coxcomb

  • Boxplot.  Good representation of the data, but hard to tell how many samples are represented.  For example, B is far less common than C but appears to be very similar.  Likewise for D and A.

boxplot

  • Barplot.  No representation of the spread of the data.  The proportion of affected samples is encoded using colour (darker mutations affect more samples).  Perhaps the colours could be applied to the boxplot for the best result?

barplot

  • 3D barplot.  It’s very basic (missing axes and labels), but shows the relationships between proportion and VAF more clearly than other plots.  I added some transparency so no column totally obscures any others.  It’s more convincing when you can rotate the object yourself (download the code and try it yourself), but I think even a static image is better than a Coxcomb/polar coordinate plot.

3dbarplot

Code is below and available on GitHub here.

## Load packages
library(rgl)
library(ggplot2)

#### START OF FUNCTIONS

## Functions modified from the "demo(hist3d)" examples in the rgl package:
# library(rgl)
# demo(hist3d)
## Would it have killed the author to comment their code?

## Draws a single "column" or "stack".
## X and Y coordinates determine the area of the stack
## The Z coordinate determines the height of the stack
stackplot.3d=function(x,y,z=1,alpha=1,topcol="#078E53",sidecol="#aaaaaa"){

## These lines allow the active rgl device to be updated with multiple changes
save=par3d(skipRedraw=TRUE)
on.exit(par3d(save))

## Determine the coordinates of each surface of the stack and its edges
x1=c(rep(c(x[1],x[2],x[2],x[1]),3),rep(x[1],4),rep(x[2],4))
z1=c(rep(0,4),rep(c(0,0,z,z),4))
y1=c(y[1],y[1],y[2],y[2],rep(y[1],4),rep(y[2],4),rep(c(y[1],y[2],y[2],y[1]),2))
x2=c(rep(c(x[1],x[1],x[2],x[2]),2),rep(c(x[1],x[2],rep(x[1],3),rep(x[2],3)),2))
z2=c(rep(c(0,z),4),rep(0,8),rep(z,8) )
y2=c(rep(y[1],4),rep(y[2],4),rep(c(rep(y[1],3),rep(y[2],3),y[1],y[2]),2) )

## These lines create the sides of the stack and its top surface
rgl.quads(x1,z1,y1,col=rep(sidecol,each=4),alpha=alpha)
rgl.quads(c(x[1],x[2],x[2],x[1]),rep(z,4),c(y[1],y[1],y[2],y[2]),
col=rep(topcol,each=4),alpha=1)
## This line adds black edges to the stack
rgl.lines(x2,z2,y2,col="#000000")
}
# Example:
#stackplot.3d(c(-0.5,0.5),c(4.5,5.5),3,alpha=0.6)

## Calls stackplot.3d repeatedly to create a barplot
# x is a constant distance along x axis
# y is the depth of column
# z is the height of column
barz3d=function(x,y,z,alpha=1,topcol="#078E53",sidecol="#aaaaaa",scaley=1,scalez=1){
## These lines allow the active rgl device to be updated with multiple changes
save=par3d(skipRedraw=TRUE)
on.exit(par3d(save))

## Plot each of the columns
n=length(x)
breaks.x = seq(0,n-1)
for(i in 1:n){
stackplot.3d(c(breaks.x[i],breaks.x[i]+1),c(0,-y[i])*scaley,z[i]*scalez,alpha=alpha,topcol=topcol)
}
## Set the viewpoint
rgl.viewpoint(theta=30,phi=25)
}
# Example
#barz3d(x=LETTERS[1:4],y=c(0.8,0.2,0.9,0.15),z=c(0.11,0.75,0.89,0.16),alpha=0.4,scaley=2,scalez=2)

#### END OF FUNCTIONS

## Example data:
# 4 mutations in 100 samples
# VAF range is from 0 to 1
# A is frequent and usually subclonal
# B is infrequent and usually clonal
# C is frequent and usually clonal
# D is infrequent and usually subclonal

Avaf=rnorm(80,0.1,0.05)
Bvaf=rnorm(20,0.8,0.1)
Cvaf=rnorm(90,0.9,0.05)
Dvaf=rnorm(15,0.15,0.05)

## Summarize data in new object
vafsum=data.frame(median=sapply(list(Avaf,Bvaf,Cvaf,Dvaf),median),
proportion=sapply(list(Avaf,Bvaf,Cvaf,Dvaf),function(x){length(x)/100}))
rownames(vafsum)=c(LETTERS[1:4])

## Code to produce coxcomb/polar coordinate plot adapted from:
## http://robinlovelace.net/r/2013/12/27/coxcomb-plots-spiecharts-R.html
## https://github.com/Robinlovelace/lilacPlot
pos = 0.5 * (cumsum(vafsum$proportion) + cumsum(c(0, vafsum$proportion[-length(vafsum$proportion)])))
p = ggplot(vafsum, aes(x=pos)) + geom_bar(aes(y=median), width=vafsum$proportion, color = "black", stat = "identity") + scale_x_continuous(labels = rownames(vafsum), breaks = pos) # Linear version is ok
p + coord_polar(theta = "x")
# (ignore warnings thrown)

## A traditional boxplot
boxplot(Avaf,Bvaf,Cvaf,Dvaf,names=LETTERS[1:4])

## A barplot where height represents median VAF and the color of the bar represents
## how many samples contain each mutation
barplot(vafsum$median,names=LETTERS[1:4],col=rgb(0.1,0.1,0.1,vafsum$proportion))

## Our new 3D barplot function
barz3d(x=LETTERS[1:4],y=vafsum$proportion,z=vafsum$median,alpha=0.4,scaley=2,scalez=2)
rgl.snapshot("3dbarplot.png", fmt = "png", top = TRUE )

R, GRanges objects and intersecting them

I discovered unexpected behaviour when intersecting GRanges objects; this is how to avoid the problem I found.  Maybe my intuition wasn’t good, but this represents the worst type of bug: a silent one.  If I had not spotted it, I would have lost several nonsynonymous KRAS mutations in my dataset.

If you want to represent a long list of genomic coordinates in R, you probably want to use a GRanges object.  It’s a very space and time efficient way of storing the data, even if it can be a little difficult to get accustomed to.  It’s ideal for representing bed files, aligned sequence data, SNV and indel coordinates, annotations and so forth.

In my case, I was intersecting two GRanges objects; I wanted to see which of my SNVs were within the boundaries of an exome capture kit.  In the example below, we use three exome capture regions (one from chromosome 1, exon 2 of KRAS on chromosome 12 and one from chromosome Y) and five SNVs, three of which are consecutive, i.e. they occur at immediately adjacent positions.  I could have just used exon 2 of KRAS, but I wanted to demonstrate that the code can handle multiple locations on different chromosomes at once.

The SNVs could be in the same sample or different samples; it doesn’t matter.  What matters is that their coordinates are contiguous.  When intersected, the contiguous SNVs are collapsed into a single range (which is undesirable), so the regions are then expanded back to single-base-pair sites.  You of course can’t do this with indels longer than 1 bp.

We could have avoided this bug entirely if we’d worked in reverse and used setdiff() instead of intersect() to identify SNVs that don’t fall within the exome.

# Load packages
library(GenomicRanges)

## Create GRanges object of three exons
grexome=GRanges(seqnames=Rle(c(1,12,"Y")),
ranges=IRanges(c(13403,25398206,27010592),
end=c(13639,25398320,27010664)))

## Create GRanges object of three consecutive SNVs and
## two others on different chromosomes
grsnvs=GRanges(seqnames=Rle(c(1,rep(12,3),"Y")),
ranges=IRanges(c(13500,25398284,25398285,25398286,27010600),
end=c(13500,25398284,25398285,25398286,27010600)))

## Intersect the SNVs with the exome.
## You DO NOT get five hits back, but three.
## The 3 SNVs on chr12 are reported as a single, contiguous region
grhits=intersect(grsnvs,grexome)

## We expand the region back to the original, single-base sites
## and paste the appropriate chromosome to it, separated by a colon.
## Some duplication occurs, so we remove this using unique()
allpositions=as.data.frame(mapply(":",start(grhits),end(grhits)))
hits=as.vector(apply(allpositions,1,function(x){
paste0(as.vector(seqnames(grhits)),":",x)
}))
hits=sort(unique(hits))

## If you want a GRanges object again, just split the
## output back into its components and rebuild.
## Final line demonstrates that our results are identical to the input
chromosomes=sapply(strsplit(hits,":"),"[",1)
positions=as.numeric(sapply(strsplit(hits,":"),"[",2))
grsnvhits=GRanges(seqnames=Rle(chromosomes),ranges=IRanges(positions,end=positions))
identical(grsnvs,grsnvhits)
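
As a postscript to that setdiff() suggestion, here is a minimal sketch of both directions, reusing the grsnvs and grexome objects from above.  Note that subsetByOverlaps() is an alternative I’m adding here, not something used in the original analysis:

## SNVs that fall within the exome, without collapsing adjacent positions
grwithin=subsetByOverlaps(grsnvs,grexome)

## Working in reverse: SNV positions that do NOT fall within the exome
## (empty in this example, as all five SNVs lie inside the capture regions)
groutside=setdiff(grsnvs,grexome)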


Analysing Dracula

I recently read Bram Stoker’s Dracula, mostly because I wanted to rewatch the 1990s movie of the same name and see how badly they’d mutilated the story.  [Update 26/04/2015: I rewatched the 1992 movie and while it’s the version most faithful to the book, it has some weird additions and is utterly appalling.  People have been mean about Keanu Reeves’ performance, but he’s no worse than anyone else and the direction is the real issue.  The Annie Lennox song at the end is nice, though].  The book is wholly composed of diary entries, letters and newspaper clippings.  This means that it’s effectively a self-contained social network, with every “sample” having a date, an author and an intended recipient.  It occurred to me that it might be interesting to look at these data, and nobody else seems to have done so.  Fortunately, someone had tabulated much of the data, even if they hadn’t done anything further with it.  So thanks to Harold J Hotchkiss for typing it up over a decade ago.

Did it really happen?

Obviously not.  The year it takes place isn’t important, but it was published in 1897 and ends with a note dated “7 years later”, so let’s assume it takes place in 1890.  If you calculate the days of the week that events occur on throughout the book, you get the plot below.  I have no empirical evidence to back this up, but it looks too evenly distributed to me to be real.  I’ve looked at real time series before and you will generally see some sort of trend, be it for weekdays, weekends or a particular day.  As it says in the book, “denn die Todten reiten Schnell” (“for the dead travel fast”) and they apparently don’t take days off, either.

documents_by_day_of_the_week1
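
If you wanted to put a number on that impression, a quick chi-squared goodness-of-fit test against a uniform spread of weekdays would do; a minimal sketch, assuming the chron object built in the code at the end of this post:

## Are documents spread evenly across the days of the week?
## A small p-value would suggest the spread is not uniform
chisq.test(table(chron$Weekday))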

Authorship and structure of the book

~70% of the book is diary entries, chiefly written by three authors: John Seward (~40% of diary entries), Jonathan Harker (~31%) and Wilhelmina “Mina” Murray-Harker (~24%).  Mina gets a double-barrelled surname because she marries Jonathan midway through the book and changes her surname, which was the style at the time.  I needed a single name to refer to both her ‘states’.  There is some truly incredible sexism in the book; it seems like gender was a big deal in the late 19th century.  You can divide documents by type (e.g. letter, diary entry, etc) and also by who sent them and who received them.

documents_to_barplot documents_from_barplot

The above figures are global summaries of the entire book.  What is happening over the course of the book?  This is slightly complicated because the book isn’t in strict chronological order; chapters may feature letters and events that occurred previously.

document_type_by_chapter

The figure above gives a clear view of the structure of the book.  The plots below show who is writing what and to whom.  The first few chapters are Jonathan Harker’s diary of events in Castle Dracula, before a flurry of letters to introduce most of the rest of the characters.  Subsequently, the captain of the Demeter gives his account of bringing Dracula to Whitby (summary; not a lot of fun), and then Dr Seward gets on with the business of writing the majority of the book.  You can also clearly see how certain characters come and go; for example, Lucy Westenra’s death and Jonathan Harker’s heroic reappearance in the second half.

document_sender_by_chapter document_receiver_by_chapter

Barplots are great, but there are other ways to represent the data, and they can show structure that is hard to grasp in any other way.

Timeline

dracula_sender_timeline

The above plot is a timeline of events in the book; it takes place in a mere 7 months.  Events in odd-numbered chapters are above the timeline and those in even-numbered chapters are below.  An arbitrary Y-axis separates events by their document type.  Events are represented by coloured squares corresponding to the sender of the document, with the colours matching those used in other plots here.

Network analysis

We’ll start with the most interesting figure, the network itself.  Primary characters are green hexagons and secondary characters are blue triangles.  Nodes are labeled with character initials and scaled relative to how many documents they sent or received.  Edges between nodes are also scaled relative to how many documents are sent.

dracula_network

A few things of note:

  • Nobody ever writes or receives any correspondence from Count Dracula
  • Jonathan Harker and Mina Murray-Harker never speak.  Letters are mentioned, but we never get to read them
  • It’s fairly interesting how many characters never speak to one another
  • Arthur Holmwood receives a lot of post, but doesn’t bother writing back.  I guess he’s too important being a Lord
  • Quincey exists mostly as a bizarre stereotype of Americana and he only communicates with Arthur Holmwood once.  He spends the entire book being very worthy, his actions announcing “I will die heroically” to the reader
  • The cluster of secondary characters consists of some removal companies talking to each other about Dracula’s boxes of earth, some newspapers reporting on events at Whitby and the “bloofer lady” (i.e. undead Lucy), and of course the unfortunate captain of the Demeter, scribbling in his log about the worst cruise ever

Below is a barplot of how many characters each character sends and receives documents from (this is a summary of the edges in the network above).  It includes self-edges, so people who write only to themselves still “receive” at a value of 1.  It further illustrates the fact that Arthur Holmwood is the most popular girl at school, whilst Jonathan Harker is a weird, angry loner who talks only to himself.  Van Helsing is nice enough to write to him… once.

characters_interacted_with_barplot

I have also produced a correlation plot, although the data plotted are not correlation values.  It shows the proportion of documents each author sent to each recipient.  The scale is misleading, as all values are non-negative (i.e. between 0 and 1).  I find it sad that Van Helsing likes to write letters and never writes in his own self-absorbed diary, but people don’t write back.  The Y-axis is sender and the X-axis is recipient.

sender_receiver_corrplot

Finally in this section, an alternative to a network is an arcplot.  These can look very nice indeed, but I don’t think they work particularly well for these data:

dracula_arcplot

Wordcloud

By all means go nuts and do some sentiment analysis or something more complicated, but I was happy just to create the obligatory word cloud.  They’re not really empirical, but they can be nice to look at, I guess.  The most common word is “said”, because the book is a “he-said-she-said” bonanza, like all social networks.

wordcloud

Final thoughts

It holds up quite well.  I was surprised to see how “modern” the text was; for example,

“To use an Americanism, he had ‘taken no chances'”

In other ways, it’s comically out of date.  It’s very much a story about a group of wealthy people and anyone else is given short shrift and comical accents.  Servants are also plentiful and the word “peasant” is used semi-frequently, usually to refer to people from central and eastern Europe:

Some of them were just like the peasants at home or those I saw coming through France and Germany, with short jackets, and round hats, and home-made trousers; but others were very picturesque.  The women looked pretty, except when you got near them, but they were very clumsy about the waist.

The sexism is so overt it’s quite funny, with men being strong and noble whilst women are mostly feeble and emotional victims:

“Ah, that wonderful Madam Mina!  She has man’s brain, a brain that a man should have were he much gifted, and a woman’s heart.”

It can just about be forgiven for its time, as Mina is ultimately instrumental in defeating Dracula and has a central role in events overall, even if she eventually becomes a bit of a damsel in distress.  On the other hand, if you wrote a vampire novel today, it would be unforgivable to have a female protagonist willingly submitting to all kinds of demeaning emotional and physical abuse… wouldn’t it?

Perhaps the strangest thing about the story is the way they combat and defeat vampires; on the one hand, the most potent weapons against them appear to be herbs and Christian mythology (particularly bits of magic wafer), all of which is treated with absolute po-faced seriousness.  On the other hand, while vampires seem to be averse to these things, they are not physically harmed by them.  They are susceptible to ultraviolence, though, as stabbing a vampire in the heart and sawing its head off is pretty effective.  Maybe the power of prayer is overstated.  The next time you suspect that you’re in the presence of a supernatural being intent on giving you immortality (although there are some downsides), try warding them off with a Kit Kat before you decide to brutally murder them, which will be frowned upon if they’re not actually one of the undead.

Files

You can access the code and files used on my GitHub repository, but the R code is copied below if you’re interested in a quick look at how the plots were made.  You can download the book itself for free from Amazon, and also a nice text version from Project Gutenberg.

#### Load packages ####
library(openxlsx) # for reading Excel files
library(RColorBrewer) # nice color palettes
library(ggplot2) # fancy-looking plots
library(corrplot) # correlation plot
library(network) # for network plot
library(sna) # for network plot
library(tm) # Text mining package
library(wordcloud) # word cloud plotting

#### Read in data and setup ####

## Read from Excel. Note that we have word counts for chapters,
## but never bother using them
chron = read.xlsx("dracula.xlsx",sheet="chronology", startRow=1,
 colNames=TRUE, skipEmptyRows=TRUE, rowNames=FALSE)
wc = read.xlsx("dracula.xlsx",sheet="wordcounts",
 startRow=1, colNames=TRUE, skipEmptyRows=TRUE, rowNames=FALSE)

## Reorder type column for aesthetic reasons and
## cast chapters as a factor
chron$Type=factor(chron$Type,levels=c("Diary","Letter",
"Telegram","Memo","Newspaper","Ship's log"))
chron$Chapter=factor(chron$Chapter)

## To reorder the characters, we manually define
## the order to ensure proper grouping
## A frequency-based approach looks more confused
primarycharacters=c("John Seward",
"Jonathan Harker",
"Wilhelmina Murray-Harker",
"Lucy Westenra",
"Abraham Van Helsing",
"Arthur Holmwood",
"Quincey P Morris")

secondarycharacters=c("Captain of the Demeter",
"The Dailygraph",
"The Westminster Gazette",
"The Pall Mall Gazette",
"Carter, Paterson & Co",
"Samuel F Billington & Son",
"Mitchell, Sons & Candy",
"Rufus Smith",
"Patrick Hennessey",
"Sister Agatha")
chron$From=factor(chron$From,levels=c(primarycharacters,
secondarycharacters))
chron$To=factor(chron$To,levels=c(primarycharacters,
secondarycharacters))

## Store the initials of each character for labelling plots
initials=c("JS","JH","WMH","LW","AVH","AH","QPM","CD",
"DG","WG","PMG","CPC","SBS","MSC","RS","PH","SA")

## Calculate days of the week - this cannot be done in
## Excel before the year 1900
chron$Date=as.Date(gsub(" ","-",chron$Date))
chron$Weekday=factor(weekdays(chron$Date),
levels=c("Monday","Tuesday","Wednesday",
"Thursday","Friday","Saturday","Sunday"))

## Define color palettes
pcols=brewer.pal(length(primarycharacters),"Set1")
scols=brewer.pal(length(secondarycharacters),"Set3")
acols=c(pcols,scols)

gothic=c("#6B090F","#360511","#080004","#230024","#3D1D54")

#### End of reading data and setup ###

#### Basic plots ####

## Which days documents were written on
ggplot(chron,aes(Weekday,fill=Type))+geom_bar()+
ggtitle("Documents by day of the week")

## Plot document types with who wrote them and who to:
ggplot(chron,aes(Type,fill=From))+geom_bar()+
scale_fill_manual(values=acols)+ggtitle("Documents From")
ggplot(chron,aes(Type,fill=To))+geom_bar()+
scale_fill_manual(values=acols)+ggtitle("Documents To")

## Who writes what and when, split by chapters
ggplot(chron,aes(Chapter,fill=Type))+geom_bar()+
ggtitle("Document type by chapter")
ggplot(chron,aes(Chapter,fill=From))+geom_bar()+
scale_fill_manual(values=acols)+ggtitle("Document sender by chapter")
ggplot(chron,aes(Chapter,fill=To))+geom_bar()+
scale_fill_manual(values=acols)+ggtitle("Document receiver by chapter")

## A heatmap of who sends what to who...
fromtotable=table(chron$From,chron$To)
fromtoproportions=fromtotable/rowSums(fromtotable)
corrplot(fromtoproportions,method="color",tl.col="black")

## corrplot() messes with the device setup, so reset it
dev.off()

#### End of basic plots ####

#### Timeline ####

## Crop off the last entry, as it's 7 years later
main=chron[1:197,]

## Define heights for points
height=rep(0,nrow(main))
height[main$Type=="Diary"]= 1
height[main$Type=="Ship's log"]=2
height[main$Type=="Memo"]= 3
height[main$Type=="Telegram"]=4
height[main$Type=="Newspaper"]= 5
height[main$Type=="Letter"]= 6

## Define vectors of colours for the plot
fromcols=as.character(main$From)
tocols=as.character(main$To)
for(i in 1:length(levels(chron$From))){
fromcols[fromcols==levels(chron$From)[i]]=acols[i]
tocols[tocols==levels(chron$From)[i]]=acols[i]
}

## Use chapter information to determine whether the
## point is above or below the line
## Odd numbers above, even below
main$Chapter=as.numeric(as.character(main$Chapter))
height[main$Chapter%%2==0]=height[main$Chapter%%2==0]*-1

## Perform plot
plot(x=main$Date,y=rep(1,nrow(main)),type="n",xaxt = "n",yaxt="n",
bty = "n",xlab = "Date", ylab=NA,ylim=c(-6,6),main="Dracula Timeline")
abline(h=-6:6,col="lightgrey")
points(main$Date,height,pch=15,col=fromcols)
segments(main$Date,rep(0,nrow(main)),main$Date,height,col=fromcols,lty=1)
#points(main$Date,height,pch=15,col=tocols) #uncomment for "to" colours
#segments(main$Date,rep(0,nrow(main)),main$Date,height,col=tocols)
u = par("usr")
arrows(u[1], 0, u[2], 0, xpd=TRUE,lwd=3)
xlabels=paste(c(month.abb[5:11]),"1890")
ylabels=c("Diary","Ship's log","Memo","Telegram","Newspaper","Letter")
axis(1, at=as.Date(c("1890-05-01","1890-06-01","1890-07-01",
"1890-08-01","1890-09-01","1890-10-01","1890-11-01")),labels=xlabels)
axis(2,at=c(-6:6),labels=c(rev(ylabels),"",ylabels),las=2,lty=0,cex.axis=0.7)

#### End of timeline ####

#### Network ####

## Create network of who communicates with who
comtable=table(chron$From,chron$To)
comtablebinary=comtable/comtable # divide by itself so all values are 1
comnetwork=network(comtablebinary,directed=TRUE,loops=TRUE)

## Number of people who write to each character
to=(colSums(comtablebinary,na.rm=TRUE))
## Number of people who each character writes to
from=(rowSums(comtablebinary,na.rm=TRUE))
sentreceived=cbind(from,to)

## Barplot of results
barplot(t(sentreceived),main="Characters interacted with",
ylab="Characters",beside=TRUE,names=initials,
las=3,col=c("#078E53","#111987"))
legend("topright",c("Sent","Received"),pch=rep(15,2),
col=c("#078E53","#111987"))

## Generate some scaled edge widths
edgewidths=as.vector(comtable)
edgewidths=edgewidths[edgewidths!=0]
edgewidths=sapply(log2(edgewidths),max,1)

## Perform network plot
gplot(comnetwork,
interactive=TRUE, # interactive plotting; set to FALSE for a static plot
diag=TRUE, # self-loops
loop.steps=5, # angular self-loops
usecurve=TRUE, # curvy arrows
arrowhead.cex=0.5, # smaller arrowheads
label=initials, # initials not full names
label.pos=5, # labels inside shapes
label.cex=ifelse(levels(chron$From)%in%primarycharacters,1.1,0.75),
label.col=ifelse(levels(chron$From)%in%primarycharacters,"black","white"),
vertex.cex=(log2(table(c(as.character(chron$From),
as.character(chron$To)))[levels(chron$From)])+3)/2, # log2 scale
vertex.sides=ifelse(levels(chron$From)%in%primarycharacters,6,3),
vertex.rot=ifelse(levels(chron$From)%in%primarycharacters,0,30),
vertex.col=ifelse(levels(chron$From)%in%primarycharacters,
"#078E53","#111987"),
edge.lwd=edgewidths
)

#### End of network ####

#### Arc diagram ####

## IMPORTANT NOTE: loading this package will break the network-plotting
## code, which is why it's all commented out here:
# https://github.com/gastonstat/arcdiagram
# http://gastonsanchez.com/work/starwars/
# library(devtools) # for installing arcdiagram
# install_github('arcdiagram', username='gastonstat')
# library(arcdiagram) # for plotting arc diagram

## Generate a suitable object, order it for easy access and
## remove self-communication, as this can't be represented
## on an arc diagram
arcmat=as.matrix(chron[,c("From","To")])
arcmat=arcmat[!(arcmat[,"From"]==arcmat[,"To"]),]
arcmat=arcmat[order((arcmat[,2])),]
arcmat=arcmat[order((arcmat[,1])),]

## Define which arcs we will place above and which below
AVH2JS=c(1:6,14:16)
AVH2JH=c(7)
AVH2WMH=c(8:9,31:32)
WMH2SA=30
WMH2LW=c(21:23,33:36)

## Define a custom order of nodes
ordering=c("Abraham Van Helsing",
"John Seward",
"Jonathan Harker",
"Wilhelmina Murray-Harker",
"Arthur Holmwood",
"Quincey P Morris",
"Lucy Westenra",
"Carter, Paterson & Co",
"Samuel F Billington & Son",
"Mitchell, Sons & Candy",
"Sister Agatha",
"Rufus Smith",
"Patrick Hennessey")

## Perform the arc plot
## Col.nodes has a bug, so colours are defined manually
## Could be improved with weighted line arcs and/or nodes
arcplot(arcmat,above=c(AVH2JS,AVH2JH,AVH2WMH,WMH2SA,WMH2LW),
ordering=ordering,
pch.nodes=15,
col.nodes=c(rep("#078E53",6),rep("#111987",2),"#078E53",rep("#111987",4)),
cex.nodes=2,
cex.labels=0.5,
lwd.arcs=3,
col.arcs="#A71930")
legend("topleft",c("Primary character","Secondary character"),
pch=rep(15,2),col=c("#078E53","#111987"),box.lty=0)

#### End of arc diagram ####

#### Word cloud ####

## Plot a word cloud. It's not really quantitative, but it's pretty

## Create a corpus and do some basic cleaning up
book=read.delim("pg345.txt",header=FALSE,stringsAsFactors=FALSE)
corpus=Corpus(DataframeSource(book),
readerControl=list(language = "en"))
corpus=tm_map(corpus, content_transformer(removePunctuation))
corpus=tm_map(corpus, content_transformer(tolower))
corpus=tm_map(corpus, function(x) removeWords(x, stopwords("english")))

## Create the necessary objects for plotting
tdm = TermDocumentMatrix(corpus)
tdmatrix = as.matrix(tdm)
tdcounts = sort(rowSums(tdmatrix),decreasing=TRUE)
tddf = data.frame(word = names(tdcounts),freq=tdcounts)

## Perform plot
wordcloud(tddf$word,tddf$freq,min.freq=3,
random.order=FALSE, rot.per=0.15, colors=gothic,random.color=TRUE)

#### End of word cloud ####


Understanding BAM files; part 2

Now that you have the basics after part 1, let’s look at something more complicated; the second column of the main body of BAM files, the bitwise flag.

This isn’t simple.  Both the samtools manual and even the SAM specification ostensibly explain what the flag is, but do so in a really opaque way, a sort of “if you have to ask, you’ll never know” situation.  The bitwise flag looks like an integer, but it isn’t a score.  You might see “99” or “147” and think it’s a score of some kind, but it’s actually a set of binary flags packed into a single number: each bit records whether one property of the read is true or false.  The data stored in this flag is crucial, including things like whether the read is a duplicate or doesn’t pass quality controls.

The primary effect that this has is that you can kiss goodbye to doing anything complex  with BAM files using standard command-line text parsing tools.  If you want to really understand what is encoded in your BAM files, you’re going to have to get serious and become familiar with the samtools API, or some interface to it.  More on that in the next part.

If you want a specific flag explained, you can use this handy calculator.  But where are those values coming from?

With the SAM bitwise flag, there are 12 properties, each of which is either TRUE or FALSE (1 or 0).  This gives you a string of 12 binary digits, which is converted to a decimal number because it takes up less space.  Below is a summary; the first value is the binary representation, the second is its decimal equivalent and the text is what it represents:

  1. 000000000001 : 1 : read paired
  2. 000000000010 : 2 : read mapped in proper pair
  3. 000000000100 : 4 : read unmapped
  4. 000000001000 : 8 : mate unmapped
  5. 000000010000 : 16 : read reverse strand
  6. 000000100000 : 32 : mate reverse strand
  7. 000001000000 : 64 : first in pair
  8. 000010000000 : 128 : second in pair
  9. 000100000000 : 256 : not primary alignment
  10. 001000000000 : 512 : read fails platform/vendor quality checks
  11. 010000000000 : 1024 : read is PCR or optical duplicate
  12. 100000000000 : 2048 : supplementary alignment

To arrive at the final value, we just add the decimal values of the TRUE properties together; you can see how easy this is when the binary values are stacked vertically.  Any FALSE values are ignored, as they are zero.

For example, a read that is paired (1), in a proper pair (2), with a mate on the reverse strand (32) and that is the first in a pair (64) has a flag of 99 because:

1+2+32+64=99

The corresponding read in the pair, which is paired (1), in a proper pair (2), on the reverse strand (16) and second in the pair (128), has a flag of 147 because:

1+2+16+128=147

You should see a lot of pairs of reads with 99 and 147, and also 83 and 163, which is the same situation but with the reads in the opposite orientation.  I would advise against trying to parse SAM flags as decimal values yourself; adopt a nice programmatic way of accessing them instead, which I’ll write about next time.
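
If you just need to check a flag or two, you can decode them in R with base bitwAnd(); a minimal sketch, with the property names taken from the table above:

## Decode a SAM bitwise flag into its individual properties
flagbits=c("paired"=1,"proper pair"=2,"read unmapped"=4,"mate unmapped"=8,
"read reverse strand"=16,"mate reverse strand"=32,"first in pair"=64,
"second in pair"=128,"not primary alignment"=256,"fails QC"=512,
"duplicate"=1024,"supplementary"=2048)
decodeflag=function(flag){names(flagbits)[bitwAnd(flag,flagbits)>0]}
decodeflag(99)  # "paired" "proper pair" "mate reverse strand" "first in pair"
decodeflag(147) # "paired" "proper pair" "read reverse strand" "second in pair"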

Understanding BAM files; part 1

SAM and BAM files are considerably more complex than they first look.  The SAM format specification hides the complexity well and it is easy to deceive yourself into doing the wrong thing.  That document is worth reading, but I will try to simplify it further here.  SAM files (Sequence Alignment Map) contain short DNA sequences aligned to a reference genome.  BAM files are simply the binary version, which means that they are compressed and about 1/3rd the size.  A SAM file is just a text file and can be read by anything; a BAM file can only be accessed by using particular software; e.g. samtools.

There are two parts to a SAM file: the header and the body.  The header contains information about the file, usually in the following fields:

  1. The @HD line tells you the version number of the SAM specification and whether or not the file is sorted.  Sorting by genomic coordinates is the most common type
  2. @SQ lines tell you which reference sequences the reads have been aligned against.  There will be one line for every contig (e.g. a chromosome), with the sequence name and length (e.g. SN:1 LN:249250621)
  3. @RG line.  The Read Group line contains tags detailing the origins of the input data
  4. @PG lines tell you programs and commands used to create the file

SAM headers are not immutable, so don’t put too much trust in them.  You can see the header of a BAM file by using this samtools command:

samtools view -H bamfile.bam
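
The output looks something like the following; this is an illustrative, made-up example, with the tags laid out as described in the list above:

@HD    VN:1.4    SO:coordinate
@SQ    SN:1    LN:249250621
@SQ    SN:2    LN:243199373
@RG    ID:lane1    SM:tumour    PL:ILLUMINA
@PG    ID:bwa    PN:bwa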

The body of the file is where the aligned data is stored.  There is one row for every read.  Therefore, if you’re looking at paired end data (e.g. Illumina sequencing data) there will be two rows for every pair; one for the forward read and one for the reverse.

There are at least eleven columns in each row.  There may also be further tags appended at the end of these to denote extra data, like mapping scores, RG tags and so forth.

  1.  QNAME: the name of the query sequence e.g. a read name like HWI-ST886:247:D2G2MACXX:3:2102:4127:4699
  2. FLAG: a bitwise flag.  You’ll want to read Understanding BAM files; part 2 for an explanation
  3. RNAME: the name of the reference contig the sequence is aligned to e.g. 1 (chromosome 1)
  4. POS: the position on the reference contig that the alignment starts at e.g. 123456 (in this case, the 123,456th base of chromosome 1)
  5. MAPQ: the mapping quality of the read; the maximum value is usually 60.  This number is Phred-scaled, meaning that it’s logarithmically scaled (see the short sketch after this list).  For example:
    1. MAPQ of 10 means a 1/10 chance that the mapping is wrong
    2. MAPQ of 20 means a 1/100 chance that the mapping is wrong
    3. MAPQ of 60 means a 1/1,000,000 chance that the mapping is wrong
  6. CIGAR: this string tells us how to match the query sequence to the reference sequence.  In the simplest case, say the read is 100 bases along and matches perfectly; CIGAR would be 100M.  However, you can encode mismatches, insertions and deletions and it quickly becomes very complicated
  7. RNEXT: the name of the reference contig that the other read in the pair aligns to.  If identical to this read, an equals sign (“=”) is used
  8. PNEXT: the position on the reference contig that the other read in the pair aligns to.  This depends on the size of the DNA fragments in the sequencing library and will usually be a few hundred base pairs away (see TLEN, below)
  9. TLEN: the length of the template fragment.  This is the distance from the leftmost base of the first read to the rightmost base of the second read of a pair.  This value is signed (+/-), so you can infer the orientation of the reads, too
  10. SEQ: the DNA sequence of the query sequence.  This will be identical to the sequence in the FASTQ file that was aligned to the reference genome
  11. QUAL: the corresponding base quality scores of the SEQ.  This will be identical to the scores in the original FASTQ file.  Note that it’s encoded as ASCII+33, so all the letters and punctuation symbols actually correspond to regular integers
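
As a footnote to the MAPQ entry above, the Phred scaling is easy to work with in R; a minimal sketch:

## Convert a Phred-scaled mapping quality into the probability that the mapping is wrong
mapq_to_error=function(mapq){10^(-mapq/10)}
mapq_to_error(c(10,20,60)) # 0.1, 0.01, 0.000001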

Next time, we’ll look at the bitwise flag in more detail, as this is a major source of confusion.

MuTect – a pleasant surprise

I was sceptical (not a typo, I’m British).  Who wouldn’t be?  I’d invested literally hundreds of hours in my sequencing pipeline, which was built almost entirely around GATK, and had even recently re-tooled it to accommodate the sweeping changes brought by GATK 2.0, and it’s great, truly.  However….

I had been enviously eyeing up MuTect for quite some time, as it was built from the ground up to serve one of the purposes I am currently knee-deep in: finding mutations in tumour-normal pairs.  To be clear, I mean comparing sequencing data from a tumour sample with a normal sample from the same patient.  On the other hand, I gather GATK has its roots in GWAS studies and has always been more population-focussed.

The problem we’re all wrestling with currently – which, by its very nature, is never going away – is intraclonal heterogeneity, the idea that cancers are not monocultures of billions of copies of the same cell, but are highly heterogeneous mixes of potentially thousands (millions?) of individual clones, all of which are descended from (probably) one originator cell.  These clones compete with one another for resources in their niche and are effectively locked in an arms-race with one another, as well as with the host body.  Cancer patients are not facing a single entity, but a legion of cells that reinvent, mutate and transfigure themselves in the best Darwinian tradition.

As it stands, sequencing data isn’t a great read-out, as it’s produced from DNA pooled from thousands of tumour cells.  Even if you used a low-starting-amount protocol and had 50 ng of tumour DNA with no amplification steps, that’s still:

50 ng / 6 pg (DNA content of a single diploid cell) = >8000 cells

I guess if these cells come from a single cell type that is easy to purify (as is the case for myeloma – you can isolate tumour cells by selecting for CD138+ cells), that’s not terrible news; but what if your sample is a solid tumour?  Who knows what kind of structural heterogeneity you need to contend with?

Still, your sample is “bulk DNA” and is a sampling of potentially thousands of separate genomes compressed into a single readout.  I’m sure I’ll post more on the subject at a later date, but for now let’s say that it’s a definite issue.

In a homogeneous population of cells, SNVs (single nucleotide variants) will be present in all cells, but this will not be the case in a heterogeneous population of cancer cells.  MuTect’s real power lies in its ability to detect these subclonal mutations.  The detection is limited mostly by the depth of the sequencing, with quickly diminishing returns – supplementary table 1 in the MuTect paper does a good job of illustrating this, particularly if you turn on conditional formatting in Excel to turn the table of values into a heatmap.  Everybody loves conditional formatting!

MuTect’s output is not without issues, but I’ll discuss further filtering of the output next time.

Finally, it’s worth noting that YES, MuTect does output in VCF format… it’s just undocumented.  If you ask very nicely in their forums, the authors will help you.

It’s not me, it’s you – don’t use Perl for math(s)

After a good five years with Perl, it finally crossed the line and forced me to start moving over to Python.  This is a short cautionary tale with two messages:

  1. No language is suitable for every job
  2. Languages are only as good as their libraries/modules/packages/etc

I like Perl a great deal; it’s ideal for almost any task that can’t be achieved trivially on the command line with a few pipes and some awk.  Perl is fast and easy to write, and it’s perfect for text editing and tying other bits of code together.  You could even use it as a flexible wrapper for bash commands and scripts, as I have mentioned before.

However, recently Perl let me down and in a big way.  What I wanted to do was thousands of t-tests on thousands of arrays of numbers, reporting back some basic metrics like p-values and similar.  The hitch was that the input files were enormous text files (several GB).  My language options were either:

  1. R, with its simple, inbuilt t.test() function, should have been great, but the input size was a problem.  Reading the whole input file into memory was a viable approach as I work on systems with a lot of memory, but it’s not particularly fast or smart to do so.  I could also catch up on some R reading and find a way to do the calculations dynamically, but on the other hand…
  2. Perl has excellent IO capabilities and can produce results as the files are read in, making the code almost as fast as files could be read and written.  How very efficient – all I needed was to find a nice t-test module

I figured I could solve the problem faster using Perl, so I went fishing in CPAN (Comprehensive Perl Archive Network), the Perl module repository to pull out a suitable module.  The best match was very promising:

Statistics::TTest

Written a decade ago (July 2003), surely the code was mature and this module had been used in anger again and again?  The script was quickly completed and set to work, only to fail mysteriously with this cryptic error:

Can't take log of -0.0439586 at /usr/local/share/perl/5.14.2/Statistics/Distributions.pm line 325

After checking for obvious causes like stray zero values or NAs, some debugging revealed that Perl was failing because I was attempting a t-test on a pair of arrays of length 4710.  I wasn’t passing in any illegal values, it was failing purely because the input was too big.  There’s no way I’m going to accept that when a desktop PC running R can perform t-tests on arrays orders of magnitude bigger.

I decided to solve the problem in a memory-hungry fashion in R for the time being.  To be fair to Perl, I haven’t attempted to write the same code in Python, but as Python famously has both NumPy and SciPy, which were written specifically to augment the inbuilt maths functions, I’m optimistic that it wouldn’t fail me.
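
For what it’s worth, the “dynamic” R approach mentioned above needn’t be complicated; below is a minimal sketch that reads one row at a time and never holds the whole file in memory.  The file name, the tab-delimited format and the assumption that each row holds the group-1 values followed by the group-2 values (with a known group size) are all hypothetical.

## Stream a huge file of numbers and t-test each row without loading it all
## Assumes each tab-delimited row contains n1 values for group 1 followed by group 2
con=file("bigmatrix.txt","r")
n1=4710 # hypothetical size of the first group
while(length(line<-readLines(con,n=1))>0){
values=as.numeric(strsplit(line,"\t")[[1]])
result=t.test(values[1:n1],values[-(1:n1)])
cat(result$p.value,"\n")
}
close(con)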

Shame on you, Perl.

Open Versus Closed Source: The Right Tool for the Right Job

TL;DR don’t use exclusively open or closed source software, you should use whatever is required for the job.

Windows 8 has just been released and by gum, I like it.  Now is probably a good time to briefly discuss the open versus closed source software debate.  I’m probably unusual in that my life is starkly divided in two when it comes to my interaction with software whether I’m at work or play.

At work, I often want and even need to see the source code.  What sort of input does the program want?  Why exactly is it throwing that error?  Maybe it needs to be compiled differently for whatever particular hardware or version of Linux I’m using.  This is because I often need a deep understanding of what code does, because I’m trying to solve particular problems in particular ways.  I use R a lot and one of many excellent features is that you can type a function name and the source for that function will appear.  It helps you develop and debug your own code much faster.

This means that an open-source environment is essential.  Tinkering with code (and indeed the entire Linux OS) is a very desirable feature.  I can barely imagine the horror of having nothing but pre-compiled closed-source software to work with; it would be an abomination.

However, when I am at home, I spend less time doing things which require that level of control.  If anything, it would be an enormous hindrance – I don’t want to be playing sysadmin on my own time.  I want the following things out of an OS when at home (an incomplete list in no particular order):

  1. Security – yes, Windows is secure.  It’s not 2003 any more
  2. Stability – see above.  I haven’t seen a BSOD for a long, long time
  3. Shininess – Windows 8 is pretty shiny.  Not literally, as “skeuomorphic design” is apparently passé
  4. Software – Windows has all the apps I want and there isn’t another platform that can compete on games
  5. Convenience – I don’t want to worry about hardware compatibility

That’s not to say that I don’t enjoy coding in my spare time.  All I really miss about my nice super-powerful hardware from work when I get home is the command line.  A nice GUI is excellent, but sometimes you want to do things that a GUI just can’t.  I used to feel that way, until I summoned up the courage to learn Windows PowerShell, which is mostly Unix-like commands, anyway.  It’s not bad, if you’ll excuse the fact that you still have to horizontally resize the window manually (i.e. no click and drag) and the utterly terrifying error messages.  Or you could make your life really simple by installing MSYS, which is most of the useful command-line Unix utilities ported to Windows.

Woah, there!
6 full lines? That’s pretty verbose, don’t you think?

I also use Windows at work (Windows 7) and find it to be the most efficient desktop environment for me, but each to their own.  I’m happy provided I can install an X server for Windows (Xming) and a terminal (PuTTY).  Much of the rest of my working life is creating and editing MS Office documents, so why would I want a Linux desktop?  I want to juggle different and ever-changing distros and even window managers (Xfce?  Gnome 3?  KDE?  Unity?) like I want to juggle chainsaws or bake my own bread.  Not at all.

All I’m really saying is that even if you really passionately believe in open source, there is nothing wrong with paying for closed-source software – it doesn’t make you a slave any more than open-sourcing your code makes you a hippie.  You can have your cake and eat it.

GPUs and Bioinformatics

The ICR (note fabulous new branding) organised a half-day  seminar on “GPU Acceleration in Bioinformatics”.  NVIDIA were kind enough to sponsor the event, which meant flying speakers in from as far as Russia and Hong Kong.  Thanks, NVIDIA!

All computers contain at least one processor, the CPU (central processing unit), but many also contain a specialised secondary processor which is geared toward doing the calculations necessary to produce graphics quickly, the GPU (graphics processing unit).  Although these may be embedded on the motherboard of your PC (or even on the same die as the CPU in some cases) for serious amounts of graphical power you’ll need a separate device, a so-called graphics card.

The first question to address is why you would want to use a GPU at all.  In desktop scenarios, this is easy because many users have a GPU in their machine doing almost nothing most of the time.  What if you could make use of all this power?  This is what’s meant by “hardware acceleration”.  Microsoft recently posted a very informative article outlining how they’ve baked this kind of GPU utilisation right into Windows 8 to speed up everything from the obvious (3D game performance) to the seemingly mundane (text rendering).  Even so, the majority of the power of a GPU is still largely unused in a desktop machine when not gaming.

However, the situation in computational biology is not so simple, as most of my compute is performed on Linux clusters, nowhere near a traditional desktop computer, with no GUI and therefore no apparent need for a GPU.  At first glance, you might suspect that bioinformaticians just wanted an excuse to get their funders to furnish them with all-singing-all-dancing gaming rigs, but this isn’t the case at all.  GPUs are exceptionally good at performing huge numbers of similar operations in parallel, and it just so happens that many of the problems we have in biology (a great example being sequence alignment to a reference genome) are exactly this sort of embarrassingly parallelisable problem.

Having established the potential need for GPUs, you need to choose a language: CUDA (NVIDIA-specific) or OpenCL (an open standard).  I’m not a computer scientist, but I am a gamer, and I’ve noted that NVIDIA cards tend to be better (for my purposes) than ATi cards, though I’ve only switched to NVIDIA in the last few years.  This was partly because I noticed that non-gaming applications I use, such as Handbrake (a video transcoder), would be much faster using a CUDA-enabled card.  You might count that as anecdotal evidence that CUDA is more widely supported and used, when even consumers have heard of it.  OpenCL offers hardware-agnostic portability and academics will often vouch for open solutions, but as I have a preference for NVIDIA cards, certain apps I like already run in CUDA, and anything written in OpenCL would run on an NVIDIA card but not vice versa… the case for NVIDIA seems clear.

The talks themselves were very good, with highlights including:

SOAP3-dp.  A new version of the SOAP aligner has just been released.  It looks quite exciting as it now runs on GPUs, running “easy” alignments via two-way Burrows-Wheeler transforms (i.e. it works like BWA) on the GPU and leaving harder reads to the CPU, much like running Stampy in hybrid mode with BWA, which has been my preferred approach.  I guess I need to run some comparisons.  On the other hand, I guess we need to get our GPU server up and running first.  Soon.

Coincidentally, the very next day I received an automated update from Gerton Lunter that a new version of Stampy had been released.  It’s different in two main ways: firstly, it’s now multithreaded, so it should be substantially faster.  Secondly, it no longer runs concurrently with BWA; rather, you align first using BWA and then allow Stampy to realign any poorly aligned reads.  You could run Stampy on any bam file, so maybe we’ll end up using a SOAP3-dp/Stampy hybrid.  Who can say?  I recently had a disagreement with someone who was using Novoalign, which I gather is much like running Stampy/BWA, but costs actual money.  Proprietary hardware and even operating systems?  Fine.  Proprietary aligners?  I think not, at least not without some serious empirical evidence.

Unipro UGENE.  All the way from Novosibirsk in Russia came Yuriy Vaskin to show off the UGENE software.  I’d used it briefly before and have been generally impressed by the sheer number of features that they’ve packed into it – everything from (multiple) multiple alignment tools and a full workflow manager (think Taverna Workbench) to the world’s only Windows-based implementation of the BWA aligner.  I’m very biased toward the command line and scripting while at work, so I’m not sure there’s a place for a GUI-based program in my workflow, but I can certainly see the utility of such software, especially for groups without a dedicated bioinformatician.  My interest is not particularly in using UGENE as a general workbench, but as an enhanced genome browser.  I currently use IGV to view our bam files; users can access data locked away on our Linux servers from their Windows (or Mac, or Linux if we had any of those) desktops because I followed the instructions here (http://www.broadinstitute.org/igv/DataServer); very handy indeed.  Users are shielded from a crash-course in SSHing into a Linux server and the data is shielded from the users, who are only human (like me) and bound to accidentally delete or corrupt it eventually.  However, IGV sometimes feels both a little ugly and a bit bare-bones.  I guess a short attention span and the urge for shiny novelty are common issues in the 21st century.  In the age of buttery-smooth apps running at 60fps on our smartphones, Java-Swing desktop apps do look a bit ugly, but it would be foolish to judge software on “shininess” over utility, particularly when IGV can be run natively on any platform with a Java Virtual Machine.

I was disappointed to discover that on my (fairly wimpy) Intel Core2 Duo workstation with 8GB of RAM, UGENE took several hours to index a 14GB BAM file, when I can view the same file near-instantly using IGV and a standard bai index file.  Despite this, there are plenty of reasons to recommend UGENE; it’s not a great genome browser, but I feel it’s important to suffix that statement with the word “yet”.

Finally, there was some debate over why you should invest so much money in the enterprise-standard NVIDIA Tesla cards when you could feasibly use their much cheaper, consumer-focussed (i.e. gaming) GeForce cards.  There are several good reasons.  Consumer cards are indeed cheaper because they are produced in enormous quantities and built to extract the maximum profit, meaning that they’re clocked at higher speeds, built from less reliable parts and may even have reduced feature/instruction sets.  Even playing your favourite game for 30+ hours a week with all the dials turned to maximum does not compare to the stress a Tesla card might face when consistently hammered with jobs 24/7/365 in a server.  It’s the same story with all computer hardware; server hard drives are more expensive because they’re more robust and you are of course also paying for support, which you just don’t receive or even want as a consumer.  It’s also worth noting that consumer cards have a different form factor and might not fit into a server in the first place, or may block the airflow across the system in a catastrophic way.  You might build your own compute farm from consumer cards if you were, say, running some sort of Bitcoin mining operation at home, but you’ll really need Tesla cards if you’re doing science or engineering.  I suppose you could do some testing and local development using consumer cards, but when in full production you just need to spend the money.  You do want an accurate, repeatable answer, don’t you?

Thanks to the speakers for making me think, to NVIDIA for sponsoring the event and to Igor Kozin from the ICR’s Scientific Computing Team for organising it.  Maybe some rough results will show up on this blog once I start toying with GPU-accelerated aligners?