Understanding BAM files; part 2

Now that you have the basics after part 1, let’s look at something more complicated; the second column of the main body of BAM files, the bitwise flag.

This isn’t simple.  Both the samtools manual and even the SAM specification ostensibly explain what the flag is, but do so in a really opaque way, a sort of “if you have to ask, you’ll never know” situation.  The bitwise flag looks like an integer, but it’s not.  You might see “99” or “147” and think it’s a score of some kind, but it’s actually a binary encoding of a hexadecimal value.  The data stored in this flag is crucial, including things like if the read is a duplicate or doesn’t pass quality controls.

The primary effect that this has is that you can kiss goodbye to doing anything complex  with BAM files using standard command-line text parsing tools.  If you want to really understand what is encoded in your BAM files, you’re going to have to get serious and become familiar with the samtools API, or some interface to it.  More on that in the next part.

If you want a specific flag explained, you can use this handy calculator.  But where are those values coming from?

With the SAM bitwise flag, there are 12 properties, which are TRUE or FALSE, 1 or 0.  This gives you a string of 12 digits, which can either be 1 or 0.  This 12-digit binary encoding of a hexadecimal number is converted to a decimal number, as it takes up less space.  Below is a summary; the first value is the binary hexadecimal, the second is the decimal and the text is what it represents:

  1. 000000000001 : 1 : read paired
  2. 000000000010 : 2 : read mapped in proper pair
  3. 000000000100 : 4 : read unmapped
  4. 000000001000 : 8 : mate unmapped
  5. 000000010000 : 16 : read reverse strand
  6. 000000100000 : 32 : mate reverse strand
  7. 000001000000 : 64 : first in pair
  8. 000010000000 : 128 : second in pair
  9. 000100000000 : 256 : not primary alignment
  10. 001000000000 : 512 : read fails platform/vendor quality checks
  11. 010000000000 : 1024 : read is PCR or optical duplicate
  12. 100000000000 : 2048 : supplementary alignment

To arrive at the final value, we just add the TRUE values together; you can see how easy this is when the binary hexadecimal values are stacked vertically.  Any FALSE values are ignored, as they are zero.

For example; a read that is paired (1), in a proper pair (2), with a mate on the reverse strand (32) and is the first in a pair (64) has a flag of 99 because:


The corresponding read in the pair would be a read that is paired (1), in a proper pair (2), is on the reverse strand (16) and is the second in a pair (128) has a flag of 147 because:


You should see a lot of pairs of reads with 99 and 147, and also 83 and 163, which is the same but with the reads in the opposite orientation.  I would advise against trying to parse SAM flags as decimal values and adopt a nice programmatic way of accessing them, which I’ll write about next time.

Understanding BAM files; part 1

SAM and BAM files are considerably more complex than they first look.  The SAM format specification hides the complexity well and it is easy to deceive yourself into doing the wrong thing.  That document is worth reading, but I will try to simplify it further here.  SAM files (Sequence Alignment Map) contain short DNA sequences aligned to a reference genome.  BAM files are simply the binary version, which means that they are compressed and about 1/3rd the size.  A SAM file is just a text file and can be read by anything; a BAM file can only be accessed by using particular software; e.g. samtools.

There are two parts to a SAM file; the header and the body.  The header contains information about the file, usually the following fields;

  1.  @HD line tells you the version number of SAM specification and whether or not the file is sorted.  Sorting by genomic coordinates is the most common type
  2. @SQ lines tell you which reference sequences have been used to align the sequence again.  There will be one line for every contig (e.g. a chromosome), which the sequence name and length (e.g. SN:1 LN:249250621)
  3. @RG line.  The Read Group line contains tags detailing the origins of the input data
  4. @PG lines tell you programs and commands used to create the file

SAM headers are not immutable, so don’t put too must trust in them.  You can see the header of a BAM file by using this samtools command:

samtools view -H bamfile.bam

The body of the file is where the aligned data is stored.  There is one row for every read.  Therefore, if you’re looking at paired end data (e.g. Illumina sequencing data) there will be two rows for every pair; one for the forward read and one for the reverse.

There are at least eleven columns in each row.  There may also be further tags appended at the end of these to denote extra data, like mapping scores, RG tags and so forth.

  1.  QNAME: the name of the query sequence e.g. a read name like HWI-ST886:247:D2G2MACXX:3:2102:4127:4699
  2. FLAG: a bitwise flag.  You’ll want to read Understanding BAM files; part 2 for an explanation
  3. RNAME: the name of the reference contig the sequence is aligned to e.g. 1 (chromosome 1)
  4. POS: the position on the reference contig that the alignment starts at e.g. 123456 (in this case, the 123,456th base of chromosome 1)
  5. MAPQ: the mapping quality of the read; maximum value is usually 60.  This number is Phred scaled, meaning that they’re logarithmically scaled.  For example:
    1. MAPQ of 10 means a 1/10 chance that the mapping is wrong
    2. MAPQ of 20 means a 1/100 chance that the mapping is wron
    3. MAPQ of 60 means a 1/1,000,000 chance that the mapping is wrong
  6. CIGAR: this string tells us how to match the query sequence to the reference sequence.  In the simplest case, say the read is 100 bases along and matches perfectly; CIGAR would be 100M.  However, you can encode mismatches, insertions and deletions and it quickly becomes very complicated
  7. RNEXT: the name of the reference contig that the other read in the pair aligns to.  If identical to this read, an equals sign (“=”) is used
  8. PNEXT: the position on the reference contig that the other read in the pair aligns to.  This depends on the size of the DNA fragments in the sequencing library and will usually be a few hundred base pairs away (see TLEN, below)
  9. TLEN: the length of the template fragment.  This is the distance from the leftmost base of the first read to the rightmost base of the second read of a pair.  This value is signed (+/-), so you can infer the orientation of the reads, too
  10. SEQ: the DNA sequence of the query sequence.  This will be identical to the sequence in the FASTQ file that was aligned to the reference genome
  11. QUAL: the corresponding base quality scores of the SEQ.  This will be identical to the scores in the original FASTQ file.  Note that it’s encoded as ASCII+33, so all the letters and punctuation symbols actually correspond to regular integers

Next time, we’ll look at the bitwise flag in more detail, as this is a major source of confusion.

MuTect – a pleasant surprise

I was sceptical (not a typo, I’m British).  Who wouldn’t be?  I’d invested literally hundreds of hours into my sequencing pipeline which depended almost entirely around GATK and even recently re-tooled to accommodate the sweeping changes brought by GATK 2.0 and it’s great, truly.  However….

I had been enviously eyeing up MuTect for quite some time, as it was always built from the ground-up to serve one of the purposes I am currently knee-deep in; finding mutations in tumour-normal pairs.  To be clear, I mean comparing sequencing data from a tumour samples with a normal sample from the same patient.  On the other hand, I gather GATK has its roots in GWAS studies and has always been more population-focussed.

The problem we’re all wrestling with currently – which, by it’s very nature, is never going away – is intraclonal heterogeneity, the idea that cancers are not monocultures of billions of copies of the same cell, but are highly heterogeneous mixes of potentially thousands (millions?) of individual clones, all of which are descended from (probably) one originator cell.  These clones compete with one another for resources in their niche and are effectively locked in an arms-race with one another, as well as with the host body.  Cancer patients are not facing a single entity, but a legion of cells that reinvent, mutate and transfigure themselves in the best Darwinian tradition.

As it stands, sequencing data isn’t a great read-out, as it’s produced from DNA pooled from thousands of tumour cells.  Even if you use a low-starting-amount protocol and had 50 ng of tumour DNA with no amplification steps, that’s still:

50 ng / 6 pg (DNA content of a single diploid cell) = >8000 cells

I guess if these cells are from a monocellular and therefore easier to purify population of cells (as is the case for myeloma – you can isolate tumour cells by selecting for CD138+ cells), that’s not terrible news; but what if your sample is a solid tumour?  Who knows what kind of structural heterogeneity you need to contend with?

Still, your sample is “bulk DNA” and is a sampling of potentially thousands of separate genomes compressed into a single readout.  I’m sure I’ll post more on the subject at a later date, but for now let’s say that it’s a definite issue.

In a homogeneous population of cells, SNVs (single nucleotide variants) will be present in all cells, but this will not be the case in a heterogeneous population of cancer cells.  MuTect’s real power lies is its ability to detect these subclonal mutations.  The detection is limited mostly by the depth of the sequencing, with quickly diminishing returns – supplementary table 1 in the MuTect paper does a good job of illustrating this, particularly if you turn on conditional formatting in Excel to turn the table of values into a heatmap.  Everybody loves conditional formatting!

MuTect’s output is not without issues, but I’ll discuss further filtering of the output next time.

Finally, it’s worth noting that YES, MuTect does output in VCF format… it’s just undocumented.  If you ask very nicely in their forums, the authors will help you.

It’s not me, it’s you – don’t use Perl for math(s)

After a good five years with Perl, it finally crossed the line and forced me to start moving over to Python.  This is a short cautionary tale with two messages:

  1. No language is suitable for every job
  2. Languages are only as good as their libraries/modules/packages/etc

I like Perl a great deal; it’s ideal for almost any task that can’t be achieved trivially on the command line with a few pipes and some awk.  Perl is fast and easy to write, and it’s perfect for text editing and tying other bits of code together.  You could even use it as a flexible wrapper for bash commands and scripts, as I have mentioned before.

However, recently Perl let me down and in a big way.  What I wanted to do was thousands of t-tests on thousands of arrays of numbers, reporting back some basic metrics like p-values and similar.  The hitch was that the input files were enormous text files (several GB).  My language options were either:

  1. R, with a simple and inbuilt t.test() function should be great, but the input size was a problem.  Reading the whole input file into memory was a viable approach as I work on systems with a lot of memory, but it’s not particularly fast or smart to do so.  I could also catch up on some R reading and find a way to do the calculations dynamically, but on the other hand…
  2. Perl has excellent IO capabilities and can produce results as the files are read in, making the code almost as fast as files could be read and written.  How very efficient – all I needed was to find a nice t-test module

I figured I could solve the problem faster using Perl, so I went fishing in CPAN (Comprehensive Perl Archive Network), the Perl module repository to pull out a suitable module.  The best match was very promising:


Written a decade ago (July 2003), surely the code was mature and this module had been used in anger again and again?  The script was quickly completed and set to work, only to fail mysteriously with this this cryptic error:

Can't take log of -0.0439586 at /usr/local/share/perl/5.14.2/Statistics/Distributions.pm line 325

After checking for obvious causes like stray zero values or NAs, some debugging revealed that Perl was failing because I was attempting a t-test on a pair of arrays of length 4710.  I wasn’t passing in any illegal values, it was failing purely because the input was too big.  There’s no way I’m going to accept that when a desktop PC running R can perform t-tests on arrays orders of magnitude bigger.

I decided to solve the problem in a memory-hungry fashion in R for the time being and to be fair to Perl, haven’t attempted to write the same code in Python, but as Python famously has both NumPy and SciPy which are written specifically to augment the inbuilt maths functions, I’m optimistic that it won’t fail me.

Shame on you, Perl.

Open Versus Closed Source: The Right Tool for the Right Job

TL;DR don’t use exclusively open or closed source software, you should use whatever is required for the job.

Windows 8 has just been released and by gum, I like it.  Now is probably a good time to briefly discuss the open versus closed source software debate.  I’m probably unusual in that my life is starkly divided in two when it comes to my interaction with software whether I’m at work or play.

At work, I often want and even need to see the source code.  What sort of input does the program want?  Why exactly is it throwing that error?  Maybe it needs to be compiled differently for whatever particular hardware or version of Linux I’m using.  This is because I often need a deep understanding of what code does, because I’m trying to solve particular problems in particular ways.  I use R a lot and one of many excellent features is that you can type a function name and the source for that function will appear.  It helps you develop and debug your own code much faster.

This means that an open-source environment is essential.  Tinkering with code (and indeed the entire Linux OS) is a very desirable feature.  I can barely imagine the horror of having nothing but pre-compiled closed-source software to work with; it would be an abomination.

However: when I am at home, I spend less time doing things which require that level of control.  If anything, it would be an enormous hindrance – I don’t want to be playing sysadmin on my own time.  I want the following things out of an OS when at home (an incomplete list in no particular order):

  1. Security – yes, Windows is secure.  It’s not 2003 any more
  2. Stability – see above.  I haven’t seen a BSOD for a long, long time
  3. Shininess – Windows 8 is pretty shiny.  Not literally, as “skeuomorphic design” is apparently passé
  4. Software – Windows has all the apps I want and there isn’t another platform that can compete on games
  5. Convenience – I don’t want to worry about hardware compatibility

That’s not to say that I don’t enjoy coding in my spare time.  All I really miss about my nice super-powerful hardware from work when I get home is the command line.  A nice GUI is excellent, but sometimes you want to do things that a GUI just can’t.  I used to feel that way, until I summed up the courage to learn Windows PowerShell, which is mostly Unix-like commands, anyway.  It’s not bad, if you’ll excuse the fact that you still have to horizontally resize the window manually (i.e. no click and drag) and the utterly terrifying error messages.  Or you could make your life really simple by installing MSYS which is most of the useful command-line Unix utilities ported to Windows.

Woah, there!

6 full lines? That’s pretty verbose, don’t you think?

I also use Windows at work (Windows 7) and find it to be the most efficient desktop environment for me, but each to their own.  I’m happy provided I can install an X-windows client (Xming) and a terminal (PuTTY).  Much of the rest of my working life is creating and editing MS Office documents, so why would I want a Linux desktop?  I want to juggle different and ever-changing distros and even window managers (Xfce?  Gnome 3?  KDE?  Unity?) like I want to juggle chainsaws or bake my own bread.  Not at all.

All I’m really saying is that even if you really passionately believe in open-source, there is nothing wrong with paying for closed-source software – it doesn’t make you a slave anymore than open-sourcing your code makes you a hippie.  You can have your cake and eat it.