MuTect – a pleasant surprise

I was sceptical (not a typo, I’m British).  Who wouldn’t be?  I’d invested literally hundreds of hours into my sequencing pipeline which depended almost entirely around GATK and even recently re-tooled to accommodate the sweeping changes brought by GATK 2.0 and it’s great, truly.  However….

I had been enviously eyeing up MuTect for quite some time, as it was always built from the ground-up to serve one of the purposes I am currently knee-deep in; finding mutations in tumour-normal pairs.  To be clear, I mean comparing sequencing data from a tumour samples with a normal sample from the same patient.  On the other hand, I gather GATK has its roots in GWAS studies and has always been more population-focussed.

The problem we’re all wrestling with currently – which, by it’s very nature, is never going away – is intraclonal heterogeneity, the idea that cancers are not monocultures of billions of copies of the same cell, but are highly heterogeneous mixes of potentially thousands (millions?) of individual clones, all of which are descended from (probably) one originator cell.  These clones compete with one another for resources in their niche and are effectively locked in an arms-race with one another, as well as with the host body.  Cancer patients are not facing a single entity, but a legion of cells that reinvent, mutate and transfigure themselves in the best Darwinian tradition.

As it stands, sequencing data isn’t a great read-out, as it’s produced from DNA pooled from thousands of tumour cells.  Even if you use a low-starting-amount protocol and had 50 ng of tumour DNA with no amplification steps, that’s still:

50 ng / 6 pg (DNA content of a single diploid cell) = >8000 cells

I guess if these cells are from a monocellular and therefore easier to purify population of cells (as is the case for myeloma – you can isolate tumour cells by selecting for CD138+ cells), that’s not terrible news; but what if your sample is a solid tumour?  Who knows what kind of structural heterogeneity you need to contend with?

Still, your sample is “bulk DNA” and is a sampling of potentially thousands of separate genomes compressed into a single readout.  I’m sure I’ll post more on the subject at a later date, but for now let’s say that it’s a definite issue.

In a homogeneous population of cells, SNVs (single nucleotide variants) will be present in all cells, but this will not be the case in a heterogeneous population of cancer cells.  MuTect’s real power lies is its ability to detect these subclonal mutations.  The detection is limited mostly by the depth of the sequencing, with quickly diminishing returns – supplementary table 1 in the MuTect paper does a good job of illustrating this, particularly if you turn on conditional formatting in Excel to turn the table of values into a heatmap.  Everybody loves conditional formatting!

MuTect’s output is not without issues, but I’ll discuss further filtering of the output next time.

Finally, it’s worth noting that YES, MuTect does output in VCF format… it’s just undocumented.  If you ask very nicely in their forums, the authors will help you.