Understanding BAM files; part 2

Now that you have the basics after part 1, let’s look at something more complicated; the second column of the main body of BAM files, the bitwise flag.

This isn’t simple.  Both the samtools manual and even the SAM specification ostensibly explain what the flag is, but do so in a really opaque way, a sort of “if you have to ask, you’ll never know” situation.  The bitwise flag looks like an integer, but it’s not.  You might see “99” or “147” and think it’s a score of some kind, but it’s actually a binary encoding of a hexadecimal value.  The data stored in this flag is crucial, including things like if the read is a duplicate or doesn’t pass quality controls.

The primary effect that this has is that you can kiss goodbye to doing anything complex  with BAM files using standard command-line text parsing tools.  If you want to really understand what is encoded in your BAM files, you’re going to have to get serious and become familiar with the samtools API, or some interface to it.  More on that in the next part.

If you want a specific flag explained, you can use this handy calculator.  But where are those values coming from?

With the SAM bitwise flag, there are 12 properties, which are TRUE or FALSE, 1 or 0.  This gives you a string of 12 digits, which can either be 1 or 0.  This 12-digit binary encoding of a hexadecimal number is converted to a decimal number, as it takes up less space.  Below is a summary; the first value is the binary hexadecimal, the second is the decimal and the text is what it represents:

  1. 000000000001 : 1 : read paired
  2. 000000000010 : 2 : read mapped in proper pair
  3. 000000000100 : 4 : read unmapped
  4. 000000001000 : 8 : mate unmapped
  5. 000000010000 : 16 : read reverse strand
  6. 000000100000 : 32 : mate reverse strand
  7. 000001000000 : 64 : first in pair
  8. 000010000000 : 128 : second in pair
  9. 000100000000 : 256 : not primary alignment
  10. 001000000000 : 512 : read fails platform/vendor quality checks
  11. 010000000000 : 1024 : read is PCR or optical duplicate
  12. 100000000000 : 2048 : supplementary alignment

To arrive at the final value, we just add the TRUE values together; you can see how easy this is when the binary hexadecimal values are stacked vertically.  Any FALSE values are ignored, as they are zero.

For example; a read that is paired (1), in a proper pair (2), with a mate on the reverse strand (32) and is the first in a pair (64) has a flag of 99 because:

1+2+32+64=99

The corresponding read in the pair would be a read that is paired (1), in a proper pair (2), is on the reverse strand (16) and is the second in a pair (128) has a flag of 147 because:

1+2+16+128=147

You should see a lot of pairs of reads with 99 and 147, and also 83 and 163, which is the same but with the reads in the opposite orientation.  I would advise against trying to parse SAM flags as decimal values and adopt a nice programmatic way of accessing them, which I’ll write about next time.

Understanding BAM files; part 1

SAM and BAM files are considerably more complex than they first look.  The SAM format specification hides the complexity well and it is easy to deceive yourself into doing the wrong thing.  That document is worth reading, but I will try to simplify it further here.  SAM files (Sequence Alignment Map) contain short DNA sequences aligned to a reference genome.  BAM files are simply the binary version, which means that they are compressed and about 1/3rd the size.  A SAM file is just a text file and can be read by anything; a BAM file can only be accessed by using particular software; e.g. samtools.

There are two parts to a SAM file; the header and the body.  The header contains information about the file, usually the following fields;

  1.  @HD line tells you the version number of SAM specification and whether or not the file is sorted.  Sorting by genomic coordinates is the most common type
  2. @SQ lines tell you which reference sequences have been used to align the sequence again.  There will be one line for every contig (e.g. a chromosome), which the sequence name and length (e.g. SN:1 LN:249250621)
  3. @RG line.  The Read Group line contains tags detailing the origins of the input data
  4. @PG lines tell you programs and commands used to create the file

SAM headers are not immutable, so don’t put too must trust in them.  You can see the header of a BAM file by using this samtools command:

samtools view -H bamfile.bam

The body of the file is where the aligned data is stored.  There is one row for every read.  Therefore, if you’re looking at paired end data (e.g. Illumina sequencing data) there will be two rows for every pair; one for the forward read and one for the reverse.

There are at least eleven columns in each row.  There may also be further tags appended at the end of these to denote extra data, like mapping scores, RG tags and so forth.

  1.  QNAME: the name of the query sequence e.g. a read name like HWI-ST886:247:D2G2MACXX:3:2102:4127:4699
  2. FLAG: a bitwise flag.  You’ll want to read Understanding BAM files; part 2 for an explanation
  3. RNAME: the name of the reference contig the sequence is aligned to e.g. 1 (chromosome 1)
  4. POS: the position on the reference contig that the alignment starts at e.g. 123456 (in this case, the 123,456th base of chromosome 1)
  5. MAPQ: the mapping quality of the read; maximum value is usually 60.  This number is Phred scaled, meaning that they’re logarithmically scaled.  For example:
    1. MAPQ of 10 means a 1/10 chance that the mapping is wrong
    2. MAPQ of 20 means a 1/100 chance that the mapping is wron
    3. MAPQ of 60 means a 1/1,000,000 chance that the mapping is wrong
  6. CIGAR: this string tells us how to match the query sequence to the reference sequence.  In the simplest case, say the read is 100 bases along and matches perfectly; CIGAR would be 100M.  However, you can encode mismatches, insertions and deletions and it quickly becomes very complicated
  7. RNEXT: the name of the reference contig that the other read in the pair aligns to.  If identical to this read, an equals sign (“=”) is used
  8. PNEXT: the position on the reference contig that the other read in the pair aligns to.  This depends on the size of the DNA fragments in the sequencing library and will usually be a few hundred base pairs away (see TLEN, below)
  9. TLEN: the length of the template fragment.  This is the distance from the leftmost base of the first read to the rightmost base of the second read of a pair.  This value is signed (+/-), so you can infer the orientation of the reads, too
  10. SEQ: the DNA sequence of the query sequence.  This will be identical to the sequence in the FASTQ file that was aligned to the reference genome
  11. QUAL: the corresponding base quality scores of the SEQ.  This will be identical to the scores in the original FASTQ file.  Note that it’s encoded as ASCII+33, so all the letters and punctuation symbols actually correspond to regular integers

Next time, we’ll look at the bitwise flag in more detail, as this is a major source of confusion.