Now that you have the basics from part 1, let’s look at something more complicated: the second column of the main body of BAM files, the bitwise flag.
This isn’t simple. Both the samtools manual and the SAM specification ostensibly explain what the flag is, but do so in a really opaque way, a sort of “if you have to ask, you’ll never know” situation. The bitwise flag looks like an ordinary number, but it isn’t meant to be read as one. You might see “99” or “147” and think it’s a score of some kind, but it’s actually the decimal form of a binary number in which each bit answers one yes/no question about the read. The data stored in this flag is crucial, including things like whether the read is a duplicate or fails quality controls.
The primary effect that this has is that you can kiss goodbye to doing anything complex with BAM files using standard command-line text parsing tools. If you want to really understand what is encoded in your BAM files, you’re going to have to get serious and become familiar with the samtools API, or some interface to it. More on that in the next part.
If you want a specific flag explained, you can use this handy calculator. But where are those values coming from?
With the SAM bitwise flag, there are 12 properties, each of which is TRUE or FALSE, 1 or 0. This gives you a string of 12 binary digits (bits). That 12-bit binary number is written as a single decimal number, as it takes up less space (the SAM specification lists the same values in hexadecimal, so 16 appears there as 0x10). Below is a summary; the first value is the binary representation, the second is the decimal and the text is what it represents:
- 000000000001 : 1 : read paired
- 000000000010 : 2 : read mapped in proper pair
- 000000000100 : 4 : read unmapped
- 000000001000 : 8 : mate unmapped
- 000000010000 : 16 : read reverse strand
- 000000100000 : 32 : mate reverse strand
- 000001000000 : 64 : first in pair
- 000010000000 : 128 : second in pair
- 000100000000 : 256 : not primary alignment
- 001000000000 : 512 : read fails platform/vendor quality checks
- 010000000000 : 1024 : read is PCR or optical duplicate
- 100000000000 : 2048 : supplementary alignment
To arrive at the final value, we just add the values of the TRUE properties together; you can see how easy this is when the binary values are stacked vertically. Any FALSE values are ignored, as they are zero.
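The addition can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how samtools does it; the names `FLAG_VALUES` and `build_flag` are made up for this example, and the property labels are copied from the list above.

```python
# Decimal value of each SAM flag bit, as listed in the summary above.
# FLAG_VALUES and build_flag are hypothetical names used only for this sketch.
FLAG_VALUES = {
    "read paired": 1,
    "read mapped in proper pair": 2,
    "read unmapped": 4,
    "mate unmapped": 8,
    "read reverse strand": 16,
    "mate reverse strand": 32,
    "first in pair": 64,
    "second in pair": 128,
    "not primary alignment": 256,
    "read fails platform/vendor quality checks": 512,
    "read is PCR or optical duplicate": 1024,
    "supplementary alignment": 2048,
}


def build_flag(true_properties):
    """Combine the values of the properties that are TRUE into one flag."""
    flag = 0
    for name in true_properties:
        # Every value is a distinct power of two, so OR-ing gives the same
        # result as adding, and stays correct if a name is listed twice.
        flag |= FLAG_VALUES[name]
    return flag


flag = build_flag([
    "read paired",
    "read mapped in proper pair",
    "mate reverse strand",
    "first in pair",
])
print(flag)                  # 99
print(format(flag, "012b"))  # 000001100011
```

The second `print` shows the same flag as the 12-digit binary string, which lines up with the four TRUE rows in the list.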
For example, a read that is paired (1), in a proper pair (2), with a mate on the reverse strand (32) and is the first in a pair (64) has a flag of 99 because 1 + 2 + 32 + 64 = 99.
The corresponding read in the pair would be paired (1), in a proper pair (2), on the reverse strand (16) and the second in a pair (128), so it has a flag of 147 because 1 + 2 + 16 + 128 = 147.
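Going the other way, from a flag back to its properties, means checking which bits are set. Here is a minimal sketch of that, assuming the bit order given in the list above; `FLAG_NAMES` and `decode_flag` are names invented for this example.

```python
# Property names in bit order: index 0 is the lowest bit (decimal value 1).
# FLAG_NAMES and decode_flag are hypothetical names used only for this sketch.
FLAG_NAMES = [
    "read paired",
    "read mapped in proper pair",
    "read unmapped",
    "mate unmapped",
    "read reverse strand",
    "mate reverse strand",
    "first in pair",
    "second in pair",
    "not primary alignment",
    "read fails platform/vendor quality checks",
    "read is PCR or optical duplicate",
    "supplementary alignment",
]


def decode_flag(flag):
    """Return the names of the properties whose bit is set in the flag."""
    # 1 << bit is the decimal value of that position: 1, 2, 4, 8, ...
    return [name for bit, name in enumerate(FLAG_NAMES) if flag & (1 << bit)]


for flag in (99, 147):
    print(flag, decode_flag(flag))
```

For 99 this lists read paired, read mapped in proper pair, mate reverse strand and first in pair; for 147 it lists read paired, read mapped in proper pair, read reverse strand and second in pair.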
You should see a lot of pairs of reads with 99 and 147, and also 83 and 163, which is the same but with the reads in the opposite orientation. I would advise against trying to interpret SAM flags as plain decimal values; instead, adopt a nice programmatic way of accessing them, which I’ll write about next time.
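To see why those four values pair up the way they do, it is enough to test three individual bits with a bitwise AND. This is a rough sketch rather than a recommended interface; the constant names and the `orientation` helper are made up here, and the masks are the hexadecimal values from the SAM specification (0x10 = 16, 0x20 = 32, 0x40 = 64).

```python
# Hypothetical constant names; the mask values match the list of flag bits.
READ_REVERSE = 0x10   # 16: read reverse strand
MATE_REVERSE = 0x20   # 32: mate reverse strand
FIRST_IN_PAIR = 0x40  # 64: first in pair


def orientation(flag):
    """Describe which read of the pair this is and the strand of read and mate."""
    which = "first" if flag & FIRST_IN_PAIR else "second"
    read = "reverse" if flag & READ_REVERSE else "forward"
    mate = "reverse" if flag & MATE_REVERSE else "forward"
    return f"{which} in pair, read {read}, mate {mate}"


for flag in (99, 147, 83, 163):
    print(flag, orientation(flag))
```

Flags 99 and 147 describe a first read on the forward strand with its mate on the reverse strand; 83 and 163 describe the mirror image, with the first read on the reverse strand.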