SAM and BAM files are considerably more complex than they first look. The SAM format specification hides the complexity well and it is easy to deceive yourself into doing the wrong thing. That document is worth reading, but I will try to simplify it further here. SAM files (Sequence Alignment Map) contain short DNA sequences aligned to a reference genome. BAM files are simply the binary version, which means that they are compressed and about 1/3rd the size. A SAM file is just a text file and can be read by anything; a BAM file can only be accessed by using particular software; e.g. samtools.
There are two parts to a SAM file; the header and the body. The header contains information about the file, usually the following fields;
- @HD line tells you the version number of SAM specification and whether or not the file is sorted. Sorting by genomic coordinates is the most common type
- @SQ lines tell you which reference sequences have been used to align the sequence again. There will be one line for every contig (e.g. a chromosome), which the sequence name and length (e.g. SN:1 LN:249250621)
- @RG line. The Read Group line contains tags detailing the origins of the input data
- @PG lines tell you programs and commands used to create the file
SAM headers are not immutable, so don’t put too must trust in them. You can see the header of a BAM file by using this samtools command:
samtools view -H bamfile.bam
The body of the file is where the aligned data is stored. There is one row for every read. Therefore, if you’re looking at paired end data (e.g. Illumina sequencing data) there will be two rows for every pair; one for the forward read and one for the reverse.
There are at least eleven columns in each row. There may also be further tags appended at the end of these to denote extra data, like mapping scores, RG tags and so forth.
- QNAME: the name of the query sequence e.g. a read name like HWI-ST886:247:D2G2MACXX:3:2102:4127:4699
- FLAG: a bitwise flag. You’ll want to read Understanding BAM files; part 2 for an explanation
- RNAME: the name of the reference contig the sequence is aligned to e.g. 1 (chromosome 1)
- POS: the position on the reference contig that the alignment starts at e.g. 123456 (in this case, the 123,456th base of chromosome 1)
- MAPQ: the mapping quality of the read; maximum value is usually 60. This number is Phred scaled, meaning that they’re logarithmically scaled. For example:
- MAPQ of 10 means a 1/10 chance that the mapping is wrong
- MAPQ of 20 means a 1/100 chance that the mapping is wron
- MAPQ of 60 means a 1/1,000,000 chance that the mapping is wrong
- CIGAR: this string tells us how to match the query sequence to the reference sequence. In the simplest case, say the read is 100 bases along and matches perfectly; CIGAR would be 100M. However, you can encode mismatches, insertions and deletions and it quickly becomes very complicated
- RNEXT: the name of the reference contig that the other read in the pair aligns to. If identical to this read, an equals sign (“=”) is used
- PNEXT: the position on the reference contig that the other read in the pair aligns to. This depends on the size of the DNA fragments in the sequencing library and will usually be a few hundred base pairs away (see TLEN, below)
- TLEN: the length of the template fragment. This is the distance from the leftmost base of the first read to the rightmost base of the second read of a pair. This value is signed (+/-), so you can infer the orientation of the reads, too
- SEQ: the DNA sequence of the query sequence. This will be identical to the sequence in the FASTQ file that was aligned to the reference genome
- QUAL: the corresponding base quality scores of the SEQ. This will be identical to the scores in the original FASTQ file. Note that it’s encoded as ASCII+33, so all the letters and punctuation symbols actually correspond to regular integers
Next time, we’ll look at the bitwise flag in more detail, as this is a major source of confusion.