Friday, February 12, 2010

The mathematics of color space sequencing

‘Next-generation’ sequencing technologies are dramatically transforming medicine and the life sciences. ABI SOLiD machines are very popular because they can sequence an order of magnitude more nucleotides than most competing machines (454, Solexa, etc.) for the same cost. However, bioinformaticians analyzing SOLiD data have to confront the puzzling world of color space. Hopefully, the following discussion will clear up some of the confusion. We will only tackle the mathematics behind color space sequencing here, and leave the chemistry of SOLiD sequencing for another day.

Let us say that we are interested in sequencing a DNA segment. A typical sequencing machine will sequentially identify every nucleotide as A, C, G, or T and report a long stream of ATGGTGGTTGTTACTGCGCGTGGGAACCCCCTG… etc. Instead, the SOLiD machine reports transitions between neighboring nucleotide pairs. When one looks at pairs of neighboring nucleotides, the number of reportable combinations increases from 4 (A, C, G, T) to 16 (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT). To simplify reporting, the SOLiD machine elegantly reduces the possible combinations from 16 to 4 based on the following table:



We will soon elaborate on what kind of thought went into choosing the numbers in the above table. We will also show that this reduction of complexity comes at a cost. First, let us explain how color space works.

How to convert sequences to color space?

Let us work through a specific example (ATGGTGGTTGTTA).
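
As a rough illustration of the conversion, here is a minimal Python sketch. It assumes the commonly published SOLiD two-base encoding (color 0 for AA, CC, GG, TT; 1 for AC, CA, GT, TG; 2 for AG, GA, CT, TC; 3 for AT, TA, CG, GC), which can be reproduced by giving the bases the 2-bit codes A=0, C=1, G=2, T=3 and taking the XOR of each neighboring pair. The function names and the choice to keep the first base in front of the colors are assumptions made for this example, not something taken from the SOLiD software.

```python
# Assumed 2-bit codes for the bases. XOR-ing the codes of two neighboring
# bases gives the color of that pair under the encoding assumed above.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def to_color_space(seq):
    """Return the first base followed by one color (0-3) per neighboring pair.

    Expects a non-empty string containing only A, C, G, T.
    """
    seq = seq.upper()
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def from_color_space(read):
    """Rebuild the bases from a known first base and the colors that follow."""
    bases = [read[0].upper()]
    for color in read[1:]:
        # XOR is its own inverse: previous base XOR color gives the next base.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(color)])
    return "".join(bases)


encoded = to_color_space("ATGGTGGTTGTTA")
print(encoded)                    # A310110101103
print(from_color_space(encoded))  # ATGGTGGTTGTTA
```

The 13 bases yield 12 colors, one per neighboring pair. Decoding needs the first base as an anchor, since every later base is recovered from the one before it.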
