Tuesday, May 24, 2016

How to align reads -- sparse index

So, apparently coming up with an aligner for reads -- long, short, or otherwise, is not all that difficult, especially when you have no errors. All you need to do is this:
  1. Sample kmers from your reference and store the locations of where these kmers occur ("anchors").
  2. While streaming your reads in, for every kmer in the read: if kmer is an anchor, add it to the list of anchors. For most reads, one anchor will be enough to correctly compute its location in the reference. For others, you may need to hit other anchors to disambiguate between multiple anchor hits.
  3. Profit.

I wrote some code that does just that: first, sample 20-mers at every 50-th position of human chr20. Then simulate the error-free reads by drawing N random numbers from [0, |C| - |R|+1] with |C| being the chromosome length and |R| being the read length. The most exciting things about aligning such reads is: for reads that hit multiple anchors, how can we efficiently find pair(s) of anchors that would be consistent with where anchors occur in the read? Some confusing situations may occur (green: reference, orange: anchors, blue: reads) that you'd want to resolve -- a quick solution to the rescue.



Now to some results:

Dataset% reads w/ a unique correct mapping% reads w/ at last one correct mapping
1Mln reads, 100bp89.1998.28
300K reads, 300bp91.7298.35
10K reads, 1000bp93.4298.29

Pretty good overall (UPD: bwa does superbly on this simple task by getting 100% of the alignments correctly for all three datasets). Now, more interesting points are: how do errors affect this mapping rate? My intuition tells me that it would be more of a problem for shorter reads (less chances to hit the anchor w/ error-free sequence) than for the long reads. Can you select anchors in a more reasonable way? Ideally, you would want less anchors to make such an index smaller, but at the same time you'd want to guarantee that a read of a given length hits at least one anchor. And what can you do for reads that hit no anchors at all (the anchor may be near)? I have some ideas, but will have to see if it works out and how that affects performance.
UPD2: this approach is very similar to how BLAST and BLAT get their initial seeds before performing full SW.