Tuesday, May 24, 2016

How to align reads -- sparse index

So, apparently, coming up with an aligner for reads -- long, short, or otherwise -- is not all that difficult, especially when the reads have no errors. All you need to do is this:
  1. Sample kmers from your reference and store the locations of where these kmers occur ("anchors").
  2. While streaming your reads in, check every kmer in the read: if the kmer is an anchor, record the hit. For most reads, one anchor hit will be enough to correctly compute the read's location in the reference. For others, you may need additional anchors to disambiguate between multiple candidate locations.
  3. Profit.

I wrote some code that does just that: first, sample 20-mers at every 50th position of human chr20. Then simulate error-free reads by drawing N random numbers from [0, |C| - |R|+1], with |C| being the chromosome length and |R| being the read length. The most exciting question about aligning such reads is: for reads that hit multiple anchors, how can we efficiently find the pair(s) of anchors that are consistent with where those anchors occur in the read? Some confusing situations may occur (green: reference, orange: anchors, blue: reads) that you'd want to resolve -- a quick solution to the rescue.
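The steps above can be sketched in a few lines of Python. This is a toy sketch, not the actual code from the post: the k-mer length and sampling stride match the post, but the names, the random toy reference, and the voting scheme for combining anchor hits are all my own filler:

```python
import random

K, STEP = 20, 50  # k-mer length and sampling stride, as in the post

def build_index(ref):
    # step 1: sample a k-mer at every STEP-th reference position
    # and remember every location where that k-mer occurs
    index = {}
    for pos in range(0, len(ref) - K + 1, STEP):
        index.setdefault(ref[pos:pos + K], []).append(pos)
    return index

def align(read, index):
    # step 2: every anchor hit votes for a candidate read start:
    # anchor position minus the k-mer's offset within the read
    candidates = set()
    for offset in range(len(read) - K + 1):
        for pos in index.get(read[offset:offset + K], []):
            candidates.add(pos - offset)
    return candidates

# toy run: one error-free 100bp read drawn from a random reference
random.seed(7)
ref = "".join(random.choice("ACGT") for _ in range(10_000))
index = build_index(ref)
start = random.randrange(len(ref) - 100 + 1)
print(start in align(ref[start:start + 100], index))  # True
```

Note that a 100bp read always spans at least one full anchor here (anchors sit 50bp apart and a 20-mer needs to fit inside the read, 100 >= 50 + 20), which is exactly the index-density guarantee discussed below.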



Now to some results:

Dataset              % reads w/ a unique correct mapping   % reads w/ at least one correct mapping
1Mln reads, 100bp    89.19                                 98.28
300K reads, 300bp    91.72                                 98.35
10K reads, 1000bp    93.42                                 98.29

Pretty good overall (UPD: bwa does superbly on this simple task, getting 100% of the alignments correct for all three datasets). Now, the more interesting questions are: how do errors affect this mapping rate? My intuition tells me that errors would be more of a problem for shorter reads (fewer chances to hit an anchor with error-free sequence) than for long reads. Can you select anchors in a more reasonable way? Ideally, you would want fewer anchors to make the index smaller, but at the same time you'd want to guarantee that a read of a given length hits at least one anchor. And what can you do for reads that hit no anchors at all (an anchor may be nearby)? I have some ideas, but will have to see if they work out and how they affect performance.
UPD2: this approach is very similar to how BLAST and BLAT get their initial seeds before performing a full Smith-Waterman alignment.

Tuesday, December 08, 2015

Aligning long reads

With long reads becoming more popular and accessible, major read aligners responded by providing mapping capabilities that work with long, high-error reads. STAR's Alex Dobin has put out a tool, STARlong, that can map PacBio's Circular Consensus Sequencing (CCS) reads. You would need to install it alongside your regular STAR binary (instructions here). Once you have that, generate an index for your genome and tune some parameters to get the alignments:
# assuming STARlong is installed
# generate an index for your genome:
...
# align reads
STARlong --runThreadN 20 --runMode alignReads --seedPerReadNmax 10000 --genomeDir  --readFilesIn 

BWA's Heng Li has also provided a version of the bwa mapper that can align Oxford Nanopore reads as well as PacBio's long subreads and shorter CCS reads. Here is how he suggests running the alignments:
bwa index ref.fa
bwa mem -x pacbio ref.fa pacbio.fq > aln.sam
bwa mem -x ont2d ref.fa ont-2D.fq > aln.sam

Further discussions on aligning long Nanopore and PacBio reads:
[ Biostars ] [ choosing STAR parameters for long reads ] [ STAR parameters to use with IsoSeq reads ]

Saturday, September 05, 2015

Those pesky author lists

Having recently had to correct a lot of author lists into the BibTeX-compliant format, I wrote a little JavaScript snippet to help me with the task:



This would convert author names like "Jessica A. Hurt, Stacey A. Thibodeau, Andrew S. Hirsh" into "Hurt, Jessica A. and Thibodeau, Stacey A. and Hirsh, Andrew S.". I found this especially useful for those Nature papers with 10+ authors :)
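The original snippet is no longer embedded on this page, but the conversion logic is simple enough to sketch. Here is a Python stand-in for the JavaScript (the function name is made up; it assumes "First M. Last" names separated by commas):

```python
def to_bibtex_authors(names):
    # "First M. Last" -> "Last, First M.", joined with BibTeX's " and "
    fixed = []
    for name in names.split(","):
        tokens = name.strip().split()
        fixed.append(tokens[-1] + ", " + " ".join(tokens[:-1]))
    return " and ".join(fixed)

print(to_bibtex_authors("Jessica A. Hurt, Stacey A. Thibodeau, Andrew S. Hirsh"))
# Hurt, Jessica A. and Thibodeau, Stacey A. and Hirsh, Andrew S.
```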

Tuesday, October 21, 2014

The colors of my life

Inspired by vibrant colors of Indian outfits, I have created a new stylesheet for my MacOS terminal:


Juuust enough contrast for me: the colors are not at all discordant when taken together and definitely brighten up my day, especially against the sulky fall weather.

La-la-la-loves it!

MacOS stylesheet file (exported from the Terminal app)

Thursday, September 11, 2014

Science communication

While there is a lot of training in how to communicate ideas clearly to other scientists, the need for and the specifics of speaking to a more general public are rarely addressed in PhD programs. It may seem that if you can give a clear and concise talk on your latest findings at a conference, then you should easily be able to convey the same information to an audience without all the specialized training you have invested in. When this approach fails, it is easy to blame the audience: "If only they understood science better!", or to claim that the knowledge gap is too wide and the audience should just trust you, the scientist, because you have the training and experience. This is not an example of successful communication, but it can be made into one. John Radzilowicz, a speaker at CMU's Public Communication for Researchers (PCR) seminar, outlines 10 important points for successful communication to a non-scientific audience:

  1. Know your audience (also: rule #1 from English 101). Assess what your audience might already know and move past it -- people will enjoy learning new things rather than covering the same high school material all over again. Are there any big debates in the area you are covering? Be prepared to offer some comment on them if asked; however, it is fine to admit that something is out of your area of expertise.
  2. Understand the goals of your communication. What are you trying to achieve by giving this talk: inspire and wow people? dispel some myths? gather support for formal education? make knowledge more accessible (e.g. explain genetic testing)? Given your goals, try to identify one or two take-home messages and make sure to repeat them several times throughout your presentation.
  3. Report accurate data, but try to make it accessible. An example: how many Sun masses should a star have to become a supernova? The precise answer may be 8.2 - 15.6 Sun masses, but simply saying that "if a star is around 10 Sun masses, it may become a supernova" will make the number easier for your audience to remember -- and it is still within the exact range! Try to explain things in the simplest way; act as a "science translator".
  4. Be fun and engaging. This could be hard for some of us who live in underground labs and see no sunlight, but simply smiling as the audience pours into the room and asking them "How are you doing?" before launching into the talk increases retention of information drastically! A dialog is the best thing that can happen.
  5. Remember that your audience is not a blank slate. They come from a variety of backgrounds and belief systems and have varying levels of knowledge. Ask them what they know -- and try to incorporate that into your talk. Survey their beliefs -- and be respectful of them, even if you do not agree with some of them. This also helps to keep your audience engaged.
  6. Establish trust. Explain how we know what we know. It helps to note which phenomena we can explain and which ones are still a mystery. Do not shy away from facts that were once accepted truth and were later found to be false -- this is the self-correcting power of science that makes it trustworthy. Share the sense of wonder and excitement with your audience -- it could be inspirational.
  7. Be ready to give more than just facts. You might be prepared to talk about your favorite topic, but you may find yourself explaining the scientific process and how experiments are set up and theories tested -- be ready to fill such gaps if the situation demands it. You might need to cover some parts multiple times or go into more detail to get the point across -- it is part of science exploration.
  8. Acknowledge the humanity of science. Science is done by humans, and humans make mistakes. There is plagiarism and fraud, yet science has an amazing self-correcting quality. It may take years, but the truth comes out (how about them arsenic life forms). Science can be used for good or for evil -- by people -- but that does not make the knowledge itself better or worse.
  9. Leave them wanting more -- you will have to leave out some of the details to make the talk more accessible, but mention that "the whole picture is more complex". For the curious, this will be enough to pique their interest.
  10. "Walk the walk". If you want to be better at science communication, you have to start learning more about it: watch documentaries by/about such public science communicators as Carl Sagan, Richard Feynman, and Neil deGrasse Tyson. Learn from the best!
And now I will practice these in conversations with my mom :)





Thursday, June 12, 2014

Plurals in English

If mouse becomes mice, then many a house become hice.
And if a singular for mice is a mouse, then singular for lice is a louse!
And if a singular for mice is a mouse, then singular for many grains of rice is a rouse!

If you have a single ox, then many of them are oxen.
So, if you have a single box, when moving, you use many boxen.

Have you ever been chased by a giggle of kaggle of foxen?

(In collaboration with R. Patro)

Friday, April 18, 2014

How to extract data from a figure - and how good is Basis?

In science, making results reproducible and sharing the data used to generate the plots in a publication is essential. Sometimes there are cases where your eyes and gut are telling you that maybe there is something else in the data, but the data is not available. So what are we to do? We can hack the plots using a little bit of image processing!

For a while, I was using a Basis band to track a few stats about my body as I go through my normal daily routine. It is nice to see how much you slept, the (ir)regular bed time, how your heart rate goes up when you walk up the stairs or run to catch the bus -- these kinds of things. Reading about data liberation efforts for Basis, I stumbled upon an article that compared heart rate readings from Basis with those from a medical-grade device. The experiment was simple: the author wore the Basis on one hand and used a Biopac device to get his ECG readings, from which he later calculated heart rate. The experiment lasted 106 minutes, resulting in a good amount of data.

What the author reports is that there is no significant correlation between the readings the two devices produce, but to me it looked like the devices mostly agreed -- if you just exclude the clearly fluke readings from Basis:
(image from the post)

The heart rate data was not available on the website, so I decided to try to extract it from one of the plots. Plot #2 had a nice distribution with clearly marked axes, so I went for that.

First, I trimmed the plot to only contain the data points (other features mess up the image analysis and may result in spurious data points): no axes, no labels. Then I used OpenCV to do all the heavy lifting image processing for me: this extensive image processing library is available in C++ and in Python. Using the library, you only need one line to read in the image:

   figure = cv2.imread(sys.argv[1])

To remove artifacts and to prime the figure for later use, I applied some blur and converted the figure to grayscale:

   figure = cv2.medianBlur(figure, 1)
   cvImage = cv2.cvtColor(figure, cv2.COLOR_BGR2GRAY)

Next, I searched for methods to detect circles in a picture -- luckily, that by itself is an area of research and a lot of work has been put into creating various solutions for this problem! Hough circle transform is a de facto standard for identifying circular objects in a picture, and OpenCV has a package implementing it:

   circles = cv2.HoughCircles(cvImage, method=cv2.cv.CV_HOUGH_GRADIENT, dp=2, minDist=1, param1=300, param2=20, minRadius=2, maxRadius=10)



As you can see in the above figures, the method successfully identifies 78 circles -- most of the data points in the plot.

Now that I have the circles, I map them from picture coordinates (0 to width and 0 to height, with the (0,0) point in the upper left corner and the Y axis pointing down) to the original plot coordinates:

   w = figure.shape[1]
   minx = 80
   maxx = 130
   real_w = maxx - minx
   biopac = [minx + c[0] * 1.0 / w * real_w for c in circles[0, :] ]
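The y coordinate maps the same way, with one extra wrinkle: image rows grow downward, so the value has to be flipped before scaling. A small self-contained sketch (the function name and the axis limits here are made up; read the real limits off your own plot):

```python
def pixel_to_data(row, img_height, miny, maxy):
    # image rows grow downward: row 0 is the top of the (trimmed) plot,
    # which corresponds to the axis maximum, so flip before scaling
    return miny + (img_height - row) / img_height * (maxy - miny)

# a circle at the very top of a 400px-tall plot maps to the axis maximum,
# and one at the very bottom maps to the axis minimum
print(pixel_to_data(0, 400, 40, 140))    # 140.0
print(pixel_to_data(400, 400, 40, 140))  # 40.0
```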

Now I can play around with original data and figure out if by removing noisy readings from Basis, we get a correlation with the readings from the superior device.

Correlation on the set of 78 data points I identified from the image is close to the correlation reported by the author of the original comparison: Pearson r = -0.046, p-value = 0.69, Spearman r = 0.263, p-value = 0.020. However, if we remove outliers in the lower range (all points with Basis hrt < 78), we do see a clear correlation in the readings from Basis and Biopac:



Here, Pearson r = 0.828, p-value = 1.8 * 10^-16 and Spearman r = 0.756, p-value = 1.8 * 10^-12 -- which is really good news for a simple $100 Basis band. This filter removes 17 data points -- about 22% of the data -- which is not so great, especially given that we only had 78 points out of 106 minutes' worth of data to begin with.

To conclude: if you have the plots but not the underlying data, you can always hack the plots; Basis bands are ok at measuring high heart rates (which is important when exercising!), but the data may get messier at lower heart rates. Good hacking!


UPD: Python script and the plot used above are available on GitHub.

Thursday, December 12, 2013

Fun with custom tracks in WashU EpiGenome Browser

In the course of working on a multiscale domains project (see [2]), I came to a point where I wanted to plot domains against epigenetic tracks and associated 3C data. The WashU EpiGenome Browser contains a lot of such data and also supports custom tracks -- perfect! Below I describe how to prepare a custom track for overlapping and non-overlapping domains.

Dixon et al [1] domains

Dixon et al domains are non-overlapping and so I plotted them as a categorical track in WashU EpiGenome Browser.

Initial domain file looks like this:

chr1 760000 1280000
chr1 1280000 1840000
chr1 1840000 2320000
...

First, I added a column at the end to represent a category, alternating between categories 1 and 2 for every other domain:

chr1    760000  1280000 1
chr1    1280000 1840000 2
chr1    1840000 2320000 1
...
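Producing that alternating column can be sketched in a few lines of Python (the function name is made up; the input lines are as in the example above):

```python
def add_categories(lines):
    # alternate categories 1 and 2 so the browser colors
    # every other domain differently
    out = []
    for i, line in enumerate(lines):
        chrom, start, end = line.split()[:3]
        out.append(f"{chrom}\t{start}\t{end}\t{1 if i % 2 == 0 else 2}")
    return out

print(add_categories(["chr1 760000 1280000", "chr1 1280000 1840000"]))
# ['chr1\t760000\t1280000\t1', 'chr1\t1280000\t1840000\t2']
```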

Now the file looks like the bedGraph format, and we can use tabix:

bgzip -c dixon.domains.bedgraph > dixon.domains.bedgraph.gz
tabix -p bed dixon.domains.bedgraph.gz

The first command compresses the bedGraph file with bgzip, the second generates the *.tbi index file. Now copy both the *.gz and *.tbi files to your webserver, and you can load the track from the EpiGenome Browser using the "CustomTK" menu.

Dixon domains URL: http://www.cs.cmu.edu/~dfilippo/epitracks/dixon.domains.gz
Domains are alternating yellow and orange blocks (each block is a domain); the 3C matrix for the same cell line (IMR90, a human fibroblast line) is shown in pink:



Multiscale domains [2]

Multiscale domains generated by Armatus change from one gamma setting to another, so it was important that the track support overlapping items. Multiscale domains are similar to a collection of genes (if you consider their various isoforms) and are best represented by a BED annotation track.

As before, initial domain file looked like this:

chr20   61640000        62440000
chr20   60760000        61520000
chr20   60600000        60640000
...

Now, since several domains may start at the same position, we need to sort domains by chromosome and start position (e.g. sort -k1,1 -k2,2n), otherwise tabix will throw an error:

chr20   40000   26280000
chr20   40000   80000
chr20   40000   120000

Next, to fit the BED format, we need to add 3 more fields to every line: a label assigned to the domain (I simply set it to indicate which gamma the domain comes from), a category (I assigned unique categories to individual gamma settings), and a forward/reverse strand indicator (we don't really need the strand information, so it could be replaced with a "."):

chr20   40000   26280000        0.0     1       +
chr20   40000   80000   0.0     1       +
chr20   40000   120000  0.0     1       +
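The field assignment can be sketched like this (a toy sketch, not the script I used: the function name and input tuples are made up, and categories are numbered in order of first appearance of each gamma):

```python
def to_bed6(domains):
    # domains: (chrom, start, end, gamma) tuples; label = gamma,
    # category = a unique number per gamma setting, strand = "+" (unused)
    categories = {}
    out = []
    for chrom, start, end, gamma in domains:
        cat = categories.setdefault(gamma, len(categories) + 1)
        out.append(f"{chrom}\t{start}\t{end}\t{gamma}\t{cat}\t+")
    return out

print(to_bed6([("chr20", 40000, 26280000, "0.0")]))
# ['chr20\t40000\t26280000\t0.0\t1\t+']
```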

Now that the bed file is ready, I use tabix tools again (same as above):

bgzip -c chr20-all-gamma.bed > chr20-all-gamma.bed.gz
tabix -p bed chr20-all-gamma.bed.gz

Upload the newly created *.gz and *.tbi files to the webserver, and they are ready to be used within the WashU Epigenome Browser. Below, Armatus' domains are shown in green and vary between different scales, capturing many alternative domains:


The labels on the domains correspond to the length scale at which they were observed (gamma).

My multiscale domains URL: http://www.cs.cmu.edu/~dfilippo/epitracks/chr20-all-gamma.bed.gz (chromosome 20 only).

References


[1] Dixon et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376-380 (2012).
[2] Filippova et al. Multiscale identification of topological domains in chromatin. WABI 2013.
Preparing custom track (WashU Epigenome Browser help)
Making a custom gene track (Epi Browser author writes out some commands to make a gene track)
Browser file formats (UCSC's FAQ on file formats for the UCSC and Epigenome Browsers)


UPD: one more reference: custom gene tracks (WashU Epigenome Browser help)

Friday, December 06, 2013

Twinlist: When Simple is Powerful

I do not take pills too often; mostly it's something for a headache once every couple of months, or for allergies whenever the flowers are in bloom. However, there are people around us who take many medications daily, especially people with chronic conditions and the elderly. Keeping track of medications within an electronic medical record (EMR) for such patients is, not surprisingly, a challenge: the patient may switch to a generic drug, but the EMR lists it under a brand name, the dosage may be off, or instead of orally the patient now takes the drug intravenously. Reconciling what the patient is actually taking with what is recorded in the system is a boring and daunting task: one has to pay attention to details like "4mg" vs "7 mg" and meticulously compare drug names -- not something a busy physician can spend a lot of time on. However, this problem shows up every time the provider's data is out of date. This could be during an emergency visit to the hospital or during a routine visit to a physician or a specialist. Keeping track of the current medications is important when adjusting or prescribing new treatment, and to avoid prescribing drugs and procedures that may be incompatible with current medications.

The Human-Computer Interaction Lab (HCIL) at the University of Maryland took on this challenge and produced a simple yet powerful way to automate medication list reconciliation (the medical term for the problem described above). Computers are good at exact comparisons -- and that's what researchers at HCIL used to find discrepancies in dosage, route of administration, and frequency for the same drugs coming from two different lists. Inexact and ambiguous matches are then presented to the user, who can decide how to proceed. Through intuitive highlighting and animation that make the process easier to follow, they have made reconciliation quick and managed to reduce human error.


What I appreciate the most about this interface is its simplicity (it is a list, after all). It is easy to identify the drugs that match and the drugs that differ, the drugs that are unique to the hospital and those unique to the patient. Similar drugs are aligned on the same line, visually encoding the relationship. The details that differ between similar drugs are subtly highlighted in yellow so as not to crowd the overall picture with too much information. The careful use of animation makes a seemingly boring task more fun. Power users may skip the animation and use the overhead menu instead.

I am very impressed with HCIL's continued success, and if I ever find myself having to manage 20 different drugs, I would want my physician to use Twinlist for reconciliation! They took a mundane task that no one likes and turned it into an easy exercise for the eyes, improving medical care in the end. Having seen some EMRs and what a tangled mess they are, I believe it is solutions like this -- making mundane data entry/cleaning/updating easier -- that can significantly improve care and save doctors some precious time.

Code and more details about this award-winning design are available on Twinlist's page.

Monday, November 11, 2013

On war and peace

A while back Google released the Ngram Viewer, a tool which gives an opportunity to query the texts of many Google Books all at once. One can spend many hours trying out words and their different combinations and seeing how their frequency of use has changed over the centuries. It tells you about popular trends and dying ideas. Today, through a series of random steps, I found myself on Ngram Viewer querying for "inference", then "war". Here is a plot of occurrences of "war" in books from 1800 to the present day: the two bumps, around the 1920s and the 1940s, are the two world wars. No other time points stand out -- sorry, Iraq, Afghanistan, Sri Lanka, East Timor, and many, many others.
Well, are we doing better in terms of promoting peace? Apparently not:

We use "peace" less and less. Sure, we spoke about peace a little more often during the first and second World Wars, but the general decline in its use continues. I wonder if "peace" is simply being replaced by another word, or does it mean that we are all doomed? A quick search through synonyms of "peace" offers a glimpse of hope in "nonviolence":

and "ceasefire":

although neither of these words is being used quite as often as "peace".

Well, be it non-violence or ceasefire, there is still hope that we will survive as a human race!