Thursday, January 05, 2017

2016 in review

Given some recent changes at Roche, it is unclear when the work I was engaged in will see the light of day. This is not a situation that I am entirely pleased with, but focusing on the positive, I want to take five minutes and talk about the really cool things I learned this year.

Part one -- long reads. Long reads are a pretty cool way to figure out the DNA sequence of living things, and I got to work with and learn the ins and outs of the PacBio long read technology & associated analyses. Long reads offer some significant advantages over short reads: more accurate assembly and gene prediction, more confident alignments, and more detailed species calling in bacterial samples. A lot of the applications are still in the research phase and there are no commercial assays based on long reads yet afaik, but it seems to be the future. Keith Robison has a nice summary covering all the major players trying to win over the long read market, and the competition is looking tough. For those interested in learning more about the technology itself, here is a video from PacBio describing it -- it is truly a feat of engineering many years in the making.

PacBio's Circular Consensus Sequencing

Part two -- viruses. Did you know that HIV is now so well studied that we have an annual consortium that documents how common HIV strains respond to the common HIV drugs given common mutations in those strains? I didn't, but there you go. Despite being well-studied, there is no proven HIV vaccine yet and HIV is still affecting millions of people around the world. The current strategy to fight HIV is to test at-risk populations and prescribe patients a combination of antiretroviral drugs to keep the virus at bay. Sequencing the viral RNA from the patient's blood gives a detailed look into their unique viral population: are all HIV molecules in the population the same? how are they different? what mutations make this population different from other well-studied populations? It also allows us to assess which drugs will or will not work for the patient -- and then design their drug cocktail based on that data instead of giving them an assortment of drugs that may not work. Long reads turn out to be quite useful here: you can study viral evolution at a level of detail that was not possible before, you can phase the mutations (work out which ones occur together on the same viral molecule) to look at subpopulations and build a more nuanced model of drug resistance for the patient, and on and on. This extends to viruses other than HIV as well.
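To make the phasing idea a bit more concrete, here is a toy sketch in Python. It is not the pipeline I worked on: the function name `phase_mutations`, the reference string, and the reads are all made up for illustration, and it assumes every read is already aligned to the start of the reference with no insertions or deletions. The point is only that when a single read spans all the positions of interest, you can count which mutations travel together.

```python
from collections import Counter


def phase_mutations(reads, reference, positions):
    """Count haplotypes: the set of mutations seen together on one read.

    Assumes (for illustration only) that each read is aligned to the start
    of the reference with no indels, so read[i] lines up with reference[i].
    """
    haplotypes = Counter()
    for read in reads:
        # Skip reads that do not span every position of interest.
        if len(read) <= max(positions):
            continue
        # Name each mutation as <ref base><1-based position><read base>.
        mutations = tuple(
            f"{reference[p]}{p + 1}{read[p]}"
            for p in positions
            if read[p] != reference[p]
        )
        haplotypes[mutations] += 1
    return haplotypes


# Made-up example data: two positions of interest (0-based indices 1 and 7).
reference = "ACGTACGTAC"
reads = [
    "ACGTACGTAC",  # no mutations
    "AGGTACGAAC",  # both mutations on the same molecule
    "AGGTACGAAC",  # both mutations again
    "AGGTACGTAC",  # only the first mutation
    "ACGT",        # too short to span both positions, skipped
]
counts = phase_mutations(reads, reference, positions=[1, 7])
for haplotype, n in counts.most_common():
    print(haplotype or "(no mutations)", n)
```

With short reads you would typically see the two mutations on separate fragments and only learn their individual frequencies; here the counts tell you that the double mutant is its own subpopulation.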

Part three -- bacteria. Bacteria are everywhere, and we are just starting to discover the tremendous effects they have on our lives -- not only in a negative sense, but also the role bacteria play in our overall health: we are finding that bacteria are involved in everything from digestion to diabetes and rheumatoid arthritis. Sequencing bacterial DNA from the patient's stool, blood, or saliva allows us to tell which bacteria are present in the sample; sequencing bacterial RNA can give a clue to what those species are doing. From there, we can say whether or not the patient's sample looks normal, whether the patient is rejecting a transplant, whether their gut flora is more like that of a lean or obese person, or whether they have any resistant bacterial strains -- and suggest a targeted antibacterial treatment without destroying the "good" bacteria. Rob Knight is one of the leading researchers in this field and he has a pretty insightful overview on how bacteria make us who we are. Again, long reads open up new possibilities here: most microbial genes are on the order of a thousand base pairs long, so if your sequencing reads are 3000-10000 base pairs long, chances are you can study these genes without worrying too much about the preprocessing steps of assembly or phasing. Other long read sequencing technologies allow streaming sample classification: if you suspect that your sample contains a particular bacterium (say, Y. pestis, which causes bubonic plague), you can stop the sequencing to save $$ and time as soon as you observe enough data to suggest that Y. pestis is indeed in the sample. In one case that took only 40 minutes, as opposed to days for some infectious disease tests.
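The "stop as soon as you have seen enough" idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function `stream_until_detected`, the `min_hits` threshold, and especially the classifier, which just looks for a made-up marker string. A real system would align reads or match k-mers against reference genomes and would use a proper statistical stopping rule.

```python
def stream_until_detected(read_stream, is_target, min_hits=5):
    """Consume reads one at a time and stop once min_hits reads match.

    Returns (detected, reads_seen). The stream is not read past the point
    where the decision is made, which is where the time savings come from.
    """
    hits = 0
    reads_seen = 0
    for read in read_stream:
        reads_seen += 1
        if is_target(read):
            hits += 1
            if hits >= min_hits:
                return True, reads_seen
    return False, reads_seen


# Hypothetical classifier: a made-up marker sequence stands in for
# "this read looks like the organism we suspect".
MARKER = "GATTACA"


def looks_like_target(read):
    return MARKER in read


# Made-up reads arriving from the sequencer in order.
incoming = iter([
    "AAAA", "CCGATTACAGG", "TTTT", "GATTACA", "GGGATTACA", "CCCC", "GATTACAT",
])
detected, reads_seen = stream_until_detected(incoming, looks_like_target, min_hits=3)
leftover = list(incoming)  # reads we never had to look at
print(detected, reads_seen, leftover)
```

Because the reads come from an iterator, the last two never get touched once the third hit arrives, which is the toy version of switching the sequencer off early.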

Apart from learning the biology, the lion's share of my time was taken up by developing and evaluating various computational tools that address the challenges above. One of my former professors said: "We are providing computational tools to enable biologists to do their work", and now that I have worked closely with biologists/chemists/physicists, I can finally get behind that statement. However, I also see that computer scientists can contribute more than just tools for biologists: we can work on ecosystems to manage the (huge) sequencing data and data flows (like DNAnexus & Seven Bridges), on enabling large analyses to discover new biology, or on providing a comprehensive sample-to-result system with full data provenance and automation. This is, perhaps, the most important thing I am learning -- where in the biotech world I fit in as a computational person and what unmet needs are out there.

p.s. I started this blog post with a completely different title, but by the end I realized that it is what it is -- a review of the past year. Hence I am proudly joining the ranks of all the people summing up their 2016 :)