Thursday, September 10, 2020

Making a comfortable dev environment

Coding? Debugging? Configuring? Here are a few simple things that have been very helpful to me and my colleagues.

1. Bash prompt++. 

It helps to put your machine's name, current directory, and branch name into the bash prompt to save you from 1) running hostname a thousand times, 2) running pwd a thousand times, and 3) running git status every other minute. Here is mine:

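# Print the current git branch as " (branch)"; print nothing outside a git repo
parse_git_branch() {
    git branch 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/ (\1)/'
}
# Blue hostname, yellow working directory, then the branch name
PS1="\[\e[34m\]\h\[\e[m\]:\[\e[33m\]\w\[\e[00m\]\$(parse_git_branch)$ "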

It should look something like this:

myMacYeah:~/dev/betteromics (df/fix_commit)$

where myMacYeah is my machine, followed by the current directory, and then the name of the branch I am currently on.

2. Aliases

Create aliases, please! For so many years, I deprived myself of the pleasure of speaking in a more concise command language that only my machine and I would understand... No more! ls -ls becomes an easy ll, git branch -v becomes a friendly gbra (or should I have gone w/ g-bro? 😆), gsta substitutes for git status. Boldly go into the unknown!

3. Git commands autocomplete

This is just plain poetry for thy tired eyes (and fingers) when you don't need to fully type out all the delicate git commands, one letter after another. Download git-completion.bash, save it as ~/.git-completion.bash, add a couple of lines to your ~/.bash_profile, and live in the fast lane. Thank you so much, Shawn O'Pearce!

# add git commands autocomplete (only if the completion script is present)
if [ -f ~/.git-completion.bash ]; then
    . ~/.git-completion.bash
fi

Saturday, October 06, 2018

Lyrics for the pop

This post is about contemporary pop music, of which I hear plenty on the way to and from work. If you couldn't care less about the music -- tune out now.

Little girls (teenage girls?), listen up! Here are some examples of songs that are total nonsense and you should delete them from your playlist:

Demi Lovato "In Case" -- no, do not wait around for anyone who left you of their own accord. Dump that dirty jacket and move on with your life. You have yet to meet 7 bln people who may be more deserving.

Ariana Grande's "One Last Time" -- these lyrics should burn in hell. Do not ever turn this song on if you have a shred of self-respect. "Baby I don't care if you got her in your heart | All I really care is you wake up in my arms" -- really? Woman up, and let the other person be! They already moved on, so should you. And deal w/ the consequences of your own lies ("I was a liar"). And never do that thing you did wrong again. There, I fixed it for you.

On the other hand, these same artists have songs w/ a totally different attitude that I applaud:

Demi Lovato "Really Don't Care", "Fire Starter" -- these are good! There is no self-deprecation, no yearning for those who deserted, there's only jumping off the moving train (don't do this unless you know how to land well) and being a badass. Yeah!

Ariana Grande's "Break Free" -- I wish everyone had the ability / option to break free when they found themselves in a bad situation or relationship. "I don't wanna hear you lie tonight | Now that I've become who I really am" -- you go, girl, do what's right for you. Do not hang around liars or other people who hurt you or hold you back. You can do better. You will do better. Warning: the video is pretty goofy and you have to be ok w/ DJs and dancing on the bridge of a Star Trek starship.

Sia, "1000 Forms of Fear" album -- surprisingly, I have yet to find a song I do not like on this album. No self-deprecation, no whining, no holding onto the past. I've listened to it many times now, and it all rocks. At times dark, but solid.

Other recent artists w/ mostly acceptable lyrics: Daya, The Weeknd, Adele (although I cannot imagine her ever having an upbeat release, hehe).

Why did I write all this? Well, because so much of today's pop is such garbage and sends the wrong messages. A lot of stuff I hear on the radio is like this: you went to the club, I saw you, we went to my place, you have a nice bod, repeat. No substance, no dialogue, no clever rhymes, no moral questions the hero has to find answers to. And if my guess is correct, this is the stuff that filters down to kids and teenagers. So I want us to do better as humanity. Even "Be best" 😂. Now go and clean up your playlist.


Thursday, January 05, 2017

2016 in review

Given some recent changes at Roche, it is unclear when the work I was engaged in will see the light of day. This is not a situation that I am entirely pleased with, but focusing on the positive, I want to take five minutes and talk about the really cool things I learned this year.

Part one -- long reads. Long reads are a pretty cool way to figure out the DNA sequence of living things, and I got to work with and learn the ins and outs of the PacBio long read technology & associated analyses. Long reads offer some significant advantages over short reads, such as more accurate assembly and gene prediction, more confident alignments, and more detailed species calling in bacterial samples. A lot of the applications are still in the research phase and there are no commercial assays based on long reads yet afaik, but it seems to be the future. Keith Robison has a nice summary covering all the major players trying to win over the long read market, and the competition is looking tough. For those interested in learning more about the long read technology itself, here is a video from PacBio describing their long read technology -- it is truly a feat of engineering many years in the making.

PacBio's Circular Consensus Sequencing

Part two -- viruses. Did you know that HIV is now so well studied that we have an annual consortium that documents how common HIV strains respond to the common HIV drugs given common HIV mutations in those strains? I didn't, but there you go. Despite being well-studied, there is no tested HIV vaccine yet and HIV is still affecting millions of people around the world. The current strategy to fight HIV is to test at-risk populations and prescribe patients a combination of antiretroviral drugs to keep the virus at bay. Sequencing the viral RNA from the patient's blood gives a detailed look into their unique viral population: are all HIV molecules in the population the same? how are they different? what mutations make this population different from other well-studied populations? It also allows us to assess which drugs will or will not work for the patient -- and then design their drug cocktail based on that data instead of giving them an assortment of drugs that may not work. Long reads turn out to be quite useful here: you can study the viral evolution at a level of detail that was not possible before, you can phase the mutations to look at subpopulations and build a more nuanced model of drug resistance for the patient, and on and on. This extends to viruses other than HIV as well.

Part three -- bacteria. Bacteria are everywhere, and we are just starting to discover the tremendous effects they have on our lives -- not only in a negative sense, but also the role bacteria play in our overall health: we are finding that bacteria are involved in everything from digestion to diabetes and rheumatoid arthritis. Sequencing bacterial DNA from the patient's stool, blood, or saliva allows us to tell which bacteria are present in the sample; sequencing bacterial RNA can give a clue to what those species are doing. From there, we can say whether or not the patient's sample looks normal, whether the patient is rejecting a transplant, whether their gut flora is more like that of a lean or obese person, or whether they have any resistant bacterial strains -- and suggest a targeted antibacterial treatment without destroying the "good" bacteria. Rob Knight is one of the leading researchers in this field and he has a pretty insightful overview on how bacteria make us who we are. Again, long reads open up new possibilities here: most microbial genes are on the order of a thousand base pairs long, so if your sequencing reads are 3000-10000 base pairs long, chances are you can study these genes without worrying too much about the preprocessing steps of assembly or phasing. Other long read sequencing technologies allow streaming sample classification: if you suspect that your sample contains a particular bacterium (say, Y. pestis, which causes bubonic plague), you can stop the sequencing to save $$ and time as soon as you observe enough data to suggest that Y. pestis is indeed in the sample. In one case that took only 40 minutes as opposed to days for some infectious disease tests.

Apart from learning the biology, the lion's share of my time was taken up by developing and evaluating various computational tools that address the challenges above. One of my former professors said: "We are providing computational tools to enable biologists to do their work", and now that I have worked with biologists/chemists/physicists closely, I can finally get behind that statement. However, I also see that computer scientists can contribute more than just tools for biologists: we can work on ecosystems to manage the (huge) sequencing data and data flows (like DNAnexus & Seven Bridges), on enabling large analyses to discover new biology, or on providing a comprehensive sample-to-result system with full data provenance and automation. This is, perhaps, the most important thing I am learning -- where in the biotech world I fit in as a computational person and what unmet needs are out there.

p.s. I started this blog post with a completely different title, but by the end I realized that it is what it is -- a review of the past year. Hence I am proudly joining the ranks of all the people summing up their 2016 :)

Monday, August 15, 2016

Graduate school and industry (not versus)

It has been almost a year since I defended my PhD and joined a team at a biotech company instead of further pursuing academic positions. I had time to calm down, recharge, and get acclimated to a new-ish life. Some reflection about these two different worlds is due. Here we go. Things I like about biotech:
  • Everything has a very practical value. If it is faster -- do it, if it is more accurate -- do it, if it represents actual biology better -- do it.
  • We survey customer needs and design products in such a way that they benefit most customers. It makes sense -- such an approach should bring more money (duh), but it also has the potential to improve diagnosis/treatments for more people. And that's great.
  • We can run many wet lab experiments to test an idea. It is amazing to know that your approach does work on real data.
  • I went to more conferences and workshops this year than any year before this, and now we are starting collaborations with some people I only admired from afar before.
  • Almost everyone I work with is either a PhD, or has tremendous breadth of knowledge in biology, chemistry, or physics, and I can ask them questions!
  • I already had a promotion -- and didn't have to wait 5 years for it.
  • Good pay (although we still cannot afford a house, haha).
Things that could be better:
  • My commute
  • Sometimes I feel like an outsider when I get excited about computational methods (just for the methods' sake, you know!). But everyone loves that I can parse a *.docx file to extract some sentences and I know how to set up a quick webservice to expand ambiguous bases, so that compensates.
  • Designing to customer needs. The customers do not always know what is possible and what may solve their (non-obvious) needs. Solution to this -- being creative and investing in research instead of always reactively following customer needs.
Now on to graduate school, the praise and criticisms in equal measure. First, things I liked about graduate school:
  • Interesting, driven, self-starting people all around you -- and you get to work with them!
  • Publishing. Publishing is having your efforts recognized and valued, your argument proven solid, and if it gets cited -- hell, you've made a contribution to science! It feels almost like achieving immortal glory in battle and becoming the stuff of legends.
  • Having more time to study obscure comp sci literature and methods.
  • Exploring ideas that may not be immediately practical, but are so out of the box! And then realizing that they actually have cool practical applications.
  • Giving talks (see publishing)
  • Teaching -- a huge time sink, but makes you finally learn the dynamic programming for RNA folding and makes you feel super-smart and important when undergraduates ask you questions about your research.
  • Blending of disciplines, lots of talks & lectures across departments, and cool volunteer opportunities (like SCS4All or TechNights).
  • Having a decent gym very close to the office.

And then things that could be better:
  • There were too few girls, and now I do not have quite as many female friends as I have male friends.
  • Worrying about your work being scooped.
  • Constant turnover -- people graduate all the time, and your friends (or you), people you love working with, eventually move away
  • Because publication is a unit of success and it is easier to fund a new project than to request funding to support an old one, projects are often abandoned after their publication.

My 5 cents.

Sunday, June 12, 2016

Those pesky bugs

I have found a bug in the code featured in the previous post where some of the reads hitting several anchors were mapped incorrectly. So to set the record straight, here are the updated mapping rates for the simple code that does not yet handle mismatches or indels:


| Dataset | % reads w/ a unique correct mapping | % reads w/ at least one correct mapping |
| --- | --- | --- |
| 1Mln reads, 100bp | 91.07 | 100.00 |
| 300K reads, 300bp | 93.32 | 100.00 |
| 10K reads, 1000bp | 94.84 | 100.00 |

I wish that all bugs were like this and automagically improved performance :) Well, onto the mismatches then...

Tuesday, May 24, 2016

How to align reads -- sparse index

So, apparently coming up with an aligner for reads -- long, short, or otherwise -- is not all that difficult, especially when you have no errors. All you need to do is this:
  1. Sample kmers from your reference and store the locations of where these kmers occur ("anchors").
  2. While streaming your reads in, for every kmer in the read: if the kmer is an anchor, add it to the read's list of anchor hits. For most reads, one anchor will be enough to correctly compute the read's location in the reference. For others, you may need to hit other anchors to disambiguate between multiple anchor hits.
  3. Profit.

I wrote some code that does just that: first, sample 20-mers at every 50th position of human chr20. Then simulate the error-free reads by drawing N random start positions from [0, |C| - |R|], with |C| being the chromosome length and |R| being the read length. The most exciting question about aligning such reads is: for reads that hit multiple anchors, how can we efficiently find the pair(s) of anchors that are consistent with where the anchors occur in the read? Some confusing situations may occur (green: reference, orange: anchors, blue: reads) that you'd want to resolve -- a quick solution to the rescue.
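The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the numbers in this post: the function names are made up for the example, the reference is assumed to fit in memory as a plain string, and ambiguity is resolved by a simple vote in which every anchor hit proposes a read start position and the positions with the most agreeing anchors win. The sketch does not verify the rest of the read against the reference.

```python
import random
from collections import Counter, defaultdict

K = 20      # anchor (kmer) length, as in the post
STEP = 50   # sample an anchor at every 50th reference position


def build_sparse_index(reference, k=K, step=STEP):
    """Map each sampled kmer to the reference positions where it was sampled."""
    index = defaultdict(list)
    for pos in range(0, len(reference) - k + 1, step):
        index[reference[pos:pos + k]].append(pos)
    return index


def simulate_reads(reference, n_reads, read_len, rng):
    """Draw error-free reads; valid start positions are 0 .. |C| - |R| inclusive."""
    starts = [rng.randint(0, len(reference) - read_len) for _ in range(n_reads)]
    return [(start, reference[start:start + read_len]) for start in starts]


def map_read(read, index, k=K):
    """Return the candidate start positions supported by the most anchors.

    An anchor found at read offset `off` and reference position `pos`
    votes for the read starting at `pos - off`; consistent anchors agree.
    """
    votes = Counter()
    for off in range(len(read) - k + 1):
        for pos in index.get(read[off:off + k], ()):
            if pos - off >= 0:
                votes[pos - off] += 1
    if not votes:
        return []  # the read hit no anchors at all
    best = max(votes.values())
    return sorted(start for start, n in votes.items() if n == best)
```

A read counts as uniquely and correctly mapped when map_read returns exactly one candidate and that candidate equals the simulated start position; it has at least one correct mapping when the true start is anywhere in the returned list.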



Now to some results:

| Dataset | % reads w/ a unique correct mapping | % reads w/ at least one correct mapping |
| --- | --- | --- |
| 1Mln reads, 100bp | 89.19 | 98.28 |
| 300K reads, 300bp | 91.72 | 98.35 |
| 10K reads, 1000bp | 93.42 | 98.29 |

Pretty good overall (UPD: bwa does superbly on this simple task by getting 100% of the alignments correct for all three datasets). Now, more interesting points are: how do errors affect this mapping rate? My intuition tells me that it would be more of a problem for shorter reads (fewer chances to hit an anchor w/ error-free sequence) than for the long reads. Can you select anchors in a more reasonable way? Ideally, you would want fewer anchors to make such an index smaller, but at the same time you'd want to guarantee that a read of a given length hits at least one anchor. And what can you do for reads that hit no anchors at all (the anchor may be near)? I have some ideas, but will have to see if they work out and how that affects performance.
UPD2: this approach is very similar to how BLAST and BLAT get their initial seeds before performing full Smith-Waterman (SW).
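For evenly spaced anchors like the ones used here (20-mers at every 50th position), the "at least one anchor" guarantee has a simple closed form, at least for error-free reads away from the end of the reference: a read of length L fully contains an anchor whenever L >= step + k - 1. The small sketch below states that bound and a brute-force check of it; the function names are made up for this illustration.

```python
def min_read_length_for_guaranteed_hit(step, k):
    """Shortest error-free read that must fully contain at least one anchor,
    assuming anchors of length k start at every `step`-th reference position
    and the read does not run past the last sampled anchor."""
    return step + k - 1


def contains_anchor(read_start, read_len, step, k):
    """True if some anchor start a (a multiple of step) satisfies
    read_start <= a and a + k <= read_start + read_len."""
    first_anchor = -(-read_start // step) * step  # round up to a multiple of step
    return first_anchor + k <= read_start + read_len
```

With step = 50 and k = 20 the bound is 69 bp, so all three simulated read lengths (100, 300, and 1000 bp) are long enough to contain an anchor when there are no errors. Sampling less densely shrinks the index but raises this minimum read length one base per extra position of spacing.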

Tuesday, December 08, 2015

Aligning long reads

With long reads becoming more popular and accessible, major read aligners responded by providing mapping capabilities that can work with long, high-error reads. STAR's Alex Dobin has put out a tool, STARlong, that can map PacBio's Circular Consensus Sequencing (CCS) reads. You would need to install it alongside your regular STAR binary (instructions here). Once you have that, generate an index for your genome and tune some parameters to get the alignments:
# assuming STARlong is installed
# generate an index for your genome:
...
# align reads (pass your index directory to --genomeDir and your reads file to --readFilesIn)
STARlong --runThreadN 20 --runMode alignReads --seedPerReadNmax 10000 --genomeDir  --readFilesIn 

BWA's Heng Li has also provided a version of bwa mapper that can align Oxford Nanopore and PacBio's long subreads and shorter CCS reads. Here is how he suggests running the alignments:
bwa index ref.fa
bwa mem -x pacbio ref.fa pacbio.fq > aln.sam
bwa mem -x ont2d ref.fa ont-2D.fq > aln.sam

Further discussions on aligning long Nanopore and PacBio reads:
[ Biostars ] [ choosing STAR parameters for long reads ] [ STAR parameters to use with IsoSeq reads ]

Saturday, September 05, 2015

Those pesky author lists

Having recently had to convert a lot of author lists to a BibTeX-compliant format, I wrote a little JavaScript to help me do the task:



This would convert the above author names to "Hurt, Jessica A. and Thibodeau, Stacey A. and Hirsh, Andrew S.". I found this especially useful for those Nature papers with 10+ authors :)
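For illustration, here is a rough Python sketch of the same kind of conversion. It is not the JavaScript mentioned above, and it rests on assumptions: the input is taken to be a comma-separated list of "First M. Last" names, and the last word of each name is treated as the family name. Multi-word surnames ("van der Waals") and suffixes ("Jr.") would need extra handling.

```python
def to_bibtex_authors(author_line):
    """Convert "First M. Last, First M. Last" into "Last, First M. and Last, First M."."""
    converted = []
    # Treat a trailing "and" the same way as a comma separator.
    for name in author_line.replace(" and ", ", ").split(","):
        parts = name.split()
        if not parts:
            continue  # skip empty chunks left by stray commas
        given, family = parts[:-1], parts[-1]
        converted.append(f"{family}, {' '.join(given)}" if given else family)
    return " and ".join(converted)
```

For example, to_bibtex_authors("Jessica A. Hurt, Stacey A. Thibodeau, Andrew S. Hirsh") returns "Hurt, Jessica A. and Thibodeau, Stacey A. and Hirsh, Andrew S.".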

Tuesday, October 21, 2014

The colors of my life

Inspired by vibrant colors of Indian outfits, I have created a new stylesheet for my MacOS terminal:


Juuust enough contrast for me, colors are not at all discordant when taken all together and definitely brighten up my day, esp in contrast to the sulky fall weather. 

La-la-la-loves it!

MacOS stylesheet file (exported from the Terminal app)

Thursday, September 11, 2014

Science communication

While there is a lot of training in how to communicate ideas clearly to other scientists, the need for and the specifics of speaking to a more general public are rarely addressed in PhD programs. It may seem that if you can give a clear and concise talk on your latest findings at a conference, then you should easily be able to convey the same information to an audience of people without all the specialized training you have invested in. When this approach fails, it is easy to blame the audience: "If only they understood science better!", or claim that the knowledge gap is too wide and the audience should just trust you, the scientist, because you have the training and experience. This is not an example of successful communication, but it can be made into one. John Radzilowicz, a speaker at CMU's Public Communication for Researchers (PCR) seminar, outlines 10 important points for successful communication to a non-scientific audience:

  1. Know your audience (also: rule #1 from English101). Assess what your audience might already know and move past it -- people will enjoy learning new things rather than covering the same high school material all over again. Are there any big debates in the area you are covering? Be prepared to offer some comment on them if asked; however, it is fine to admit that this is out of your area of expertise.
  2. Understand the goals of your communication. What are you trying to achieve by giving this talk: inspire and wow people? dispel some myths? gather support for formal education? make knowledge more accessible (e.g. explain genetic testing)? Given your goals, try to identify one or two take-home messages and make sure to repeat these several times throughout your presentation.
  3. Report accurate data, but try to make it accessible. An example is: how many Sun masses should the star be to become a supernova? The precise answer may be 8.2 - 15.6 Sun masses. But simply saying that "If a star is around 10 Sun masses, it may become a supernova" will make it easier for your audience to remember the number -- and it is still within the exact range! Try to explain things in the simplest way, act as a "science translator".
  4. Be fun and engaging. This could be hard for some of us who live in our underground labs and see no sunlight, but simply smiling as the audience pours into the room and asking them "How are you doing?" before launching into the talk increases retention of information drastically! A dialog is the best thing that can happen.
  5. Remember that your audience is not a blank slate. They come from a variety of backgrounds, belief systems, and have varying levels of knowledge. Ask them what they know -- and try to incorporate that into your talk. Survey their beliefs -- and be respectful of them, even if you do not agree with some of it. This also helps to keep your audience engaged.
  6. Establish trust. Explain how we know what we know. It helps to note which phenomena we can explain and which ones are still a mystery. Do not shy away from facts that were once accepted truth and were later found to be false -- this is the self-correcting power of science that makes it trustworthy. Share the sense of wonder and excitement with your audience -- it could be inspirational.
  7. Be ready to give more than just facts. You might be prepared to talk about your favorite topic, but you may find yourself explaining the scientific process, how experiments are set up and theories tested -- be ready to fill such gaps if the situation demands it. You might need to cover some of the parts multiple times to get the point across or go into more details -- it is part of science exploration.
  8. Acknowledge the humanity of science. Science is done by humans and humans make mistakes. There is plagiarism and fraud, yet science has an amazing self-correcting quality. It may take years, but the truth comes out (how about them arsenic life forms). Science can be used for good or for evil -- by people, but that does not make the knowledge better or worse.
  9. Leave them wanting more -- you will have to leave out some of the details to make the talk more accessible, but mention that "the whole picture is more complex". For the curious, this will be enough to pique their interest.
  10. "Walk the walk". If you want to be better at science communication, you have to start learning more about it: watch documentaries by/about such public science communicators as Carl Sagan, Richard Feynman, and Neil deGrasse Tyson. Learn from the best!
And now I will practice these in conversations with my mom :)