Saturday, September 05, 2015

Those pesky author lists

Having recently had to correct a lot of author lists to BibTeX-compliant list format, I wrote a little Javascript to help me do the task:



This would convert the above author names to be "Hurt, Jessica A. and Thibodeau, Stacey A. and Hirsh, Andrew S. ". I found this especially useful for those Nature papers with 10+ authors :)

Tuesday, October 21, 2014

The colors of my life

Inspired by vibrant colors of Indian outfits, I have created a new stylesheet for my MacOS terminal:


Juuust enough contrast for me, colors are not at all discordant when taken all together and definitely brighten up my day, esp in contrast to the sulky fall weather. 

La-la-la-loves it!

MacOS stylesheet file (exported from the Terminal app)

Thursday, September 11, 2014

Science communication

While there is a lot of training in how to communicate ideas clearly to other scientists, the need and specific aspects of speaking to a more general public are rarely addressed in PhD programs. It may seem that if you can give a clear and concise talk on your latest findings at a conference, then you should easily be able to convey the same information to an audience of people without all the specialized training you have invested in. When this approach fails, it is easy to blame the audience: "If only they understood science better!", or claim that the knowledge gap is too wide and the audience should just trust you, the scientist, because you have the training and experience. This is not an example of successful communication, but it can be made into one. John Radzilowicz, a speaker at CMU's Public Communication for Researchers (PCR) seminar outlines 10 important points for successful communication to non-scientific audience:

  1. Know your audience (also: rule #1 from English101). Assess what your audience might already know and move past it -- people will enjoy learning new things rather than covering the same high school material all over again. Are there any big debates in the area you are covering? Be prepared to offer some comment on them if asked, however, it is fine to admit that this is out of your area of expertise.
  2. Understand the goals of your communication. What are you trying to achieve by giving this talk: inspire and wow people? dispel some myths? gather support for formal education? make knowledge more accessible (e.g. explain genetic testing)? Giving your goals, try to identify one or two take home messages and make sure to repeat these several times throughout your presentation.
  3. Report accurate data, but try to make it accessible. An example is: how many Sun masses should the star be to become a supernova? The precise answer may be 8.2 - 15.6 Sun masses. But simply saying that "If a star is around 10 Sun masses, it may become a supernova" will make it easier for your audience to remember the number -- and it is still within the exact range! Try to explain things in the simplest way, act as a "science translator".
  4. Be fun and engaging. This could be hard to some of us who live in their underground labs and see no sunlight, but simply smiling as the audience pours into the room and asking them "How are you doing?" before launching into the talk increases retention of information drastically! A dialog is the best thing that can happen.
  5. Remember that your audience is not a blank slate. They come from a variety of backgrounds, belief systems, and have varying levels of knowledge. Ask them what they know -- and try to incorporate that into your talk. Survey their beliefs -- and be respectful of them, even if you do not agree with some of it. This also helps to keep your audience engaged.
  6. Establish trust. Explain how we know what we know. It helps to note which phenomena we can explain and which ones are still a mystery. Do not shy away from facts that were once accepted truth and were later found to be false -- this is the self-correcting power of science that makes it trustworthy. Share the sense of wonder and excitement with your audience -- it could be inspirational.
  7. Be ready to give more than just facts. You might be prepared to talk about your favorite topic, but you may find yourself explaining the scientific process, how experiments are set up and theories tested -- be ready to fill such gaps if the situation demands it. You might need to cover some of the parts multiple times to get the point across or go into more details -- it is part of science exploration.
  8. Acknowledge the humanity of science. Science is done be humans and humans make mistakes. There is plagiarism and fraud, yet science has an amazing self-correcting quality. It may take years, but the truth comes out (how about them arsenic life forms). Science can be used for good or for evil -- by people, but that does not make the knowledge better or worse.
  9. Leave them wanting more -- you will have to leave out some of the details to make the talk more accessible, but mention that "the whole picture is more complex". For the curious, this will be enough to pique their interest.
  10. "Walk the walk". If you want to be better at science communication, you have to start learning more about it: watch documentaries by/about such public science communicators as Carl Sagan, Richard Feynman, and Neil deGrasse Tyson. Learn from the best!
And now I will practice these in conversations with my mom :)





Thursday, June 12, 2014

Plurals in English

If mouse becomes mice, then many a house become hice.
And if a singular for mice is a mouse, then singular for lice is a louse!
And if a singular for mice is a mouse, then singular for many grains of rice is a rouse!

If you have a single ox, then many of them are oxen.
So, if you have a single box, when moving, you use many boxen.

Have you ever been chased by a giggle of kaggle of foxen?

(In collaboration with R. Patro)

Friday, April 18, 2014

How to extract data from a figure - and how good is Basis?

In science, making results reproducible and sharing the data used to generate the plots in a publication is essential. Sometimes there are cases where your eyes and gut are telling you that maybe there is something else in the data, but the data is not available. So what are we to do? We can hack the plots using a little bit of image processing!

For a while, I was using a Basis band to track a few stats about my body as I go through the day on my normal routine. It is nice to see how much you slept, the (ir)regular bed time, how heart beat goes up when you walk up the stairs or run to catch the bus -- these kinds of things. Reading about data libration efforts for Basis, I stumbled upon an article that compared heart beat readings from Basis with those from a medical grade device. The experiment was simple: the guy wore the Basis on one hand and used a Biopac device to get his ECG readings from which he later calculated heart beat. Experiment lasted 106 min resulting in a good amount of data.

What the author reports is that there is no significant correlation between the readings the two devices produce, but to me it looked that the devices mostly agreed --- if you just exclude the clearly fluke readings from Basis:
(image from the post)

The heart beat data was not available on the website, so I decided to try to extract data from one of the plots. Plot #2 had a nice distribution with clearly marked axes, so I went for that.

First, I trimmed the plot to only contain the data points (other features mess up the image analysis and may result in spurious data points): no axes, no labels. Then I used OpenCV to do all the heavy lifting image processing for me: this extensive image processing library is available in C++ and in Python. Using the library, you only need one line to read in the image:

   figure = cv2.imread(sys.argv[1])

To remove artifacts and to prime the figure for later use, I applied some blur and converted the figure to grayscale:

   figure = cv2.medianBlur(figure, 1)
   cvImage = cv2.cvtColor(figure, cv2.COLOR_BGR2GRAY)

Next, I searched for methods to detect circles in a picture -- luckily, that by itself is an area of research and a lot of work has been put into creating various solutions for this problem! Hough circle transform is a de facto standard for identifying circular objects in a picture, and OpenCV has a package implementing it:

   circles =  cv2.HoughCircles(cvImage, method=cv2.cv.CV_HOUGH_GRADIENT, dp=2, minDist=1, param1=300, param2=20, minRadius=2, maxRadius=10



As you can see in the above figures, the methods successfully identifies 78, or most, of the circles.

Now that I have the circles, I map them from picture coordinates (0 to width, 0 to height with 0,0 point in the upper left corner and Y axis pointing down) to the original plot coordinates:

   w = figure.shape[1]
   minx = 80
   maxx = 130
   real_w = maxx - minx
   biopac = [minx + c[0] * 1.0 / w * real_w for c in circles[0, :] ]

Now I can play around with original data and figure out if by removing noisy readings from Basis, we get a correlation with the readings from the superior device.

Correlation on the set of 78 data points I identified from the image is close to the correlation reported by the author of the original comparison: Pearson r = -0.046, p-value = 0.69, Spearman r = 0.263, p-value = 0.020. However, if we remove outliers in the lower range (all points with Basis hrt < 78), we do see a clear correlation in the readings from Basis and Biopac:



Here, Pearson r = 0.828, p-value = 1.8 * 10^-16 and Spearman r = 0.756, p-value 1.8 * 10^-12 -- which is really good news for a simple $100 Basis band. This filter removes 17 data points -- about 28% of the data, which is not so great, especially given the fact that we only had 78 points out of 106 minutes worth of data to begin with.

To conclude: if you have the plots, but do not have the underlying data, you can always hack the plots; Basis bands are ok in measuring high heart rate (which is important when exercising!), but data may get messier at lower heart beat. Good hacking!


UPD: Python script and the plot used above are available on GitHub.

Thursday, December 12, 2013

Fun with custom tracks in WashU EpiGenome Browser

In a course of working on a multiscale domains project (see [2]), I came to a point where I wanted to plot domains against the epigenetic tracks and associated 3C data. WashU EpiGenome Browser contains a lot of such data and also supports custom tracks -- perfect! Below I describe how to prepare a custom track for overlapping and non-overlapping domains.

Dixon et al [1] domains

Dixon et al domains are non-overlapping and so I plotted them as a categorical track in WashU EpiGenome Browser.

Initial domain file looks like this:

chr1 760000 1280000
chr1 1280000 1840000
chr1 1840000 2320000
...

First, I added a column at the end to represent a category and alternate between the categories 1 and 2 for every other domain:

chr1    760000  1280000 1
chr1    1280000 1840000 2
chr1    1840000 2320000 1
...

Now, the file looks like a bedGraph format and we can use tabix:

bgzip -c dixon.domains.bedgraph > dixon.domains.bedgraph.gz
tabix -p bed dixon.domains.bedgraph.gz

The first command gzips the bedGraph file, the second one generates an index *.tbi file. Now copy both *.gz and *.tbi to your webserver and you can load the tracks from the EpiGenome Browser using the "CustomTK" menu.

Dixon domains URL: http://www.cs.cmu.edu/~dfilippo/epitracks/dixon.domains.gz
Domains are alternating yellow and orange blocks (each block is a domain), 3C matrix for the same cell line (IMR90, or human fibroblast) is shown in pink:



Multiscale domains [2]

Multiscale domains generated by Armatus change from one gamma setting to another, and so it was important that the track supports overlapping items. Multiscale domains are similar to a collection of genes (if you consider their various isoforms) and are best represented by a bed annotation track.

As before, initial domain file looked like this:

chr20   61640000        62440000
chr20   60760000        61520000
chr20   60600000        60640000

...

Now, since we may have several domains start at the same position, we need to sort domains by the start position, or otherwise tabix will throw an error:

chr20   40000   26280000
chr20   40000   80000

chr20   40000   120000

Next, to fit in with the format of the a bed file, we need to add 3 more fields to every line: a label assigned to this domain (a simply set it to indicate what gamma the domain comes from), a category (in the case of domains I assigned unique categories to individual gamma settings), and an indicator of the forward/reverse strand (we don't really need the forward/reverse information, so it could be replaced with a "."):

chr20   40000   26280000        0.0     1       +
chr20   40000   80000   0.0     1       +
chr20   40000   120000  0.0     1       +

Now that the bed file is ready, I use tabix tools again (same as above):

bgzip -c chr20-all-gamma.bed > chr20-all-gamma.bed.gz
tabix -p bed chr20-all-gamma.bed.gz

Upload the newly created *.gz and *.tbi files to the webserver and they are ready to be used within the WashU Epigenome Browser. Below, Armatus' domains are shown in green and vary between different scales capturing many alternative domains:


The labels on the domains correspond to the length scale at which they were observed (gamma).

My multiscale domains URL: http://www.cs.cmu.edu/~dfilippo/epitracks/chr20-all-gamma.bed.gz (chromosome 20 only).

Referi


[1] Dixon et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 485 (376-380). 2012.
[2] Filippova et al. Multiscale identification of topological domains in chromatin. WABI 2013.
Preparing custom track (WashU Epigenome Browser help)
Making a custom gene track (Epi Browser author writes out some commands to make a gene track)
Browser file formats (UCSC's FAQ on file formats for the UCSC and Epigenome Browsers)


UPD: one more reference custom gene tracks (WashU Epigenome Browser help)

Friday, December 06, 2013

Twinlist: When Simple is Powerful

I do not take pills too often, mostly it's something for a headache once every couple of months or allergy whenever the flowers are in bloom. However, there are people around us that take many medications daily, especially people with chronic conditions and the elderly. Keeping track of medications within an electronic medical record (EMR) for such patients is, not surprisingly, a challenge: the patient may switch to a generic drug, but the EMR lists it under a brand name, the dosage may be off, or instead of orally the patient now takes the drug intravenously. Reconciling what the patient is actually taking and what is recorded in the system is a boring and daunting task: one has to pay attention to details like "4mg" vs "7 mg" and meticulously compare drug names -- not something a busy physician can spend a lot of time on. However, this problem shows up every time the provider's data is out of date. This could be during an emergency visit to the hospital or during a routine visit to a physician or a specialist. Keeping track of the current medications is important when adjusting or prescribing new treatment and as to avoid prescribing drugs and procedures that may be incompatible with current medications.

The Human-Computer Interaction Lab (HCIL) at University of Maryland took on this challenge and produced a simple, yet powerful way to automate medication list reconciliation (a medical term for a problem described above). Computers are good at exact comparisons -- and that's what researchers at HCIL used to find discrepancies in dosage, path of administration, and frequency for the same drugs coming from two different lists. Inexact and ambiguous matches are then presented to the user who can decide how to proceed. Through intuitive highlighting and animation that makes it easier to follow the process they have made reconciliation quick and managed to reduce human error.


What I appreciate the most about this interface is its simplicity (it is a list after all). It is easy to identify the drugs that match and the drugs that differ, drugs that are unique to the hospital and drugs that are unique to a patient. Similar drugs are aligned on the same line visually encoding the relationship. The details that differ between similar drugs are subtly highlighted in yellow as to not crowd the overall picture with too much information. The careful use of animation makes a seemingly boring task more fun. Power users may skip the animation and use the overhead menu instead.

I am very impressed with HCIL's continued success and if I ever find myself having to manage 20 different drugs, I would want my physician to use Twinlist for reconciliation! They took a mundane task that no one likes and turned it into an easy exercise for the eyes improving medical care in the end. Having seen some EMRs and what a tangled mess they are, I believe it is solutions like this -- making mundane data entry/cleaning/updating -- that can significantly improve care and save doctors some precious time.

Code and more details about this award-winning design are available on Twinlist's page.

Monday, November 11, 2013

On war and peace

A while back Google released an Ngram Viewer tool which gives an opportunity to query the texts of many Google Books all at once. One can spend many hours trying out words and their different combinations and seeing how their frequency of use changed over the centuries. It tells you about popular trends and dying ideas. Today, through a series of random steps I found myself on Ngram Viewer quering for "inference", then "war". Here is a plot of occurrences of "war" in the many books since 1800 to the present day: Two bumps around 1920ies and 1940 are the two world wars. No other time points stand out - sorry, Iraq, Afghanistan, Sri-Lanka, East Timor, and many-many others.
Well, are we doing better in terms of promoting peace? Apparently not:

We use "peace" less and less. Sure, we spoke about peace a little more often during the first and second World Wars, but the general decline in its use continues. I wonder if "peace" is simply being replaced by another word, or does it mean that we are all doomed? A quick search through synonyms of "peace" offers a glimpse of hope in "nonviolence":

and "ceasefire":

although neither of these words is being used quite as often as "peace".

Well, be it non-violence or ceasefire, there is still hope that we will survive as a human race!

Thursday, August 15, 2013

Alternative domains highlighted in a review

Spatial proximity between far-away parts of DNA has implications for DNA accessibility, gene expression and regulation. The information about which parts of DNA are physically co-located in the nucleus can be obtained through various Hi-C experiments: 3C (chromatin conformation capture), 4C (circularized 3C), 5C (carbon copy 3C), and some others. Our research group has been working a lot with Hi-C lately, so I made it a habit to scan my Feedly collection of titles from various journals for new relevant Hi-C material. As I was browsing, I saw this article in Nature Reviews talking about Hi-C data. The keypoints section drew my attention because our research group has developed some methods that fit with these outlined research problems almost directly! In particular, I wanted to highlight these two points:
  • Mining increasingly comprehensive chromatin interaction maps for chromosomal domains and complete genomes requires novel computational methods and modeling tools.
  • Statistical analysis of Hi-C data identifies multiple scales of domain organization: larger (1–10 Mb) chromosomal compartments and smaller (less than 1 Mb) topologically associating domains
Topological domains were first mentioned in a paper by Dixon et al. that appeared in Nature last year. Domains are contiguous segments of DNA that self-interact a lot more frequently than they interact with the rest of the chromosome. The original paper proposed an HMM method for finding domains and reported a collection of domains found by their algorithm. The article got cited over a hundred times since then and these domains have appeared in multiple analyses.

However, as the review above suggests, there may be other domains that overlap, nest in, or completely contain domains reported by Dixon et al. One look at the interaction matrix is enough to convince oneself of their existence:
Here I plotted a submatrix of human chromosome 22 (hESC cells) with each cell in the matrix representing 40Kbp of sequence. I highlighted several potential domains in purple, but there are many more just in this one snapshot.

Guided by what we saw in the matrices, we have developed a simple dynamic programming approach that finds alternative domains of various sizes. When applied to the same Hi-C interaction matrices that were used by Dixon et al, we were able to identify domains that are significantly different from Dixon's domains, yet are, in some cases, more enriched for certain chromatin marks. We are working on an extension to this problem that will find other optimal solutions to our formulation and plan on releasing the code soon. Our preprint is already posted on the ArXiv and will appear in this year's WABI, and now there is a review that calls for methods like ours to search for alternative domains. What a boost of confidence!

Sunday, November 18, 2012

Dental health insurance

Going to a dentist is a stressful experience for most of us. However, trying to understand your dental insurance may stress you out even more, especially if your insurance is anything like mine. Here is a piece of paper that claims to "explain" the charges from my recent visit to a dentist to get a filling:

I studied this piece of paper in detail for some time, and, after trudging past the quirky vocabulary, I remain puzzled:
Problem 1: What? Does $200 + $200 + $200 + $200 equal $400?
Problem 2: What is "Allowance"? What determines how much the insurance company will pay for a procedure?
Problem 3: How is "amount paid" determined? Why is the "allowed amount" on line 4 $60, but insurance company decides to pay only $16? How is line 4 different from line 2: same procedure, same charge and allowance, but insurance company made a different call on that. Why?
Problem 4: What are these codes in the "Remarks" section? Aaaah, there they are on the next page. At least an indication of what these codes are would be helpful. Or, you can go crazy and explain them away on this very page in the whitespace below the table!

The company's website, where I went to check what I am covered for, is also... wanting. Here is an example of what one sees when they go to check their coverage:
Looks reasonable, doesn't it? But hold on, let's check out the coverage details for the fillings:
Whoa, how do I navigate through this mess? Where are the fillings?
Problem 1: Is this a list? Is this free text?  Are "Post removal" and "Enamel Microabrasion" the only two procedures (due to the visual cues - the dashes) or is every line a separate procedure?
Problem 2: Are the procedures ordered in any reasonable way, say, alphabetic, making it easier to navigate? How can one search for a specific procedure other than by reading the whole list?
Problem 3: What are these codes in parentheses? I guess these are procedure codes, but how do I know if my dentist is about to install D2391 and not D2394 into my mouth? And since there are codes for everything, why isn't there a code for the "comparable amalgam filling"?
Problem 4: Why wouldn't the text span the whole line?..

I think you get the point: either I'm a high school dropout, or this is really hard to read for a common person. A couple of gentle touches will make the "EOB" better, same goes for coverage - just adding some organization to the text improves its readability dramatically:
(the sad truth is nothing is covered). 
After trying to understand my dental benefits and trying to read  the "explanation" of such, I can only agree with our president: yes, we need a reform in health care. But this reform has to start at the very beginning, with people being able to understand their insurance. Improving the communication and readability of the benefits and coverage is the easiest step to make and, possibly, one of the most important.