Friday, April 18, 2014

How to extract data from a figure - and how good is Basis?

In science, making results reproducible and sharing the data used to generate the plots in a publication is essential. Sometimes there are cases where your eyes and gut are telling you that maybe there is something else in the data, but the data is not available. So what are we to do? We can hack the plots using a little bit of image processing!

For a while, I was using a Basis band to track a few stats about my body as I go through my normal daily routine. It is nice to see how much you slept, the (ir)regular bedtime, how your heart rate goes up when you walk up the stairs or run to catch the bus -- these kinds of things. Reading about data liberation efforts for Basis, I stumbled upon an article that compared heart rate readings from Basis with those from a medical-grade device. The experiment was simple: the author wore the Basis on one hand and used a Biopac device to record his ECG, from which he later calculated his heart rate. The experiment lasted 106 minutes, resulting in a good amount of data.

The author reports that there is no significant correlation between the readings of the two devices, but to me it looked as though the devices mostly agreed, provided you exclude the clearly fluke readings from Basis:
(image from the post)

The heart beat data was not available on the website, so I decided to try to extract data from one of the plots. Plot #2 had a nice distribution with clearly marked axes, so I went for that.

First, I trimmed the plot to only contain the data points (other features mess up the image analysis and may result in spurious data points): no axes, no labels. Then I used OpenCV to do all the heavy lifting image processing for me: this extensive image processing library is available in C++ and in Python. Using the library, you only need one line to read in the image:

   figure = cv2.imread(sys.argv[1])

To remove artifacts and to prime the figure for later use, I applied some blur and converted the figure to grayscale:

   # Note: a kernel size of 1 leaves the image unchanged; use 3 or 5 for actual smoothing.
   figure = cv2.medianBlur(figure, 1)
   cvImage = cv2.cvtColor(figure, cv2.COLOR_BGR2GRAY)

Next, I searched for methods to detect circles in a picture. Luckily, this is an area of research in its own right, and a lot of work has gone into solving the problem. The Hough circle transform is a de facto standard for identifying circular objects in a picture, and OpenCV has a function implementing it:

   # On OpenCV 3 and later the constant is cv2.HOUGH_GRADIENT instead of cv2.cv.CV_HOUGH_GRADIENT.
   circles = cv2.HoughCircles(cvImage, method=cv2.cv.CV_HOUGH_GRADIENT, dp=2, minDist=1, param1=300, param2=20, minRadius=2, maxRadius=10)



As you can see in the above figures, the method successfully identifies 78 circles, which is most of them.
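To get a feel for what the transform does under the hood, here is a toy version of the classic voting scheme for circles of a known radius. It is a sketch in plain Python and not what OpenCV actually runs (its gradient method is a more efficient variant); the function names, the 40x30 image size and the single synthetic circle are all made up for illustration.

```python
import math

def circle_edge(cx, cy, radius, n=72):
    """Hypothetical helper: rasterize the outline of one circle into pixel coordinates."""
    points = set()
    for k in range(n):
        t = 2 * math.pi * k / n
        points.add((int(round(cx + radius * math.cos(t))),
                    int(round(cy + radius * math.sin(t)))))
    return sorted(points)

def hough_votes(edge_points, radius, width, height, n_angles=360):
    """Accumulate votes for candidate centers of circles with a known radius."""
    acc = [[0] * width for _ in range(height)]
    for x, y in edge_points:
        voted = set()  # each edge pixel votes at most once per accumulator cell
        for k in range(n_angles):
            theta = 2 * math.pi * k / n_angles
            # Every point at distance `radius` from the edge pixel could be the center.
            a = int(round(x - radius * math.cos(theta)))
            b = int(round(y - radius * math.sin(theta)))
            if 0 <= a < width and 0 <= b < height and (a, b) not in voted:
                voted.add((a, b))
                acc[b][a] += 1
    return acc

def best_center(acc):
    """Return (x, y) of the accumulator cell that collected the most votes."""
    votes, x, y = max((v, x, y) for y, row in enumerate(acc) for x, v in enumerate(row))
    return x, y

# Synthetic example: one circle of radius 6 centered at (20, 15) in a 40x30 image.
edge = circle_edge(20, 15, 6)
acc = hough_votes(edge, 6, 40, 30)
print(best_center(acc))
```

Each edge pixel votes for every position that could be the center of a circle passing through it, and the true center is the one cell that collects a vote from every edge pixel. A real detector also has to search over radii and pick several peaks, which is roughly what the minRadius, maxRadius and param2 arguments control.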

Now that I have the circles, I map them from picture coordinates (0 to width, 0 to height with 0,0 point in the upper left corner and Y axis pointing down) to the original plot coordinates:

   w = figure.shape[1]
   minx = 80
   maxx = 130
   real_w = maxx - minx
   biopac = [minx + c[0] * 1.0 / w * real_w for c in circles[0, :] ]
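The same idea works for the vertical axis, with one extra twist: pixel rows grow downwards while plot values grow upwards, so the Y coordinate has to be flipped. Below is a small sketch of a helper that maps both coordinates at once. It assumes the image was trimmed exactly to the axis limits; the function name, the 500x400 image size and the Y range of 40 to 140 are placeholders I made up, only the 80 to 130 X range comes from the snippet above.

```python
def pixel_to_data(px, py, width, height, x_range, y_range):
    """Map pixel coordinates (origin top-left, Y pointing down) to plot coordinates."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    x = x_min + px / float(width) * (x_max - x_min)
    # Flip the vertical axis: pixel row 0 is the top of the plot, i.e. y_max.
    y = y_max - py / float(height) * (y_max - y_min)
    return x, y

# Hypothetical 500x400 pixel image; the Y range is a placeholder, not the real axis.
point = pixel_to_data(250, 100, 500, 400, (80, 130), (40, 140))
print(point)
```

With the circles from HoughCircles, this would be called once per detected circle, passing c[0] and c[1] as the pixel coordinates.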

Now I can play around with the recovered data and figure out whether removing the noisy readings from Basis gives us a correlation with the readings from the superior device.

The correlation on the set of 78 data points I identified from the image is close to the correlation reported by the author of the original comparison: Pearson r = -0.046, p-value = 0.69; Spearman r = 0.263, p-value = 0.020. However, if we remove the outliers in the lower range (all points with a Basis heart rate < 78), we do see a clear correlation between the readings from Basis and Biopac:



Here, Pearson r = 0.828, p-value = 1.8 * 10^-16 and Spearman r = 0.756, p-value = 1.8 * 10^-12 -- which is really good news for a simple $100 Basis band. This filter removes 17 data points -- about 22% of the data, which is not so great, especially given that we only had 78 points out of 106 minutes' worth of data to begin with.
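For readers who want to repeat this kind of check on their own extracted points, here is a minimal sketch of the filter-then-correlate step in plain Python. The helper names and the eight (Biopac, Basis) pairs are invented for illustration and are not the data from the plot; only the threshold of 78 matches the filter described above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / float(n)
    my = sum(ys) / float(n)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman_r(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Made-up (biopac, basis) pairs with two low "fluke" Basis readings.
pairs = [(85, 86), (90, 91), (95, 60), (100, 99),
         (105, 104), (110, 55), (115, 116), (120, 119)]
kept = [(b, s) for (b, s) in pairs if s >= 78]  # drop Basis readings below 78

biopac_all, basis_all = zip(*pairs)
biopac_kept, basis_kept = zip(*kept)
print(pearson_r(biopac_all, basis_all), pearson_r(biopac_kept, basis_kept))
print(spearman_r(biopac_all, basis_all), spearman_r(biopac_kept, basis_kept))
```

In practice, scipy.stats.pearsonr and scipy.stats.spearmanr do the same job and also return the p-values; the hand-written versions are here only to keep the example free of dependencies.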

To conclude: if you have the plots but not the underlying data, you can always hack the plots. Basis bands are OK at measuring a high heart rate (which is important when exercising!), but the data may get messier at lower heart rates. Good hacking!


UPD: Python script and the plot used above are available on GitHub.
