Graphic image of a double-helix of DNA against a background of numerical data.

Graphic by Sergey Nivens, Shutterstock.

Stanford Medicine Scope - June 8th, 2017 - by Jennie Dusheck

In the last year, each of half a dozen stories we have done on the work of Purvesh Khatri, PhD, assistant professor of biomedical informatics at Stanford, has been a variation on the same theme — pulling gene expression gold from the dross of dirty data.

Khatri has developed an ingenious method for evaluating the expression of human genes in response to different diseases or conditions, one that is fueling efforts to identify cheap and effective ways to diagnose tuberculosis or to predict which patients will most likely reject transplanted organs. Khatri can also distinguish a viral infection from a bacterial infection based on which genes the human body is expressing.

And once he knows the gene expression profile for a particular disease — the gold — that profile can be used clinically to diagnose or help individual patients.

But how exactly does he do it?

This week, Quanta published a Q&A with Khatri discussing his groundbreaking technique and the combination of serendipity and personal determination that led him to the discovery.

In particular, the article discusses how using “dirty data” with lots of noise actually produces better results than using “clean data” refined with traditional meta-analysis techniques. An excerpt:

I don’t quite get how your approach differs from traditional meta-analysis. What’s fundamentally different?

The biggest difference is that our group ignores heterogeneity across data sets, whereas in traditional meta-analysis we are taught to reduce heterogeneity.

People say, for example, ‘I’m not going to use this sample because that patient had a different drug treatment. Or maybe these patients were early post-transplant whereas this other data set is late, five years after transplant, so I’m not going to use that data.’ In bioinformatics, we have learned to take data sets and select samples making sure there is no noise, no confounding factors.

But when we do this, it does not capture the heterogeneity of the disease. We know that. That’s why we have to replicate the findings in other cohorts.

What I’m saying is, don’t worry about the heterogeneity. Using dirty data allows you to account for clinical heterogeneity.

But to be sure that heterogeneity wasn’t going to screw up my results, I set stringent criteria for validating that statistical associations we found between genes and medical conditions were not flukes. Validation had to be done in an independent cohort that was not part of the discovery set. In other words, if a lab had more than one data set published, I made each data set either a discovery or a validation cohort a priori.
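The a priori split described at the end of the excerpt can be illustrated with a short Python sketch. This is not Khatri's code: the function name `assign_cohorts`, the data set labels, and the choice of a seeded random shuffle are all assumptions made for illustration. The only idea taken from the excerpt is that each whole data set is given exactly one role, discovery or validation, before any analysis is run.

```python
import random


def assign_cohorts(dataset_ids, n_discovery, seed=0):
    """Assign each whole data set to exactly one role before any analysis.

    Hypothetical helper: the seeded shuffle is an assumption, not the
    published method. The point is that no data set serves both roles.
    """
    ids = sorted(set(dataset_ids))
    if not 0 < n_discovery < len(ids):
        raise ValueError("need at least one discovery and one validation data set")
    rng = random.Random(seed)  # fixed seed so the split is decided once, up front
    rng.shuffle(ids)
    return {
        "discovery": sorted(ids[:n_discovery]),
        "validation": sorted(ids[n_discovery:]),
    }


# Hypothetical data set labels, for illustration only.
split = assign_cohorts(["A", "B", "C", "D", "E"], n_discovery=3)
print(split)
```

Because the assignment is fixed before looking at any results, a gene signature found in the discovery data sets can then be checked against validation data sets that played no part in finding it.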

It’s a fun read that offers a behind-the-scenes glimpse at the importance of big data.

Originally published at Stanford Medicine Scope Blog