Explosion of Big Data, But Scientists Can't Keep Up

Biomedical research is going big-time: Megaprojects that collect vast stores of data are proliferating rapidly. But scientists' ability to make sense of all that information isn't keeping up.

This conundrum took center stage at a meeting of patient advocates, called Partnering For Cures, in New York City on Nov. 15.

On the one hand, there's an embarrassment of riches, as billions of dollars are spent on these megaprojects.

There's the White House's Cancer Moonshot (which seeks to make 10 years of progress in cancer research over the next five years), the Precision Medicine Initiative (which is trying to recruit a million Americans to glean hints about health and disease from their data), the BRAIN Initiative (to map the neural circuits and understand the mechanics of thought and memory) and the International Human Cell Atlas Initiative (to identify and describe all human cell types).

"It's not just that any one data repository is growing exponentially, the number of data repositories is growing exponentially," said Dr. Atul Butte, who leads the Institute for Computational Health Sciences at the University of California, San Francisco.

One of the most remarkable efforts is the federal government's push to get doctors and hospitals to put medical records in digital form. That shift to electronic records is costing billions of dollars, including more than $28 billion in federal incentives alone to hospitals, doctors and others to adopt them. The investment is creating a vast data repository that could potentially be mined for clues about health and disease, much as websites and merchants gather data about you to personalize the online ads you see and for other commercial purposes.

But, unlike the data scientists at Google and Facebook, medical researchers have done almost nothing as yet to systematically analyze the information in these records, Butte said. "As a country, I think we're investing close to zero analyzing any of that data," he said.

Prospecting for hints about health and disease isn't going to be easy. The raw data aren't very robust or reliable. Electronic medical records are often kept in databases that aren't compatible with one another, at least not without a struggle. Some of the potentially revealing details are also kept as free-form notes, which can be hard to extract and interpret. Errors commonly creep into these records.

And data culled from scientific studies aren't entirely trustworthy, either.

"So many articles that are published today are going to be wrong in 10 years," said Greg Simon, who leads the Cancer Moonshot. "That's just the history of scientific research, and the question is you just don't know which ones are going to be wrong."

Scientists trying to figure out how to analyze that flood of big data are going to have to cut through the dissonance to find a melody. That takes skill.

"In a world when anything is possible because you have so much data, how do you figure out who has done the math right?" asked Food and Drug Administration Commissioner Robert Califf.

He said the only way to know for sure is to take ideas gleaned from the big datasets and then try them out in people. That means persuading patients to participate in studies.

Just a small percentage do today, "and what we're seeing in our best academic centers, the clinicians say they don't have time to talk to patients about participating in studies," Califf said. "So, far and away this is our No. 1 issue that we're focused on with big data."

These problems aren't just abstractions for Sonia Vallabh. Her mother died in middle age of a rare, fatal genetic condition called prion disease. Vallabh carries the same mutation that afflicted her mother. She quit her job as a lawyer and is now seeking a doctorate in biological and biomedical sciences at the Broad Institute in Cambridge, Mass.

Vallabh turned to a huge data set of genetic information to see what she could learn about her condition. "It basically confirmed what we thought we knew about my genetic mutation, which is it makes me almost 100 percent likely to die this way by midlife," she said.

But the data also yielded a surprise. Her disease is caused by having too much of a certain protein in her body. And some people with only half as much of this dangerous protein didn't get sick and die.

"So, here's an experiment of nature handed to us on a platter by big data, that says if we can find a way to turn down this disease protein, this protein that wants to kill me, that should be a safe way to delay or prevent disease."

But that's not a question to be answered through data-crunching. Vallabh needs the old-fashioned kind of medical research — laboratory and clinical science — to develop a drug that would reduce the protein safely and effectively.

You can email Richard Harris at rharris@npr.org.
