The Case Against 'Statistical Significance' in Scientific Research

A recent study that questioned the healthfulness of eggs raised a perpetual question: Why do studies, as has been the case with health research involving eggs, so often flip-flop from one answer to another?

The truth isn’t changing all the time. But one reason for the fluctuations is that scientists have a hard time handling the uncertainty that’s inherent in all studies. There’s a new push to address this shortcoming in a widely used – and abused – scientific method.

Scientists and statisticians are putting forth a bold idea: Ban the very concept of “statistical significance.”

We hear that phrase all the time in relation to scientific studies. Critics, who are numerous, say that declaring a result to be statistically significant or not essentially forces complicated questions to be answered as true or false.

“The world is much more uncertain than that,” says Nicole Lazar, a professor of statistics at the University of Georgia. She is involved in the latest push to ban the use of the term “statistical significance.”

An entire issue of the journal The American Statistician is devoted to this question, with 43 articles and a 17,500-word editorial that Lazar co-authored.

Some of the scientists involved in that effort also wrote a more digestible commentary that appears in Thursday’s issue of Nature. More than 850 scientists and statisticians told the Nature commentary authors they want to endorse this idea.

In the early 20th century, the father of statistics, R.A. Fisher, developed a test of significance. It involves a variable called the p-value, which he intended to be a guide for judging results.

Over the years, scientists have warped that idea beyond all recognition. They’ve created an arbitrary threshold for the p-value, typically 0.05, and they use that to declare whether a scientific result is significant or not.
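To see how sharp that cutoff is, consider a minimal sketch in Python. It is an illustration only, not anything taken from the studies discussed here: the two "studies" and their z statistics are hypothetical, and the p-value is computed under the simplifying assumption of a standard normal test statistic.

```python
import math

ALPHA = 0.05  # the conventional threshold described above


def two_sided_p_value(z):
    """Two-sided p-value for a z statistic, assuming a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def verdict(z, alpha=ALPHA):
    """Return the p-value and the yes/no label the threshold assigns to it."""
    p = two_sided_p_value(z)
    label = "significant" if p < alpha else "not significant"
    return p, label


# Two hypothetical studies whose evidence is almost identical.
for name, z in [("Study A", 1.97), ("Study B", 1.95)]:
    p, label = verdict(z)
    print(f"{name}: z = {z:.2f}, p = {p:.4f} -> {label}")
```

The two hypothetical results sit a hair apart, yet the threshold puts them on opposite sides of the line, which is the true-or-false forcing that critics object to.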

This shortcut often determines whether studies get published or not, whether scientists get promoted and who gets grant funding.

“It’s really gotten stretched all out of proportion,” says Ron Wasserstein, the executive director of the American Statistical Association. He’s been advocating this change for years and he’s not alone.

“Failure to make these changes are really now starting to have a sustained negative impact on how science is conducted,” he says. “It’s time to start making the changes. It’s time to move on.”

There are many downsides to relying on the threshold, he says. One is that scientists have been known to massage their data to make their results hit this magic threshold. Arguably worse, scientists often find that they can’t publish their interesting (if somewhat ambiguous) results if they aren’t statistically significant. But that information is actually still useful, and advocates say it’s wasteful simply to throw it away.

There are some prominent voices in the world of statistics who reject the call to abolish the term “statistical significance.”

“Nature ought to invite somebody to bring out the weakness and dangers of some of these recommendations,” says Deborah Mayo, a philosopher of science at Virginia Tech.

“Banning the word ‘significance’ may well free researchers from being held accountable when they downplay negative results” and otherwise manipulate their findings, she notes.

“We should be very wary of giving up on something that allows us to hold researchers accountable.”

Her desire to keep “statistical significance” reflects how deeply embedded the concept is.

Scientists – like the rest of us – are far more likely to believe that a result is true if it’s statistically significant. Still, Blake McShane, a statistician at the Kellogg School of Management at Northwestern University, says we put far too much faith in the concept.

“All statistics naturally bounce around quite a lot from study to study to study,” McShane says. That’s because there’s lots of variation from one group of people to another, and also because subtle differences in approach can lead to different conclusions.

So, he says, we shouldn’t be at all surprised if a result that’s statistically significant in one study doesn’t meet that threshold in the next.
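A small simulation can make McShane's point concrete. The sketch below is hypothetical throughout: the true effect (0.3 standard deviations), the group size (50 people per group), the known standard deviation of 1 and the simple z-test are all assumptions chosen for illustration, not figures from any study mentioned here. Every simulated study measures exactly the same real effect.

```python
import math
import random
from statistics import fmean


def two_sided_p_value(z):
    """Two-sided p-value for a z statistic, assuming a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def run_study(rng, true_effect=0.3, n_per_group=50):
    """Simulate one two-group study; outcomes are assumed to have a known SD of 1."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    diff = fmean(treated) - fmean(control)
    std_error = math.sqrt(2.0 / n_per_group)  # SE of a difference in means, SD = 1
    return diff, two_sided_p_value(diff / std_error)


def simulate(n_studies=1000, seed=0):
    """Repeat the same hypothetical study many times with a fixed random seed."""
    rng = random.Random(seed)
    return [run_study(rng) for _ in range(n_studies)]


results = simulate()
share_significant = sum(p < 0.05 for _, p in results) / len(results)
print(f"Studies reaching p < 0.05: {share_significant:.0%}")
```

Under these assumptions only a minority of the simulated studies clear the 0.05 bar, even though the underlying effect never changes. A "significant" study followed by a "non-significant" one is what chance alone would often produce.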

McShane, who co-authored the Nature commentary, says this phenomenon also partly explains why studies done in one lab are frequently not reproduced in other labs. This is sometimes referred to as the “reproducibility crisis,” when in fact, the apparent conflict between studies may be an artifact of relying on the concept of statistical significance.

But despite these flaws, science embraces statistical significance because it’s a shortcut that provides at least some insight into the strength of an observation.

Journals are reluctant to abandon the concept. “Nature is not seeking to change how it considers statistical analysis in evaluation of papers at this time,” the journal noted in an editorial that accompanies the commentary.

Veronique Kiermer, publisher and executive editor of the PLOS journals, bemoans the over-reliance on statistical significance, but says her journals don’t have the leverage to force a change.

“The problem is that the practice is so ingrained in the research community,” she writes in an email, “that change needs to start there, when hypotheses are formulated, experiments designed and analyzed, and when researchers decide whether to write up and publish their work.”

One problem is what scientists would use instead of statistical significance. The advocates for change say the community can still use the p-value test, but as part of a broader approach to measuring uncertainty.
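One common way to do that, offered here as an illustrative sketch rather than as the advocates' specific recommendation, is to report the estimated effect and an interval around it alongside the p-value, with no verdict attached. The function name `summarize` and the example numbers are hypothetical, and the interval assumes the estimate is approximately normally distributed.

```python
import math
from statistics import NormalDist


def summarize(estimate, std_error, level=0.95):
    """Report an estimate with a normal-approximation interval and its p-value."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    low = estimate - z_crit * std_error
    high = estimate + z_crit * std_error
    p_value = math.erfc(abs(estimate / std_error) / math.sqrt(2))
    return {"estimate": estimate, "ci": (low, high), "p_value": p_value}


# Hypothetical result: an effect of 0.3 measured with a standard error of 0.2.
report = summarize(0.3, 0.2)
low, high = report["ci"]
print(f"Estimate {report['estimate']:.2f}, "
      f"95% interval {low:.2f} to {high:.2f}, p = {report['p_value']:.3f}")
```

Read this way, the hypothetical result is neither a success nor a failure. It is an effect that could plausibly be near zero or fairly large, which says more than "not significant" does.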

A bit more humility would also be in order, these advocates for change say.

“Uncertainty is present always,” Wasserstein says. “That’s part of science. So rather than trying to dance around it, we [should] accept it.”

That goes a bit against human nature. After all, we want answers, not more questions.

But McShane says arriving at a yes/no answer about whether to eat eggs is too simplistic. If we step beyond that, we can ask more important questions. How big is the risk? How likely is it to be real? What are the costs and benefits to an individual?
