Moving Science’s Statistical Goalposts
In 1989, Ralph Rosnow and Robert Rosenthal, two well-respected experts on statistical methods in psychology, wrote the following memorable line: “We want to underscore that, surely, God loves the .06 nearly as much as the .05” (p. 1277).
For researchers in psychology—as well as in the biological and social sciences—this was an amusing statement because .05 is the Holy Grail of statistical significance. It may seem unusual to use religious language when writing about scientific methods, but the metaphor is fitting because, for almost as long as scientists have used statistical methods, achieving a probability of .05 or less (e.g., .04, .027, .004) meant publication, academic success, and another step toward the financial security of tenure. But .06 or even .055 meant nothing. No publication and no progress toward a comfortable retirement.
Rosnow and Rosenthal were arguing that scientists had been overly concerned with a single, arbitrary cut-off score, p < .05, but today their plea sounds a bit antique. In the latest response to the “reproducibility crisis” in psychology (see my December 2015 online column, “Has Science a Problem?”), a group of seventy-two accomplished statisticians, biologists, and social scientists has signed a statement proposing that the criterion be changed from .05 to .005. This may seem like a nerdy technical issue, but the proposed change has profound implications for the progress of science and has ignited a vigorous controversy in the field. But I’m getting ahead of myself. Let’s step back and figure out what this is all about.
The Reverse Logic of Statistical Significance
The idea came from the British biologist and statistician Ronald Fisher, a man Richard Dawkins has called “the greatest biologist since Darwin” (The Edge 2011). Fisher invented many statistical techniques—including the analysis of variance (ANOVA)—and learning to use his methods has been the bane of graduate students in biology, psychology, and many other disciplines ever since.
Fisher recognized that you cannot prove a hypothesis by affirming the consequent. Scientists are commonly in the position of wanting to prove that a variable they are interested in causes something to happen. For example, imagine a chemist has identified Compound X, which she believes will promote hair growth in balding humans. She creates a Compound X lotion and a placebo lotion and conducts an experiment on balding volunteers. Lo and behold, the people in her Compound X group grow more hair than those in the placebo group. If her experiment was otherwise well designed and conducted, can she safely conclude Compound X grows hair? Of course not. It might just have been a lucky test, and furthermore no number of positive tests can prove the rule.
Understanding this, Fisher proposed to turn the question around, creating a straw person that could be knocked over with statistics. What if we assume the opposite: Compound X has no effect at all? This idea is what Fisher called the “null hypothesis.” The hypothesis of nothing going on. No effect. Then, he reasoned, we can determine the probability of our experimental findings happening merely by chance. This is where ANOVA and other statistical methods come in. Fisher suggested that statistical tests could be used to estimate the kinds of outcomes that would be expected due to random variations in the data. If the result obtained in an experiment was very unlikely to occur by chance, then the researcher would be justified in rejecting the null hypothesis of no effect. It would be fairly safe to say something real was going on. For example, if our chemist’s statistical analysis found the probability of getting the amount of hair growth she saw in the Compound X group to be .04—meaning that by chance alone we could expect the same result in only 4 of 100 similarly run experiments—then it would be reasonable for her to conclude Compound X really worked.
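Fisher’s reverse logic can be sketched in a few lines of code. All the numbers below are made up for illustration (a hypothetical half-centimeter hair-growth benefit, fifty volunteers per group), and scipy’s two-sample t-test stands in for whatever analysis our chemist might actually run:

```python
# A sketch of Fisher's logic with made-up numbers: we pretend Compound X
# adds 0.5 cm of hair growth on average and simulate one experiment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 50  # hypothetical volunteers per group

placebo = rng.normal(loc=1.0, scale=1.0, size=n)     # growth in cm
compound_x = rng.normal(loc=1.5, scale=1.0, size=n)  # assumed true effect: +0.5 cm

# The t-test asks: if the null hypothesis were true (no effect), how
# likely is a difference between groups at least this large?
t_stat, p_value = stats.ttest_ind(compound_x, placebo)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: something real seems to be going on.")
```

If p comes out below the conventional cut-off, our chemist rejects the null hypothesis. Note that she has not proven Compound X works; she has only shown that her result would be rare if it didn’t.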
So was born the Null Hypothesis Significance Test (NHST), and before long it was the law of the land. Fisher proposed a probability (p-value) of .05, and it came to pass that researchers in a variety of fields could not expect to get their articles published unless they performed the appropriate statistical tests and found that p was < .05 (less than a 5 percent chance of obtaining such a result if the null hypothesis were true). Point zero five became the accepted knife edge of success: p = .055 meant your results were junk, and p = .048 meant you could pop the champagne. Point zero five was not a magical number; it just became the accepted convention—a convention that Rosnow and Rosenthal, as well as others (e.g., Cohen 1990), have criticized to no avail. It remains a fossilized criterion for separating the wheat from the chaff. But perhaps not for long.
Strengthening Scientific Evidence
Statistics is one of those fields that have been plagued by a number of dustups over the years, but this latest controversy over p-values was inspired by the reproducibility crisis, the discovery that many classic experiments—primarily in social and cognitive psychology—could not be reproduced when the studies were repeated. The resulting loss of confidence in research findings has spawned a number of reforms, most notably the Open Science movement, about which I wrote in my December 2016 online column, “The Parable of the Power Pose and How to Reverse It.” The open science approach makes research a much more public and collaborative enterprise and makes the common practice of “p-hacking”—fiddling with your data until something significant pops out—much more difficult.
Then, in July of 2017, seventy-two scientists (Benjamin et al. 2017) published a proposal to simply make the criterion for significance more stringent by moving the cut-off from 5 out of 100 to 5 out of 1,000 (p < .005). This would directly address one of the issues the authors presume to be a cause of the reproducibility problem: Type I error. When we say something is statistically significant, we are simply saying that it is unlikely the observed effect is due to chance. But it’s not impossible. By definition, the choice of the .05 significance level means that we are willing to live with a 5 percent chance of an error—saying there is an effect when actually our results were caused by normal random variation. This is a kind of false positive result. For example, if Compound X is useless, our choice of the .05 criterion means that in five tests out of a hundred, we would conclude something was happening when nothing was. And when, as is most often the case, we conduct just a single test, how do we know whether ours is one of the five random cases that just happened to make Compound X look like it works? Based on a single test, we really can’t say.
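That 5 percent false positive rate is easy to demonstrate by simulation. In this hedged sketch, Compound X is useless by construction (both groups are drawn from the same distribution), yet roughly one experiment in twenty still comes out “significant”:

```python
# Simulate many experiments in which the null hypothesis is TRUE
# (Compound X does nothing) and count how often p < .05 anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    placebo = rng.normal(1.0, 1.0, 50)
    compound_x = rng.normal(1.0, 1.0, 50)  # same mean: no real effect
    _, p = stats.ttest_ind(compound_x, placebo)
    if p < 0.05:
        false_positives += 1

# By design, about 5% of these null-true experiments look "significant."
print(f"False positive rate: {false_positives / n_experiments:.3f}")
```

The rate hovers around .05 no matter how many experiments we simulate; the .05 convention simply is the agreed-upon false positive rate.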
Moving the statistical criterion to .005 would make the chances of a Type I error much lower, which means that the findings that reach publication would be more trustworthy and more likely reproducible. This is a worthy goal, and there is no easier fix than simply requiring a stronger test. But such a change would not be without costs. When you reduce the chances of a Type I error, you increase the chances of a different kind of error with the clever name of Type II. Type II errors occur when the effect you are studying is real but your test fails to show it. Compound X really can grow hair, but by chance, your test came up nonsignificant: p > .05. A kind of false negative result. The change to .005 proposed by the gang of seventy-two would make false negatives more common, and given the often enormous time and cost involved in modern research, important results that would contribute to our knowledge base would likely die on the vine. Ultimately, the progress of science would be slowed.
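The tradeoff shows up clearly in a companion simulation. Here Compound X works by construction (the same hypothetical half-centimeter benefit as before), and we count how often each threshold commits a Type II error and misses the real effect:

```python
# Simulate experiments in which Compound X has a REAL but modest effect,
# then count how often each significance threshold misses it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 5_000
p_values = []

for _ in range(n_experiments):
    placebo = rng.normal(1.0, 1.0, 50)
    compound_x = rng.normal(1.5, 1.0, 50)  # assumed true effect: +0.5
    _, p = stats.ttest_ind(compound_x, placebo)
    p_values.append(p)

p_values = np.array(p_values)
miss_05 = np.mean(p_values >= 0.05)    # Type II error rate at .05
miss_005 = np.mean(p_values >= 0.005)  # Type II error rate at .005
print(f"Real effects missed at .05:  {miss_05:.2%}")
print(f"Real effects missed at .005: {miss_005:.2%}")
```

Under these assumed conditions, the stricter threshold misses the real effect in roughly twice as many experiments, which is exactly the worry about results dying on the vine.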
Perhaps anticipating this objection, the seventy-two authors propose that results falling between the traditional .05 and the new .005 might still be published as “suggestive” rather than statistically significant. Furthermore, the < .005 criterion would only be applied to tests of new phenomena, and replications of previously published studies could remain at .05. But it is clear the change would still have a powerful impact. As someone on one of my Internet discussion groups suggested, this change might make many psychology journals much thinner than they are now. I haven’t checked, but I am fairly certain several of my own published studies would have to be demoted to the “suggestive” category.
The proposal by the seventy-two researchers is scheduled to appear in a future issue of the journal Nature Human Behaviour, but the prepublication copy posted online has already attracted much publicity and considerable comment from the scientific community. Many researchers have welcomed the .005 suggestion, which has been offered by others in the past (Resnick 2017), but there has also been some pushback. Psychologist Daniel Lakens is organizing a group rebuttal. According to an article in Vox, one of Lakens’s primary objections is that such a change will slow the progress of science. It may strengthen the published research literature at the cost of discouraging graduate students and other investigators who have limited resources.
Typically, the most effective way to increase the power of a statistical test and the likelihood of reaching a stringent significance level, such as < .005, is to increase the number of participants in your study. Although the Internet provides new opportunities for collecting large amounts of survey data, consider the difficulties of developmental psychologists who study infant behavior in the laboratory. As important as this work is, it might end up being limited to a few well-funded centers. Psychologist Timothy Bates, writing in Medium, makes the more general argument that a cost-benefit analysis of the switch to .005 comes up short. Research will become much more expensive without, in his view, yielding equal benefit (Bates 2017).
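A back-of-envelope calculation shows why the costs rise so quickly. Using the standard normal-approximation formula for a two-group comparison, n per group = 2((z_alpha + z_beta) / d)^2, with an assumed medium effect size (Cohen’s d = 0.5) and the usual 80 percent power target, the tighter threshold demands roughly 70 percent more participants:

```python
# Back-of-envelope sample-size estimate (normal approximation) for a
# two-sample comparison at 80% power, assuming an effect of d = 0.5.
from scipy.stats import norm

def n_per_group(alpha, power=0.80, d=0.5):
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)           # quantile for the power target
    return 2 * ((z_alpha + z_beta) / d) ** 2

for alpha in (0.05, 0.005):
    print(f"alpha = {alpha}: about {n_per_group(alpha):.0f} participants per group")
```

Under these illustrative assumptions, the requirement jumps from roughly 63 to roughly 107 participants per group; for an infant-behavior lab recruiting families one at a time, that is a very different study.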
Finally, there is the question of placing too much emphasis on one issue. John Ioannidis is a Stanford University statistician and health researcher whose 2005 paper “Why Most Published Research Findings Are False” is a classic document of the reproducibility movement. He is also one of the seventy-two signers of the .005 proposal. But Ioannidis admits that statistical significance is not the only way to judge a study: “statistical significance [alone] doesn’t convey much about the meaning, the importance, the clinical value, utility [of research]” (quoted in Resnick 2017). Even if Compound X produced a significant increase in hair growth (p < .005), the hair growth in question might not be noticeable enough to make the treatment worth using. Under the proposed, more stringent statistical criteria, researchers might concentrate on getting results that are likely to get them over the statistical goal line and pass up more meaningful topics.
Will It Happen?
So how likely is it that the goalposts will get moved and journals will begin to require a p < .005 level of significance? I suspect the probability is low. Not five chances out of a thousand, but less than 50:50. There are good reasons for making this correction, particularly at this time when confidence in social science research is at a low point. But, for a couple of reasons, I don’t think it will happen.
First, because this controversy is relatively new, the objectors are still developing their responses. As the conversation gets going, I suspect we will hear more concerns about Type II errors—real phenomena that will be missed because they fail to meet the p < .005 criterion—and about discouraging young investigators from going into research. No one is in favor of that.
But I suspect one of the biggest sources of resistance will be based in economics rather than in arcane technical and professional issues. It turns out that academic publishing is enormously profitable—a $19 billion worldwide business that, according to The Guardian, is far more profitable than either the film or music industries (Buranyi 2017). A large portion of this success comes from a unique business model in which the product being sold—academic scholarship—is obtained essentially for free. Research that often costs millions of dollars in research grants and salaries to produce is handed to publishers such as Elsevier or Springer for nothing. Even peer reviews of submitted manuscripts are done by researchers who donate their time for free.
Academic publishing is a kind of crazy incestuous feedback loop. Researchers must get their work published in high quality journals if they wish to advance their careers, and scholars and university libraries must pay exorbitant subscription costs to gain access to the same journals. The current system is being threatened by Sci-Hub, a pirated archive of scientific publications, and by a growing number of scholars who are posting prepublication versions of their work online (Rathi 2017). Eventually, all scientific publishing may be entirely free and open, but until that day, the publishing industry will continue to be hungry for material. As a result, the prospect of much less content and thinner journals will not be a welcome thought, and I suspect this attitude will be communicated down the line to editors who must choose whether to adopt the new standard or not. The publishing gods love .05 much more than .005.
- Bates, Timothy. 2017. "Changing the default p-value threshold for statistical significance ought not be done, and is the least of our problems." Medium. July 23. Accessed August 13, 2017. Available online at https://medium.com/@timothycbates/changing-the-default-p-value-threshold-for-statistical-significance-ought-not-be-done-in-isolation-3a7ab357b5c1.
- Benjamin, Daniel J., James Berger, Magnus Johannesson, Brian A. Nosek, Eric-Jan Wagenmakers, Richard Berk, Kenneth Bollen, et al. 2017. "Redefine Statistical Significance." PsyArXiv. July 22. Available online at osf.io/preprints/psyarxiv/mky9j.
- Buranyi, Stephen. 2017. "Is the staggeringly profitable business of scientific publishing bad for science?" The Guardian. June 27. Accessed August 14, 2017. Available online at https://www.theguardian.com/science/2017/jun/27/profitable-business-scientific-publishing-bad-for-science.
- Cohen, Jacob. 1990. "Things I have learned (so far)." American Psychologist 45, no. 12: 1304.
- The Edge. 2011. “Who is the greatest biologist of all time?” Edge.org. March 11. Accessed August 13, 2017. Available online at https://www.edge.org/conversation/who-is-the-greatest-biologist-of-all-time.
- Ioannidis, John P. A. 2005. "Why most published research findings are false." PLoS Medicine 2, no. 8: e124. doi:10.1371/journal.pmed.0020124.
- Rathi, Akshat. 2017. "Soon, nobody will read academic journals illegally - the studies worth reading will be free." Quartz. August 09. Accessed August 14, 2017. Available online at https://qz.com/1049870/half-the-time-unpaywall-users-search-for-articles-that-are-legally-free-to-access/.
- Resnick, Brian. 2017. "What a nerdy debate about p-values shows about science - and how to fix it." Vox. July 31. Accessed August 13, 2017. Available online at https://www.vox.com/science-and-health/2017/7/31/16021654/p-values-statistical-significance-redefine-0005.
- Rosnow, Ralph L., and Robert Rosenthal. 1989. "Statistical procedures and the justification of knowledge in psychological science." American Psychologist 44, no. 10: 1276.
- Vyse, Stuart. 2016. "The Parable of the Power Pose and How to Reverse It." CSI. December 15. Accessed August 13, 2017. Available online at http://www.csicop.org/specialarticles/show/the_parable_of_the_power_pose.
- Vyse, Stuart. 2015. "Has Science a Problem?" CSI. June 18. Accessed August 13, 2017. Available online at http://www.csicop.org/specialarticles/show/has_science_a_problem.