Tuesday, April 10, 2018

Should the 5% Convention for Statistical Significance be Dramatically Lower?

For the uninitiated, the idea of "statistical significance" may seem drier than desert sand. But it's how research in the social sciences and medicine decides which findings are worth paying attention to as plausibly true--and which are not. For that reason, it matters quite a bit. Here, I'll sketch a quick overview for beginners of what statistical significance means, and why there is controversy among statisticians and researchers over what research results should be regarded as meaningful or new.

To gain some intuition, consider an experiment to decide whether a coin is equally balanced, or whether it is weighted toward coming up "heads." You toss the coin once, and it comes up heads. Does this result prove, in a statistical sense, that the coin is unfair? Obviously not. Even a fair coin will come up heads half the time, after all.

You toss the coin again, and it comes up "heads" again. Do two heads in a row prove that the coin is unfair? Not really. After all, if you toss a fair coin twice in a row, there are four possibilities: HH, HT, TH, TT. Thus, two heads will happen one-fourth of the time with a fair coin, just by chance.

What about three heads in a row? Or four or five or six or more? You can never completely rule out the possibility that a string of heads, even a long string of heads, could happen entirely by chance. But as you get more and more heads in a row, a finding that is all heads, or mostly heads, becomes increasingly unlikely. At some point, it becomes very unlikely indeed.  

Thus, a researcher must make a decision. At what point are the results sufficiently unlikely to have happened by chance that we can declare them meaningful? The conventional answer is that if the observed result had a 5% probability or less of happening by chance, then it is judged to be "statistically significant." Of course, real-world questions of whether a certain intervention in a school will raise test scores, or whether a certain drug will help treat a medical condition, are a lot more complicated to analyze than coin flips. Thus, practical researchers spend a lot of time trying to figure out whether a given result is "statistically significant" or not.
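For readers who like to see the arithmetic spelled out, here is a quick sketch in Python of the coin-flip logic: the chance that a fair coin comes up heads on every one of n tosses is 0.5 raised to the n, and the snippet simply looks for the shortest run of straight heads that slips under the 5% cutoff. (The code is purely illustrative, not taken from any of the studies discussed here.)

```python
# Chance that a fair coin comes up heads on every one of n tosses: 0.5 ** n
def prob_all_heads(n):
    return 0.5 ** n

for n in range(1, 9):
    p = prob_all_heads(n)
    note = "  <-- below the 5% threshold" if p < 0.05 else ""
    print(f"{n} heads in a row: probability {p:.4f}{note}")

# Four straight heads has probability 0.0625, still above 5%; five straight
# heads has probability 0.03125, so that is the first run a researcher using
# the conventional cutoff would call "statistically significant."
```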

Several questions arise here.

1) Why 5%? Why not 10%? Or 1%? The short answer is "tradition." A couple of years ago, the American Statistical Association put together a panel to reconsider the 5% standard.

Ronald L. Wasserstein and Nicole A. Lazar wrote a short article, "The ASA's Statement on p-Values: Context, Process, and Purpose," in The American Statistician (2016, 70:2, pp. 129-132). (A p-value is the probability that a result at least as extreme as the observed one would arise by chance alone; it is the number compared against the threshold for statistical significance.) They started with this anecdote:
"In February 2014, George Cobb, Professor Emeritus of Mathematics and Statistics at Mount Holyoke College, posed these questions to an ASA discussion forum:
Q:Why do so many colleges and grad schools teach p = 0.05?
A: Because that’s still what the scientific community and journal editors use.
Q:Why do so many people still use p = 0.05?
A: Because that’s what they were taught in college or grad school.
Cobb’s concern was a long-worrisome circularity in the sociology of science based on the use of bright lines such as p<0.05: “We teach it because it’s what we do; we do it because it’s what
we teach.”

But that said, there's nothing magic about the 5% threshold. It's fairly common for academic papers to report results that are statistically significant using a threshold of 10%, or of 1%. Confidence in a statistical result isn't a binary, yes-or-no situation, but rather a continuum.

2) There's a difference between statistical confidence in a result and the size of the effect in the study. As a hypothetical example, imagine a study which says that if math teachers used a certain curriculum, learning in math would rise by 40%. However, the study included only 20 students.

In a strict statistical sense, the result may not be statistically significant: with a fairly small number of students, and the complexities of accounting for other factors that might have affected the results, it could have happened by chance. (This is similar to the problem that if you flip a coin only two or three times, you don't have enough information to state with statistical confidence whether it is a fair coin or not.) But it would seem peculiar to ignore a result that shows a large effect. A more natural response might be to design a bigger study with more students, and see whether the large effects hold up and are statistically significant in that bigger study.

Conversely, one can imagine a hypothetical study which uses results from 100,000 students, and finds that if math teachers use a certain curriculum, learning in math would rise by 4%. Let's say that the researcher can show that the effect is statistically significant at the 5% level--that is, there is less than a 5% chance that this rise in math performance happened by chance. It's still true that the rise is fairly small in size. 

In other words, it can sometimes be more encouraging to discover a large result in which you do not have full statistical confidence than to discover a small result in which you do have statistical confidence.
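To put rough numbers on that contrast, here is a small Python sketch that computes approximate p-values for the two hypothetical studies above. The 40% and 4% gains come from the text, but the assumed student-level standard deviation of 100 is invented purely for illustration, so the exact p-values should not be taken literally.

```python
from math import sqrt
from scipy.stats import norm

def two_sided_p(mean_gain, sd, n):
    """Approximate p-value for asking: is the average gain different from zero?"""
    se = sd / sqrt(n)           # standard error of the estimated gain
    z = mean_gain / se          # how many standard errors the gain is from zero
    return 2 * norm.sf(abs(z))  # two-sided tail probability

# Small study: big effect, but too few students to be significant at 5%.
print(two_sided_p(mean_gain=40, sd=100, n=20))       # roughly 0.07

# Huge study: small effect, but overwhelmingly significant.
print(two_sided_p(mean_gain=4, sd=100, n=100_000))   # far below 0.05
```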

3) When a researcher knows that 5% is going to be the dividing line between a result being treated as meaningful or not meaningful, it becomes very tempting to fiddle around with the calculations (whether explicitly or implicitly) until you get a result that seems to be statistically significant.

As an example, imagine a study that considers whether early childhood education has positive effects on outcomes later in life. Any researcher doing such a study will be faced with a number of choices. Not all early childhood education programs are the same, so one may want to adjust for factors like the teacher-student ratio, training received by students, amount spent per student, whether the program included meals, home visits, and other factors. Not all children are the same, so one may want to look at factors like family structure, health,  gender, siblings, neighborhood, and other factors. Not all later life outcomes are the same, so one may want to look at test scores, grades, high school graduation rates, college attendance, criminal behavior, teen pregnancy, and employment and wages later in life.

But a problem arises here. If a researcher hunts through all the possible factors, and all the possible combinations of all the possible factors, there are literally scores or hundreds of possible connections. Just by blind chance, some of these connections will appear to be statistically significant. It's similar to the situation where you do 1,000 repetitions of flipping a coin 10 times. In those 1,000 repetitions, heads is likely to come up 8 or 9 times out of 10 tosses at least a few times. But that doesn't prove the coin is unfair! It just proves you tried over and over until you got a specific result.
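The coin-flip version of this problem is easy to simulate. The short Python sketch below runs the 1,000-repetitions-of-10-flips experiment with a perfectly fair coin and counts how many repetitions come up suspiciously head-heavy anyway (the seed is fixed only so the illustration is reproducible):

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible
lucky_runs = 0
for _ in range(1000):                 # 1,000 repetitions of the experiment
    heads = sum(random.random() < 0.5 for _ in range(10))  # 10 fair flips
    if heads >= 8:                    # 8, 9, or 10 heads out of 10
        lucky_runs += 1

# The exact binomial probability of 8+ heads in 10 fair flips is about 5.5%,
# so something like 50 of the 1,000 repetitions will look "unfair" by chance.
print(lucky_runs)
```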

Modern researchers are very aware of the danger that when you hunt through lots of possibilities, then just by chance, a random scattering of the results will appear to be statistically significant. Nonetheless, there are some tell-tale signs that this research strategy of hunting to find a result that looks statistically meaningful may be all too common. For example, one warning sign is when other researchers try to replicate the result using different data or statistical methods, but fail to do so. If a result only appeared statistically significant by random chance in the first place, it's likely not to appear at all in follow-up research.

Another warning sign arises when you look at a bunch of published studies in a certain area (like how to improve test scores, how a minimum wage affects employment, or whether a drug helps with a certain medical condition) and keep seeing that the findings are statistically significant at almost exactly the 5% level, or just a little less. In a large group of unbiased studies, one would expect to see the statistical significance of the results scattered all over the place: some at 1%, 2-3%, 5-6%, 7-8%, and higher levels. When all the published results are bunched right around 5%, it makes one suspicious that the researchers have put their thumb on the scales in some way to get a result that magically meets the conventional 5% threshold.
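There is no official test behind this warning sign, but here is a hedged sketch of how one might eyeball it: collect the reported p-values from a set of studies and see whether they pile up just under 0.05. The list of p-values below is invented for illustration; in practice it would be gathered from the published papers being reviewed.

```python
import numpy as np

# Hypothetical p-values reported across a set of published studies.
reported_p_values = [0.048, 0.049, 0.012, 0.047, 0.044, 0.031,
                     0.046, 0.049, 0.003, 0.045, 0.050, 0.038]

# Count how many fall into narrow bins between 0 and 0.06.
counts, edges = np.histogram(reported_p_values, bins=np.arange(0, 0.065, 0.005))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.3f}-{hi:.3f}: {'#' * c}")

# An unbiased literature would scatter p-values across the bins; a tall
# spike just below 0.050 is the warning sign described above.
```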

The problem that arises is that research results are being reported as meaningful in the sense that they had a 5% or less probability of happening by chance, when in reality that standard is being evaded by researchers. This problem is severe and common enough that a group of 72 researchers recently wrote: "Redefine statistical significance: We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries," which appeared in Nature Human Behaviour (Daniel J. Benjamin et al., January 2018, pp. 6-10). One of the signatories, John P. A. Ioannidis, provides a readable overview in "Viewpoint: The Proposal to Lower P Value Thresholds to .005" (Journal of the American Medical Association, March 22, 2018, pp. E1-E2). Ioannidis writes:
"P values and accompanying methods of statistical significance testing are creating challenges in biomedical science and other disciplines. The vast majority (96%) of articles that report P values in the abstract, full text, or both include some values of .05 or less. However, many of the claims that these reports highlight are likely false. Recognizing the major importance of the statistical significance conundrum, the American Statistical Association (ASA) published3 a statement on P values in 2016. The status quo is widely believed to be problematic, but how exactly to fix the problem is far more contentious.  ... Another large coalition of 72 methodologists recently proposed4 a specific, simple move: lowering the routine P value threshold for claiming statistical significance from .05 to .005 for new discoveries. The proposal met with strong endorsement in some circles and concerns in others. P values are misinterpreted, overtrusted, and misused. ... Moving the P value threshold from .05 to .005 will shift about one-third of the statistically significant results of past biomedical literature to the category of just “suggestive.”
This essay is published in a medical journal, and is thus focused on biomedical research. The theme is that a result with 5% significance can be treated as "suggestive," but for a new idea to be accepted, the threshold level of statistical significance should be 0.5%--that is, the probability of the outcome happening by random chance should be 0.5% or less.
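Returning to the coin-flip analogy from earlier, the tiny Python snippet below shows what the stricter cutoff means in the simplest possible setting: how long a run of straight heads each threshold demands before the coin is declared unfair.

```python
# How many straight heads does each threshold require from a fair coin?
for threshold in (0.05, 0.005):
    n = 1
    while 0.5 ** n >= threshold:
        n += 1
    print(f"threshold {threshold}: {n} heads in a row (p = {0.5 ** n:.4f})")

# Five straight heads (p = 0.0313) clears the conventional 5% bar, but the
# proposed 0.5% bar requires eight straight heads (p = 0.0039).
```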

The hope of this proposal is that researchers will design their studies more carefully and use larger sample sizes. Ioannidis writes: "Adopting lower P value thresholds may help promote a reformed research agenda with fewer, larger, and more carefully conceived and designed studies with sufficient power to pass these more demanding thresholds." Ioannidis is quick to admit that this proposal is imperfect, but argues that it is practical and straightforward--and better than many of the alternatives.
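To get a rough sense of what "larger" means here, the sketch below uses the standard normal-approximation formula for the sample size of a two-group comparison, n per group ≈ 2(z_alpha/2 + z_beta)^2 / d^2, at 80% power. The assumed effect size of d = 0.3 is an arbitrary illustrative choice, not a figure from Ioannidis or from the Benjamin et al. proposal.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(alpha, power=0.80, d=0.3):
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)           # quantile corresponding to desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

n_05 = n_per_group(alpha=0.05)    # roughly 175 subjects per group
n_005 = n_per_group(alpha=0.005)  # roughly 300 subjects per group
print(n_05, n_005, round(n_005 / n_05, 2))  # the stricter threshold needs ~70% more subjects
```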

The official "ASA Statement on Statistical Significance and P-Values" which appears with the Wasserstein and Lazar article includes a number of principles worth considering. Here are three of them:
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. ...
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. ...
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Whether you are doing the statistics yourself, or are just a consumer of statistical studies produced by others, it's worth being hyper-aware of what "statistical significance" means, and doesn't mean.

For those who would like to dig a little deeper, some useful starting points might be the six-paper symposium on "Con out of Economics" in the Spring 2010 issue of the Journal of Economic Perspectives, or the six-paper symposium on "Recent Ideas in Econometrics" in the Spring 2017 issue.