Introduction
P-values are among the most well-known and contentious statistical quantities in scientific research.
Inferential statistics relies on significance testing methods like p-value calculations. P-values answer the question of how rare your data is by telling you the probability of getting data that is at least as extreme as the data you observed if the null hypothesis were true.
Researchers Sullivan and Feinn found that the number of studies reporting p-values in their abstracts doubled from 1990 to 2014, and that 96% of those p-values were below .05, the common alpha (or false positive rate) cutoff for statistical significance discussed in Part 1. Many researchers believe the p-value's role in defining valid scientific findings has been overemphasized.
With articles like this hot off the press, it’s clear that it would be useful to break down how a p-value is used, why it’s helpful, and how it is misused.
What is a P-Value?
Because it is a probability, a p-value always falls between 0 and 1.
The American Statistical Association provides this definition of the p-value:
Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data would be equal to or more extreme than its observed value.
In null hypothesis significance testing, this “specified model” is the null hypothesis. Recall from Part 1 that in null hypothesis testing, you try to discredit the null hypothesis by assuming the null distribution is the true distribution your sample was drawn from.
In other words, the p-value measures how compatible your data is with the null hypothesis. For instance, suppose that a study involving a cancer immunotherapy drug produced a p-value of 0.02. This would indicate that if the drug had no effect, the researchers would have obtained the observed difference (or a larger one) in 2% of studies purely because of random sampling error.
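One way to see what that 2% means is to simulate many studies in which the null is true (the drug does nothing) and count how often chance alone produces a difference at least as large as the one observed. The sketch below, in Python with numpy, is purely illustrative; the group sizes, spread, and observed difference are assumptions, not values from any real trial.

import numpy as np

rng = np.random.default_rng(0)

n_per_group = 50            # assumed patients per arm (illustrative)
sigma = 20.0                # assumed spread of the efficacy measure
observed_diff = 8.2         # assumed drug-minus-control difference,
                            # chosen so the sketch lands near 2%
n_simulated_studies = 100_000

# Simulate studies where the null is true: both arms are drawn from the
# same distribution, so any difference is pure sampling error.
drug = rng.normal(0.0, sigma, size=(n_simulated_studies, n_per_group))
control = rng.normal(0.0, sigma, size=(n_simulated_studies, n_per_group))
diffs = drug.mean(axis=1) - control.mean(axis=1)

# Fraction of null studies showing a difference at least as large as the
# observed one -- this is what a (one-sided) p-value estimates.
p_est = np.mean(diffs >= observed_diff)
print(f"simulated p-value: {p_est:.3f}")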
Going back to the coin toss example from Part 1, if the coin is tossed 100 times, you would expect it to land on heads about 50% of the time. However, if we get heads 90% of the time, we would suspect a biased coin or that bias has been introduced into the experiment. We can therefore compute the p-value as the probability of getting 90 or more heads (what we observed) under the assumption that the true probability of heads is 50%.
In one-sided cases like the one above, the p-value tells us the probability of getting a sample value at least as extreme as the one we observed, in one particular direction. One-tailed p-values are useful when the researcher has an idea of which group will have the larger mean even before collecting the data. Often in research we don’t know which direction the effect will be in, so it is common to use two-sided p-values, which tell us how far a value is from the mean regardless of whether it is higher or lower.
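To make the coin example concrete, here is a minimal sketch in Python of the one-sided and two-sided p-values for 90 heads in 100 tosses of a fair coin (assuming scipy is available; scipy.stats.binomtest requires scipy 1.7 or newer).

from scipy import stats

n_tosses = 100        # number of coin tosses
observed_heads = 90   # heads we actually observed
p_fair = 0.5          # probability of heads under the null (fair coin)

# One-sided p-value: probability of 90 or more heads under the null.
one_sided = stats.binomtest(observed_heads, n_tosses, p_fair,
                            alternative="greater").pvalue

# Two-sided p-value: probability of a count at least this far from 50
# in either direction.
two_sided = stats.binomtest(observed_heads, n_tosses, p_fair,
                            alternative="two-sided").pvalue

print(f"one-sided p-value: {one_sided:.2e}")
print(f"two-sided p-value: {two_sided:.2e}")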
The Alpha Cutoff and Rejecting the Null Hypothesis
The alpha value is the cutoff for the p-value. With the common alpha of 0.05 in research, p-values less than this value are considered statistically significant. However, this by itself does not explain the importance of a study’s outcome, since it does not include information such as the size of the effect.
A cutoff of 0.05 means that we want our sample value to be at least in the top 5% of the most extreme values in our distribution before we consider it evidence against the null hypothesis. As such, a cutoff of 0.05 means you will mistakenly reject the null hypothesis 5% of the time when it is in fact true. Some scientists have argued that researchers can employ different p-value cutoffs (different alphas) depending on the situation, so long as they justify the choice.
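A quick way to convince yourself of that 5% error rate is to simulate many experiments in which the null hypothesis really is true and count how often the p-value dips below alpha anyway. This is only an illustrative sketch; the group size and the choice of a two-sample t-test are arbitrary assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 10_000
n_per_group = 30  # arbitrary sample size per group

false_positives = 0
for _ in range(n_experiments):
    # Both groups come from the same population, so the null is true.
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

# Should land near alpha: about 5% of experiments reject a true null.
print(f"false positive rate: {false_positives / n_experiments:.3f}")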
P-Value Misuses
The p-value does not tell you the probability that the null hypothesis is true or false
If our p-value isn't lower than alpha (that is, a non-significant p-value), we fail to reject the null. We do not accept the null or provide evidence that it is true; we have only failed to provide evidence that it is false. This point is summed up pretty well by “absence of evidence is not evidence of absence.”

The reverse mistake is just as common. Going back to the immunotherapy drug example, if the random sample that took the drug had 20% better efficacy than the random sample that took the control drug, a p-value of 0.02 does not mean that there is a 2% chance that the null hypothesis (that there really is no difference between the groups) is true. This mistake is known as the inverse probability fallacy. What we want is the probability of the null being true given the data we observed, but the p-value answers a different question: what is the probability of getting this data given that the null hypothesis is true? It’s a bit like assuming the probability of being rich given that one is famous is the same as the probability of being famous given that one is rich. In short, in the p-value calculation the null hypothesis is already assumed true, so differences between the groups are attributed to random sampling variation. If you wanted to assign a probability to the null hypothesis itself, you would have to turn to Bayes and arm yourself with prior probabilities (more on that later).
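To see how sharply P(data | null) and P(null | data) can differ, here is a hedged back-of-the-envelope Bayes calculation. The prior probability that a tested drug truly works and the test's power are assumptions chosen purely for illustration.

# Illustrative Bayes arithmetic -- the prior and the power are assumptions.
prior_null_false = 0.10   # assume only 10% of tested drugs truly work
power = 0.80              # assume an 80% chance of detecting a real effect
alpha = 0.05              # false positive rate when the null is true

# Overall probability of a significant result, true effects plus false alarms.
p_significant = power * prior_null_false + alpha * (1 - prior_null_false)

# Probability the null is actually TRUE given a significant result.
p_null_given_sig = alpha * (1 - prior_null_false) / p_significant
print(f"P(null true | p < alpha) = {p_null_given_sig:.2f}")  # about 0.36, not 0.05

Under these made-up numbers, more than a third of significant results would come from true nulls, which is exactly the kind of statement a p-value alone cannot make.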
Additionally, p-values can't tell you the probability that you’ve made an error, given that you have rejected the null. Measurement error and other irrelevant effects commonly make their way into scientific findings and skew p-values; we want the significance of a p-value to be driven by relevant effects, not by artifacts. In any study where the true effect deviates slightly from zero because of some systematic noise, a significant effect can be found with a large enough sample size. This amplification of irrelevant effects into a significant difference is one source of false findings in the sciences, especially now that big data is so easily accessible.
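As a sketch of that amplification, suppose a measurement pipeline adds a tiny systematic offset that has nothing to do with any real effect; the 0.02-standard-deviation bias and the sample sizes below are assumptions chosen only to illustrate the point. With enough data, a one-sample t-test flags the offset as significant.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
tiny_bias = 0.02  # assumed systematic offset, in standard-deviation units

for n in (100, 10_000, 1_000_000):
    # Data whose true mean deviates from zero only by the tiny bias.
    sample = rng.normal(tiny_bias, 1.0, n)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    print(f"n = {n:>9,}  p-value = {p:.2e}")

The offset never changes, yet the p-value shrinks from clearly non-significant to vanishingly small as the sample grows, which is why a significant p-value by itself says nothing about whether the effect is meaningful.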
Recall that the alternative hypothesis from Part 1 is the mathematical opposite of the null hypothesis. When you reject the null, you still don’t have much information about the alternative. In our coin toss example, if the null hypothesis is simply that our coin is unbiased and should produce heads 50% of the time, then when we reject this null, the alternative hypothesis is just stating that our coin is biased and does not produce heads 50% of the time. This underscores the importance of stating specific and clear hypotheses. Even better, if a researcher is able to state specifically the range they expect their effect size to be within, the theory has much higher predictive value. You get more credit in darts when you correctly predict you will hit the bullseye than when you correctly predict you will hit the wall.
Null hypothesis significance testing also sets up a binary situation where we have only the null hypothesis and the alternative hypothesis, but there very well may be other hypotheses that we have excluded by accident or on purpose through our study design. In this case, refuting the null is akin to killing a paper tiger or tearing down a straw man. It only shows that the observed data would be improbable if an (already) improbable hypothesis were true in the first place.
In short, p-values may be useful for seeing how well the data fits the null hypothesis and for showing differences in effects in well-designed experiments, but they do not reveal whether the null hypothesis is true, how robust findings are to omitted variables or mis-specification of the model, or how much power the test has against alternatives.
Final Thoughts
So we clearly shouldn’t be putting all of our interpretive eggs in the p-value basket. Yet the emphasis on p-values is so strong that researchers may skew their experiments to get lower ones, a practice called p-hacking (say, by deleting outlier points that raise the p-value, or by choosing analysis pipelines that are likely to lead to significant results). The way forward toward a more truthful and impactful scientific ethos is solid scientific reasoning and investigation, and attempting to prove oneself wrong.
The next post, Part 3, will detail how effect size and power play such an important role in determining probable scientific truths, and what researchers can do to increase the odds that their experimental designs will yield truer findings.