Understanding P-Values

I am currently reading the book Statistics Done Wrong: The Woefully Complete Guide by Alex Reinhart (No Starch Press, 2015). It contains a passage that goes like this:

A 2002 study found that an overwhelming majority of statistics students—and instructors—failed a simple quiz about p values. Try the quiz (slightly adapted for this book) for yourself to see how well you understand what p really means.

Suppose you’re testing two medications, Fixitol and Solvix. You have two treatment groups, one that takes Fixitol and one that takes Solvix, and you measure their performance on some standard task (a fitness test, for instance) afterward. You compare the mean score of each group using a simple significance test, and you obtain p = 0.01, indicating there is a statistically significant difference between means.

Based on this, decide whether each of the following statements is true or false:

  1. You have absolutely disproved the null hypothesis (“There is no difference between means”).
  2. There is a 1% probability that the null hypothesis is true.
  3. You have absolutely proved the alternative hypothesis (“There is a difference between means”).
  4. You can deduce the probability that the alternative hypothesis is true.
  5. You know, if you decide to reject the null hypothesis, the probability that you’re making the wrong decision.
  6. You have a reliable experimental finding, in the sense that if your experiment were repeated many times, you would obtain a significant result in 99% of the trials.

I got one of the answers wrong, and this post explains why.

Now, to get that out of the way: what is a p-value? P-values occur in statistical hypothesis testing, where you want to know whether something you’re doing has an effect. In our example, we wanted to know whether the effects of the two drugs differed. You then formulate the hypothesis of the null result: “There really is no difference between the two drugs”, or “what I’m doing has no effect”. That is the null hypothesis. The alternative hypothesis is “There is a difference between the drugs”, or “what I’m doing actually has an effect”.

You then obtain a test statistic. In our example, we measured the performance of the two groups and the test statistic was the difference of the mean scores. Finally, you put this test statistic (and other information, such as the number of people in the two groups) into a magic box. This magic box gives you the probability that the test statistic is as far away from “zero effect” as observed, or further, if there really is no effect. That is the p-value. It means that if the null hypothesis is true, and if I repeat the experiment many times, the fraction of experiments in which the test statistic will be this large or larger is p.
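To make the “magic box” a bit more concrete, here is a minimal sketch in Python (not from the book; the group sizes and scores are invented), using a two-sample t-test as the simple significance test:

```python
# A minimal sketch of the "magic box": a two-sample t-test on made-up scores.
# All numbers (means, spreads, group sizes) are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
fixitol = rng.normal(loc=52, scale=10, size=30)  # hypothetical Fixitol scores
solvix = rng.normal(loc=48, scale=10, size=30)   # hypothetical Solvix scores

# The test statistic is built from the difference of the means; scipy turns it
# into a p-value: the probability, assuming the null hypothesis of equal means,
# of seeing a difference at least this extreme.
t_stat, p_value = stats.ttest_ind(fixitol, solvix)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```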

From this definition, we can immediately conclude that statements 1 and 3 must be false. No matter how small the p-value is (as long as it’s not precisely zero), if the null hypothesis is in fact true, a p-value as small as this or smaller will occur, on average, once every 1/p repetitions of the experiment. That is merely a restatement of the definition. A p-value, no matter how small, can neither conclusively prove nor disprove any hypothesis.
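You can watch this happen in a quick simulation (again, not from the book; the numbers are made up): when the null hypothesis is true by construction, a p-value of 0.01 or less still turns up in roughly 1% of experiments.

```python
# Simulate many experiments in which the null hypothesis is true by construction,
# and count how often the p-value comes out as small as 0.01 or smaller.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p_values = []
for _ in range(10_000):
    a = rng.normal(loc=50, scale=10, size=30)  # both groups drawn from the
    b = rng.normal(loc=50, scale=10, size=30)  # same distribution: no effect
    p_values.append(stats.ttest_ind(a, b).pvalue)

# Roughly 1% of the experiments reach p <= 0.01 even though there is no effect.
print(np.mean(np.array(p_values) <= 0.01))
```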

Statement 6 is equally easy to disprove. We do not know whether the null hypothesis is true or not, so we don’t know what would happen if we repeated the experiment many times. Perhaps the null hypothesis is true and our p-value of 0.01 was one of those “one in a hundred” occurrences. Or maybe the null hypothesis is false, and our low p-value is then a consequence of a true difference in means. We don’t know.
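How often repeated experiments would come out significant is a question about the power of the experiment, which depends on the (unknown) true effect size and the sample size. As a rough sketch, with numbers I made up, even a real difference can reach significance far less than 99% of the time:

```python
# Rough power sketch (all numbers invented): how often does an experiment with a
# small but real difference between means reach p < 0.01?
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
reps = 10_000
hits = 0
for _ in range(reps):
    a = rng.normal(loc=53, scale=10, size=30)  # assumed true difference of 3 points
    b = rng.normal(loc=50, scale=10, size=30)
    if stats.ttest_ind(a, b).pvalue < 0.01:
        hits += 1

# With these made-up numbers, the fraction of significant results is nowhere near 99%.
print(hits / reps)
```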

Statements 2 and 4 must also be wrong, although for a different reason. Although it seems intuitive to speak about “the probability that the null hypothesis is true or false”, this statement, at least as given here, makes little sense. Normally, the probability of an event is roughly the fraction of times that event occurs when some experiment is repeated many times. But in order to speak of the “probability that the null hypothesis is true”, the experiment to be repeated would somehow have to switch between the null hypothesis being true one time and false another. But surely, either there is a difference between the two drugs or there is not, so we can’t imagine a universe where there is a difference one time and no difference another time. Posed like this, the question “what is the probability that the null hypothesis is true” makes no sense!

(For my Bayesian friends out there: I realise that this is a frequentist argument. But p-values are frequentist instruments, so that is not unfair.)

So we’re stuck with statement number 5. If I’m supposed to be making the wrong decision, it must mean that the null hypothesis is true (otherwise rejecting the null hypothesis is not wrong), and I’m rejecting it because my p-value is lower than some threshold. In this case, I know exactly the probability that I’ll be making the wrong decision: it’s p. This is precisely what the p-value means; it’s just a restatement of the definition.

My conclusion therefore: statement 5 is true and all others are false.

So imagine my dismay when the solution to the quiz read:

I hope you’ve concluded that every statement is false. The first five statements ignore the base rate, while the last question is asking about the power of the experiment, not its p-value.

So according to the author, I had mistakenly marked statement 5 as correct. Why was that? Just today, as I was rereading that passage, I suddenly understood that I had probably misunderstood the statement!

I thought that the statement meant “if you run an experiment and the null hypothesis is true, do you know the probability that the p-value will be as low as the one observed, or lower?” In this case, the answer really is p. But the question could also be formulated differently: “if you run an experiment and you don’t know whether the null hypothesis is true or not, do you know the probability that the p-value will be as low as the one observed, or lower?”

In this case, the answer is no. If the null hypothesis is in fact false, the probability of making a mistake in rejecting it is zero. If the null hypothesis is in fact true, the probability of making a mistake in rejecting it is p. We don’t know whether the null hypothesis is true or not, and since the two probabilities differ, we don’t know the correct one.
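The same simulation idea illustrates this asymmetry: the chance that rejecting at the 0.01 level turns out to be a wrong decision depends entirely on whether the null hypothesis happens to be true. (All numbers below are, once more, invented.)

```python
# Sketch of the two scenarios above: wrong rejections only occur in the world
# where the null hypothesis is true, and there they occur at about the rate alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def wrong_rejection_rate(true_difference, reps=5_000, alpha=0.01):
    wrong = 0
    for _ in range(reps):
        a = rng.normal(50 + true_difference, 10, 30)
        b = rng.normal(50, 10, 30)
        rejected = stats.ttest_ind(a, b).pvalue <= alpha
        # A rejection is only a mistake when there is in fact no difference.
        if rejected and true_difference == 0:
            wrong += 1
    return wrong / reps

print(wrong_rejection_rate(true_difference=0))  # null true: about 0.01
print(wrong_rejection_rate(true_difference=5))  # null false: exactly 0
```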

We could say “the probability of wrongly rejecting the null hypothesis is at most p”, but, alas, that was not the question.