I am currently reading the book *Statistics Done Wrong: The Woefully
Complete Guide* by Alex Reinhart (No Starch Press, 2015). It
contains a passage that goes like this:

> A 2002 study found that an overwhelming majority of statistics students—and instructors—failed a simple quiz about p values. Try the quiz (slightly adapted for this book) for yourself to see how well you understand what p really means.
>
> Suppose you’re testing two medications, Fixitol and Solvix. You have two treatment groups, one that takes Fixitol and one that takes Solvix, and you measure their performance on some standard task (a fitness test, for instance) afterward. You compare the mean score of each group using a simple significance test, and you obtain p = 0.01, indicating there is a statistically significant difference between means.
>
> Based on this, decide whether each of the following statements is true or false:
>
> - You have absolutely disproved the null hypothesis (“There is no difference between means”).
> - There is a 1% probability that the null hypothesis is true.
> - You have absolutely proved the alternative hypothesis (“There is a difference between means”).
> - You can deduce the probability that the alternative hypothesis is true.
> - You know, if you decide to reject the null hypothesis, the probability that you’re making the wrong decision.
> - You have a reliable experimental finding, in the sense that if your experiment were repeated many times, you would obtain a significant result in 99% of the trials.

I got one of the answers wrong, and this post explains why.

Now, to get that out of the way, what is a p-value? P-values occur in
statistical hypothesis testing, where you want to know whether
something you’re doing has an effect or not. In
our example, we wanted to know if the effects of the two drugs were
different. You then formulate the hypothesis of no effect: “There
really is no difference between the two drugs”, or “what I’m
doing has no effect”. That is the *null hypothesis*. The *alternative
hypothesis* is “There is a difference between the drugs”, or “what I’m
doing actually has an effect”.

You then obtain a *test statistic*. In our example, we measured the
performance of the two groups and the test statistic was the
difference of the mean scores. Finally, you put this test statistic
(and other information, such as the number of people in the two
groups) into a magic box. This magic box gives you *the probability
that the test statistic is as far away from “zero effect” as observed,
or further, if there really is no effect*. That is the p-value. It
means that *if the null hypothesis is true, and if I repeat the
experiment many times, the fraction of experiments in which the test
statistic will be this large or larger is p*.
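To make the “magic box” concrete, here is a minimal sketch of one such box, a permutation test on the difference of means. The drug names come from the quiz, but the scores are invented purely for illustration:

```python
import random
import statistics

def perm_test_pvalue(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis, the group labels carry no information,
        # so we may reassign them at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Hypothetical fitness scores for the two treatment groups:
fixitol = [14, 17, 15, 16, 18, 15, 17]
solvix = [12, 13, 14, 12, 15, 13, 12]
print(perm_test_pvalue(fixitol, solvix))  # small: such a gap is rare under the null
```

Shuffling the pooled scores simulates a world in which the treatment makes no difference, which is exactly the null hypothesis; the returned fraction is the probability of seeing a difference in means this large or larger in that world.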

From this definition, we can immediately conclude that statements 1
and 3 must be false. No matter how small the p-value is (as long as
it’s not precisely zero), if the null hypothesis is in fact true, a
p-value as small as this or smaller will occur on average every 1/p
repetitions of the experiment. That is merely a restatement of the
definition. A small p-value, no matter *how* small, can neither
conclusively prove nor disprove any hypothesis.
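This frequency interpretation can be checked by simulation. The sketch below (my own illustration, not from the book) repeatedly runs an experiment in which the null hypothesis really is true — testing whether a fair coin is fair — and counts how often the p-value falls below a threshold:

```python
import math
import random

def two_sided_binom_p(k, n):
    """Exact two-sided p-value for observing k heads in n fair-coin flips."""
    d = abs(k - n / 2)
    return sum(math.comb(n, i) for i in range(n + 1)
               if abs(i - n / 2) >= d) / 2 ** n

rng = random.Random(42)
n_flips, n_experiments, alpha = 100, 2000, 0.05
low_p = 0
for _ in range(n_experiments):
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))  # null is true
    if two_sided_binom_p(heads, n_flips) <= alpha:
        low_p += 1
print(low_p / n_experiments)  # at most alpha (a bit below, as the statistic is discrete)
```

Even though the coin is perfectly fair, a fraction of roughly alpha of the experiments still produce a “significant” p-value, exactly as the definition says they must.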

Statement 6 is equally easy to disprove. We do not know whether the null hypothesis is true or not, and therefore we don’t know what would happen if we repeated the experiment many times. Perhaps the null hypothesis is true and our p-value of 0.01 was one of those “one in a hundred” occurrences. Or maybe the null hypothesis is false, and our low p-value is then a consequence of a true difference in means. We don’t know.

Statements 2 and 4 must also be wrong, although for a different reason. Although it seems intuitive to speak about “the probability that the null hypothesis is true or false”, this statement, at least as given here, makes little sense. Normally, the probability of an event is roughly the fraction of times that event occurs when some experiment is repeated many times. But in order to speak of the “probability that the null hypothesis is true”, the experiment to be repeated would somehow have to switch between the null hypothesis being true one time and false another. But surely, either there is a difference between the two drugs or there is not, so we can’t imagine a universe where there is a difference one time and no difference another time. Posed like this, the question “what is the probability that the null hypothesis is true” makes no sense!

(For my Bayesian friends out there: I realise that this is a frequentist argument. But p-values are frequentist instruments, so that is not unfair.)

So we’re stuck with statement number 5. If I’m supposed to be making
the wrong decision, it must mean that the null hypothesis is true
(otherwise rejecting the null hypothesis is not wrong), and I’m
rejecting it because my p-value is lower than some threshold. In this
case I know *exactly* the probability that I’ll be making the wrong
decision: it’s p. This is precisely what the p-value means; it’s just
a restatement of the definition.

My conclusion therefore: statement 5 is true and all others are false.

So imagine my dismay when the solution to the quiz read:

> I hope you’ve concluded that every statement is false. The first five statements ignore the base rate, while the last question is asking about the *power* of the experiment, not its p-value.

So according to the author, I had statement 5 mistakenly marked as correct. Why was that? Just today, as I was rereading that passage, I suddenly understood that I had probably misunderstood the statement!

I thought that the statement meant “if you perform an experiment and
the null hypothesis is true, do you know the probability that the
p-value will be as low as the one observed, or lower?” In that case,
the answer really is p. But the question could also be read
differently: “if you perform an experiment *and you don’t know whether
the null hypothesis is true or not*, do you know the probability that
the p-value will be as low as the one observed, or lower?”

In this case, the answer is no. If the null hypothesis is in fact false, the probability of making a mistake in rejecting it is zero. If the null hypothesis is in fact true, the probability of making a mistake in rejecting it is p. We don’t know whether the null hypothesis is true or not and since the two probabilities are different, we don’t know the correct probability.
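The two cases can be sketched in code (again my own illustration, reusing a fair-coin test): when the null hypothesis is true, wrong rejections happen at roughly the threshold rate; when it is false, rejecting is never wrong, so the rate is exactly zero. Since we don’t know which world we are in, we don’t know which rate applies:

```python
import math
import random

def two_sided_binom_p(k, n):
    """Exact two-sided p-value for observing k heads in n fair-coin flips."""
    d = abs(k - n / 2)
    return sum(math.comb(n, i) for i in range(n + 1)
               if abs(i - n / 2) >= d) / 2 ** n

def wrong_rejection_rate(true_heads_prob, n=100, trials=2000, alpha=0.05, seed=1):
    """Fraction of experiments that wrongly reject H0: 'the coin is fair'."""
    rng = random.Random(seed)
    null_is_true = (true_heads_prob == 0.5)
    mistakes = 0
    for _ in range(trials):
        heads = sum(rng.random() < true_heads_prob for _ in range(n))
        rejected = two_sided_binom_p(heads, n) <= alpha
        if rejected and null_is_true:  # rejecting a true null is the only mistake
            mistakes += 1
    return mistakes / trials

print(wrong_rejection_rate(0.5))  # null true: roughly alpha (at most)
print(wrong_rejection_rate(0.7))  # null false: rejecting is never a mistake, so 0.0
```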

We *could* say “the probability of wrongly rejecting the null
hypothesis is *at most* p”, but, alas, that was not the question.