This post is kind of (p)-hacky.

What is a p-value? It’s what we use to say “my experiment isn’t full of crap”. Specifically, if I want to prove a drug is effective, then I take my data, compute a p-value, and if it’s below some acceptable value (say, 0.05) I get to say that my drug was effective. Similarly, I can do the same thing in A/B testing for some software feature on an app. All in all, the smaller the p-value, the better.

Now, let’s say that I’m a poor graduate student, or a desperate young researcher, or a biomedical company that has sunk far too much into a drug development project. Now, I can only move forward with my results if the p-value is small enough. So, I decide to hedge my bets and try multiple things at once. Maybe instead of testing one feature, I test 10 simultaneously, and publish my results on whatever feature landed me that beautifully small p-value. Huzzah, I’ve gamed the system!

But is this a good idea? For instance, in the drug example, suppose I don’t really know what the pill does, so I test it on five different diseases all at once, and put it out for the disease on which it showed the most effect. Seems logical. But now, I’m not checking to see if $p < 0.05$, but rather if $\min_{i=1,\ldots,5} p_i < 0.05$. What’s wrong with that? Well, frankly, the chance that the minimum of several random quantities falls below some threshold can be far larger than the chance that any single one of them does. To put it mathematically,

$$
\mathrm{Pr}(p<\alpha) \leq \mathrm{Pr}(\min_{i=1,…,m} p_i < \alpha)  \leq \sum_{i=1}^m \mathrm{Pr}(p_i<\alpha).
$$

That is, the chance that the minimum p-value falls below $\alpha$ is at least as large as the chance that a single p-value does, and by the union bound it can be up to $m$ times as large.

This is called p-hacking.

Let’s back up and formalize things a little

Most days, I consider myself mathematically literate, but when it comes to statistics I seem to always feel an automatic brain fog. So if you’re anything like me, you’d appreciate a bit more formalism, to try to keep things clear. So let’s start from the beginning.

In data science, we usually start with a null hypothesis ($H_0$), and we ask whether the data give us grounds to reject it, in a process called hypothesis testing. For example, the null hypothesis could be that a drug is ineffective, or a particular software change makes no positive impact on the user. Then, I collect some data, and somehow compute the probability of seeing that data (or data that is even more extreme) under the null hypothesis. This probability is the p-value. If the p-value is small, it means that the chance of me seeing this data, under the null hypothesis, is small, and thus my data is evidence to reject the null hypothesis.

We can generalize this concept mathematically: a p-variable is a random variable (a function of the data) that satisfies the following key property for every $\alpha \in [0,1]$:

$$
\mathrm{Pr}_{H_0}(p\leq \alpha) \leq \alpha \tag{Key p-value property}
$$
A p-value is a value of that variable. That is, the p-value should exactly represent the rarity of seeing your data, under the null hypothesis.

(ChatGPT thinks I shouldn’t have used the word rarity there, and prefers I say “extreme”. She (yes I gendered my AI) made the good point that an event can be rare but not extreme, e.g. if I usually wake up around 6AM to 7AM, then the chances of me waking up at 6:12 on the dot is pretty rare, but not that extreme. Still, I think rarity captures more what I’m trying to convey, which is that it is more reject-able. But I could be totally wrong.)
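One way to get a feel for the key p-value property is to check it by simulation. The sketch below is only an illustration under assumptions of my own choosing: a one-sided z-test with known unit variance, data drawn with true mean 0 (so $H_0$ holds), and function names invented for this example. It estimates how often $p \leq \alpha$ under the null.

```python
import math
import random
from statistics import NormalDist

def one_sided_z_pvalue(sample):
    """p-value for H0: mean = 0 vs. mean > 0, assuming known unit variance."""
    n = len(sample)
    z = sum(sample) / math.sqrt(n)        # standardized sample mean
    return 1.0 - NormalDist().cdf(z)      # chance of a z at least this large under H0

def rejection_rate_under_null(alpha, n=20, trials=10000, seed=0):
    """Estimate Pr_{H0}(p <= alpha) by simulating data whose true mean is 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if one_sided_z_pvalue(sample) <= alpha:
            hits += 1
    return hits / trials

rates = {alpha: rejection_rate_under_null(alpha) for alpha in (0.01, 0.05, 0.10)}
for alpha, rate in rates.items():
    print(alpha, rate)   # each estimate should land close to alpha
```

For this particular test the property holds with equality, so the estimated rates should sit near $\alpha$ rather than well below it.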

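So why is the above example of p-hacking bad? Well, suppose in the extreme that the $m$ tests were actually independent, and that each p-value is exactly calibrated under the null, so that $\mathrm{Pr}(p_i \leq \alpha) = \alpha$. Then by applying De Morgan’s law,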

\begin{eqnarray*}
\mathrm{Pr}(\min_{i=1,…,m} p_i \leq\alpha)  &=& \mathrm{Pr}( p_1\leq \alpha \text{ or } p_2 \leq \alpha \text{ or } … \text{ or } p_m\leq \alpha)  \\
&=& 1- \mathrm{Pr}( p_1> \alpha \text{ and } p_2 > \alpha \text{ and } … \text{ and } p_m> \alpha)  \\
&=&1-(1-\alpha)^m
\end{eqnarray*}

For fun, let’s actually see how big the numbers can get when you p-hack (see table below).

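$$
\begin{array}{c|ccccc}
\alpha \,\backslash\, m & 3 & 5 & 10 & 20 & 100 \\ \hline
0.01 & 0.030 & 0.049 & 0.096 & 0.182 & 0.634 \\
0.05 & 0.143 & 0.226 & 0.401 & 0.642 & 0.994 \\
0.10 & 0.271 & 0.410 & 0.651 & 0.878 & 0.99997
\end{array}
$$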

Yikes! Those numbers are (wrongly) huge!
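The entries in the table come straight from the formula $1-(1-\alpha)^m$. Here is a minimal sketch that recomputes them; the function name is made up for this example, and it assumes the same independence and exact calibration as the derivation above.

```python
def familywise_false_positive_rate(alpha, m):
    """Pr(min_i p_i <= alpha) for m independent, exactly calibrated p-values."""
    return 1.0 - (1.0 - alpha) ** m

ms = (3, 5, 10, 20, 100)
for alpha in (0.01, 0.05, 0.10):
    row = [familywise_false_positive_rate(alpha, m) for m in ms]
    print(alpha, ["%.5f" % value for value in row])
```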

What are other ways we can p-hack?

Hopefully at this point, I’ve convinced you that p-hacking 1. is a thing and 2. is a bad thing. What are other ways that p-hacking can occur?

Well, we covered the case of testing for multiple factors at once.

Another is called “peeking”: this is the act of checking the p-value while the data is still coming in, and ending the experiment as soon as the p-value is small enough. (This practice is attractive for, say, an expensive and drawn-out drug trial.) Successive peeks at the same growing dataset are more strongly correlated than tests of separate factors, so the inflation is milder than in the independent case. However, peeking still violates the key property, since each extra look gives the null another chance to produce a small p-value, and so there is still some increase in the probability of seeing one.
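To see how much peeking can inflate false positives, here is a small simulation sketch. Everything in it is an assumption made for illustration: a one-sided z-test with known unit variance, data generated under $H_0$, up to 100 observations, and a made-up function name. It compares a single look at the end with a look after every 10 observations.

```python
import math
import random
from statistics import NormalDist

def peeking_false_positive_rate(alpha=0.05, max_n=100, peek_every=10,
                                trials=5000, seed=0):
    """Fraction of null experiments that show p <= alpha at some peek."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    false_positives = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)       # data generated under H0 (mean 0)
            if n % peek_every == 0:            # peek at the running p-value
                z = total / math.sqrt(n)
                p = 1.0 - std_normal.cdf(z)
                if p <= alpha:                 # stop early and declare success
                    false_positives += 1
                    break
    return false_positives / trials

single_look = peeking_false_positive_rate(peek_every=100)  # one look at n = 100
ten_looks = peeking_false_positive_rate(peek_every=10)     # looks at n = 10, 20, ..., 100
print(single_look, ten_looks)
```

The single-look rate should hover near $\alpha$, while the ten-look rate should come out noticeably larger, though smaller than the $1-(1-\alpha)^{10}$ you would get from ten independent tests.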

A third is in machine learning, and occurs when we test multiple models and pick the best one for a task. We might think this is mitigated because we all took a basic machine learning course that taught us to use cross validation and to never ever ever ever ever confuse the test set (used for evaluation) with the validation set (used for model selection). However, we should be careful before we pat ourselves on the back, when standardized benchmarks and metrics like ImageNet and BLEU have been reused to develop and compare many models, through contests or simply publication competition.

(This third example also draws an interesting connection with p-hacking and overfitting. Indeed, p-hacking can be viewed like overfitting your experimental design to your data.)

The Bonferroni correction

One (rather extreme) way to mitigate this error is to use a correction term that accounts for the inflation in p-values. In particular, even without assuming independence, we can use a union bound so that

\begin{eqnarray*}
\mathrm{Pr}(\min_{i=1,\ldots,m} p_i \leq\alpha)  &=& \mathrm{Pr}\left( \bigcup_{i=1}^m \{p_i\leq \alpha\}\right) \leq \sum_{i=1}^m \mathrm{Pr}(p_i\leq \alpha) \leq m\alpha.
\end{eqnarray*}

So, if each $p_i$ is a valid p-variable for that specific experiment, then I can at least assume that $\mathrm{Pr}(\min_{i=1,\ldots,m} p_i \leq \alpha) \leq m\alpha$. So, instead I should use an effective p-value that is $m$ times the minimum, i.e. $p_{\text{eff}} = m\cdot \displaystyle\min_{i=1,\ldots,m} p_i $. Then,

$$
\mathrm{Pr}(p_{\text{eff}}  \leq\alpha) = \mathrm{Pr}(\min_{i=1,…,m} p_i \leq\frac{\alpha}{m}) \leq \frac{\alpha}{m}\cdot m = \alpha
$$
so $p_{\text{eff}}$ is itself a valid p-variable.

Nice! So we got a way out. Of course, actually achieving that small of a p-value could be onerous for the experimental design, but hey, life is filled with tradeoffs. It is worth saying that the union bound can at times be loose, especially when the experiments are highly correlated with each other (e.g. in the peeking scenario, or if you are testing for factors that are comorbidities in drug trials), but better safe than sorry, right?
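As a rough sketch of the correction in action, the snippet below computes $p_{\text{eff}}$ and checks by simulation that it brings the false positive rate back under $\alpha$. The function names are invented here, the cap at 1 is my addition (a p-value above 1 carries no extra meaning), and the p-values are simulated as independent uniforms purely for convenience; the bound itself does not need independence.

```python
import random

def bonferroni_p_value(p_values):
    """Effective p-value m * min_i p_i, capped at 1."""
    m = len(p_values)
    return min(1.0, m * min(p_values))

def familywise_rate(alpha, m, corrected, trials=20000, seed=0):
    """How often the smallest (optionally corrected) p-value clears alpha under H0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under H0, exactly calibrated p-values are uniform on [0, 1];
        # independence is assumed here only to keep the simulation simple.
        p_values = [rng.random() for _ in range(m)]
        p = bonferroni_p_value(p_values) if corrected else min(p_values)
        if p <= alpha:
            hits += 1
    return hits / trials

uncorrected = familywise_rate(0.05, 10, corrected=False)  # theory: 1 - 0.95**10, about 0.40
corrected = familywise_rate(0.05, 10, corrected=True)     # theory: at most 0.05
print(uncorrected, corrected)
```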

Introducing the e-value

Let’s revisit the peeking example, and try to make it “safe”, i.e. preserving the property that the p-value is as rare under the null hypothesis as it says it is. If I wanted to use the Bonferroni correction, I could, by, say, agreeing ahead of time to peek at most $m$ times, and then reporting $p_{\text{eff}} = m p$, where $p$ is the p-value at the moment I stop. But that’s rather limiting, and also rather pessimistic. Instead, I want to have a statistic that says what I want to say, in the peeking example, no matter how many times I peek.

This is the idea behind the e-value, which has been around since the 50s, but has recently gained traction through several important papers (see references). In particular, an e-variable is a nonnegative random variable which obeys a related property

$$
\mathbb E_{H_0}[e]\leq 1 \tag{Key e-variable property}
$$

The idea here is to turn a probability into a bet. Say I design a new drug, and want to prove to the FDA that it’s effective. So, I put \$100 on my drug being effective. Then, let’s say that my drug is tested over many trials, and at each trial I bet my money, and I either double it if the trial succeeds, or lose it all if the trial fails. The e-value is the factor by which my initial stake has been multiplied. Under the null hypothesis the bet is at best fair, so in expectation I should end up with at most the \$100 I started with, i.e. an e-value of at most 1.

An example of an e-value is a likelihood ratio (the product form below assumes the observations are independent under both hypotheses):

$$
\Lambda_t = \frac{\text{Pr}(X_{1:t}|H_1)}{\text{Pr}(X_{1:t}|H_0)}=\prod_{\tau=1}^t\frac{\text{Pr}(X_{\tau}|H_1)}{\text{Pr}(X_{\tau}|H_0)}
$$

In particular, note that the value of $\Lambda_t$ can be much bigger than 1 if the observed data are much more likely under $H_1$ than under $H_0$. However, in expectation under the null,

$$
\mathbb E_{H_0}[\Lambda_t] = \sum_{X_{1:t}} \frac{\text{Pr}(X_{1:t}|H_1)}{\text{Pr}(X_{1:t}|H_0)}\text{Pr}(X_{1:t}|H_0) =\sum_{X_{1:t}} \text{Pr}(X_{1:t}|H_1) = 1
$$

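because the probabilities under $H_1$ sum to 1. Moreover, $\Lambda_t$ is a nonnegative martingale under $H_0$, so its expectation is still at most 1 if I stop at a time chosen by looking at the data so far. In this sense, forming an e-value in this way is safe from peeking.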
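The expectation identity can be checked exactly on a toy problem. The sketch below assumes a coin-flip experiment with $H_0$: heads with probability 0.5 and $H_1$: heads with probability 0.7; these numbers and the function names are illustrative choices, not anything from the references. It enumerates every sequence of $t$ flips and sums the likelihood ratio weighted by its null probability.

```python
from itertools import product

P_HEADS_H0 = 0.5   # null: fair coin (illustrative choice)
P_HEADS_H1 = 0.7   # alternative: biased coin (illustrative choice)

def sequence_probability(flips, p_heads):
    """Probability of an i.i.d. sequence of flips (1 = heads, 0 = tails)."""
    prob = 1.0
    for flip in flips:
        prob *= p_heads if flip == 1 else 1.0 - p_heads
    return prob

def likelihood_ratio(flips):
    """Lambda_t = Pr(flips | H1) / Pr(flips | H0)."""
    return sequence_probability(flips, P_HEADS_H1) / sequence_probability(flips, P_HEADS_H0)

def expected_ratio_under_null(t):
    """Exact E_{H0}[Lambda_t], summing over all 2**t sequences of length t."""
    return sum(
        likelihood_ratio(flips) * sequence_probability(flips, P_HEADS_H0)
        for flips in product((0, 1), repeat=t)
    )

print(likelihood_ratio((1, 1, 1, 0, 1)))   # mostly heads: ratio above 1
print(expected_ratio_under_null(5))        # equals 1 up to floating-point error
```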

Type I guarantee

A Type I error, or a false positive error, is when you pick $H_1$ when you were supposed to pick $H_0$. In the medical example, it would be deciding a drug is effective, when in actuality it is ineffective. A type I guarantee is therefore an upper limit on the chance of a type I error, e.g. Pr(reject $H_0$ | $H_0$ is true) $\leq \alpha$. The e-value gives us this type I guarantee through Markov’s inequality

$$
\text{Pr}_{H_0}(1/E > \alpha)  = 1 - \text{Pr}_{H_0}(E\geq 1/\alpha) \geq 1 - \alpha \mathbb E_{H_0}[E]\geq 1-\alpha
$$

This means that if $E$ is a valid e-variable, then $1/E$ is a valid p-variable. For example, if I ran an experiment and achieved an actual realization of $E = 5$ (e.g., as the value of the likelihood ratio), this corresponds to a p-value of $p = 1/5 = 0.2$. However, since Markov’s inequality can be quite loose, this is a conservative bound. Additionally, the other way around doesn’t work; if $P$ is a valid p-variable, $\mathbb E_{H_0}[1/P]$ may be much higher than 1 (for a uniformly distributed $P$ it is infinite). And, as the previous example showed, the type I guarantee is maintained under sequential testing, since the guarantee of $\mathbb E_{H_0}[E_t] \leq 1$ is maintained for all $t$.
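Here is a hedged sketch of what that sequential guarantee looks like in practice. It reuses an illustrative coin-flip setup ($H_0$: heads with probability 0.5, $H_1$: heads with probability 0.7, both my own assumptions), generates data under $H_0$, peeks at the running likelihood ratio after every flip, and rejects the first time it reaches $1/\alpha$. For a likelihood ratio process, Ville's inequality (the sequential counterpart of Markov's inequality) bounds the chance of ever crossing $1/\alpha$ by $\alpha$.

```python
import random

def ever_exceeds(threshold, max_flips, rng, p_h0=0.5, p_h1=0.7):
    """Run one null experiment; True if the running likelihood ratio ever reaches threshold."""
    ratio = 1.0
    for _ in range(max_flips):
        heads = rng.random() < p_h0                    # data come from H0
        ratio *= (p_h1 / p_h0) if heads else ((1 - p_h1) / (1 - p_h0))
        if ratio >= threshold:                         # peek after every single flip
            return True
    return False

def false_positive_rate(alpha=0.05, max_flips=200, trials=5000, seed=0):
    """Fraction of null experiments rejected at some point by the rule ratio >= 1/alpha."""
    rng = random.Random(seed)
    hits = sum(ever_exceeds(1.0 / alpha, max_flips, rng) for _ in range(trials))
    return hits / trials

rate = false_positive_rate()
print(rate)   # an estimate of a probability that theory bounds by alpha = 0.05
```

Contrast this with peeking at a p-value, where checking after every observation would push the false positive rate well above $\alpha$.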

An e-value is not interchangeable with a p-value

However, despite this one-way conversion (and a looser one that goes in the other direction too), I think it is not correct to suggest the two measures are equivalent. This is because of a key difference: a p-value is agnostic to the details of $H_1$, while an e-value profits from taking $H_1$ into account. Specifically (though it is not by definition required) the best choice of e-variable can be argued to be $S(x) = \frac{\text{Pr}(x|H_1)}{\text{Pr}(x|H_0)}$ (exactly the likelihood ratio), which maximizes $\mathbb E_{x\sim H_1}[\ln(S(x))]$ under the constraint that $\mathbb E_{x\sim H_0}[S(x)] \leq 1$.

However, we can try to play a game of drawing an equivalence, by thinking of the e-value that would maximize the chance of rejecting the null under the p-test. We can do this by considering two examples offered in Shafer’s paper, illustrated below.

[Figure: two panels, Case A and Case B]

In both cases, suppose the observation is $x = 10$. In the case of a p-test with $\alpha = 5\%$, in both cases the null hypothesis would be rejected. However, if we are allowed to model an alternative hypothesis $H_1$ as shown, and take it into consideration, the conclusion is less clear; in both cases, $H_1$ is not much better. Note that in the p-test case, $H_1$ just means “not $H_0$”, so it’s not like you’re being misled. But, if you did have an $H_1$ in mind, an e-value would take it into account.

What would an e-value tell you? Well, if we use the likelihood ratio, then in both cases, $S(x)$ would be marginally above 1; it would give you the granularity to say, although $H_1$ is better than $H_0$, it’s not enough to bet all your money on.

My understanding, based on the later section of Shafer’s paper, is that this offers a powerful framework in a sequential testing setting: you can learn a better $H_1$ while testing against $H_0$ at the same time, and learning an accurate $H_1$ is beneficial to your betting strategy.

However, one thing is true: you can still hack e-values.

Specifically, if we tried to e-hack the same way we p-hacked, we’d still be hacky. Recall our example before, about the multiple observations, all being simultaneously tested. Then, although $\text{Pr}_{H_0}(P_i \leq \alpha) \leq \alpha$ for each variable $P_i$, if the $P_i$ are independent and exactly calibrated then, for $m \geq 2$, $$\text{Pr}_{H_0}(\min_i \, P_i \leq \alpha)  = 1 - \prod_{i=1}^m (1- \text{Pr}_{H_0}(P_i \leq \alpha) ) = 1-(1-\alpha)^m > \alpha.$$

By the same token, $\mathbb E_{H_0}[E_i]\leq 1$ for all $i = 1,\ldots,m$ does not imply that $\mathbb E_{H_0}[\max_i\, E_i]\leq 1$.
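A tiny exact calculation makes this concrete. The sketch below assumes $m$ independent experiments, each a single coin flip with $H_0$: heads with probability 0.5 and $H_1$: heads with probability 0.7, and uses the likelihood ratio as the e-value; the setup and names are illustrative only. Each individual e-value has expectation exactly 1 under the null, but the expectation of the maximum grows with $m$.

```python
from itertools import product

# One coin flip per experiment; H0: Pr(heads) = 0.5, H1: Pr(heads) = 0.7 (illustrative).
E_VALUE = {1: 0.7 / 0.5, 0: 0.3 / 0.5}   # likelihood ratio for heads (1) and tails (0)
PROB_H0 = {1: 0.5, 0: 0.5}               # outcome probabilities under the null

def expected_max_e_value(m):
    """Exact E_{H0}[max_i E_i] over m independent single-flip experiments."""
    total = 0.0
    for outcomes in product((0, 1), repeat=m):
        prob = 1.0
        for outcome in outcomes:
            prob *= PROB_H0[outcome]
        total += prob * max(E_VALUE[outcome] for outcome in outcomes)
    return total

for m in (1, 2, 5, 10):
    print(m, expected_max_e_value(m))
```

With $m = 1$ the expectation is 1, as the key e-variable property requires; for $m = 2$ it is already 1.2, so reporting the best of several e-values is no longer a valid e-value.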

A comment on safety vs objectivity

I found this particular passage in the appendix of Shafer’s paper kind of interesting. He writes:

Are the probabilities tested subjective or objective? The probabilities may represent someone’s opinion, but the hypothesis that they say something true about the world is inherent in the project of testing them.

In a way, this is the key point about a statistic being safe. We are not asking, “Did you set up your experiment so that it is totally blameless?” and thus imposing a bunch of guardrails that might needlessly add burden to the experiment (e.g. forcing samples to be totally independent, or prohibiting peeking, or even having the appearance of total impartiality toward the result). We are asking, “Given whatever experiment you decided to do, how reliable is it?” The question of objectivity vs subjectivity isn’t actually important at all, at the end of the day.

Concluding thoughts

To be honest, I haven’t wrapped my head around it enough to really paint a complete picture. (If you are a statistician, I do apologize if I totally botched the subject.) But I find the subject fascinating, an interesting interplay between sequential probabilities, optimality, and trustworthiness, and at least the first part is completely approachable to a general audience. Whether there really is a categorical improvement in using e-values over p-values, only time will tell; they still seem a bit new, despite their close relationships with sequential likelihood ratio tests. But in the meantime, the discussion on this seems quite invigorating!

Till next time, folks!


References 

  • Glenn Shafer (2021). Testing by betting: A strategy for statistical and scientific communication.
  • Vladimir Vovk & Ruodu Wang (2021). E-values: Calibration, Combination and Applications.
  • Peter Grünwald, Rianne de Heide & Wouter Koolen (2019). Safe Testing.

 
