Statistical tests are complicated because their inventors did not have fast computers
Instead of thinking hard about the exact mathematical form of the null hypothesis, just simulate it with Monte Carlo. The bitter lesson applies to classical statistics.
Students of frequentist statistics typically learn about a few statistical tests: their names (usually of the people who invented them), and some idea of what kind of data they apply to. For example:
Student’s t-test applies to continuous measurements and can check whether the average value of some measurement is different in two populations. For example, whether on average sheep from Wales or New Zealand have different heights.
The Kolmogorov-Smirnov test checks whether a set of samples comes from a known probability distribution, e.g. whether they are Gaussian. (Or whether two different sets of samples come from the same distribution.)
The technical conditions are sometimes difficult to understand quickly, so often people just learn specific use cases. For example, all I knew about the Mann-Whitney U-test is that you can use it to check whether your new algorithm or an existing algorithm is better on a dataset of test cases.1 When writing a paper proposing a new algorithm, you run both the new and the old algorithm, check whether p<0.05, and you’re done.
All this is to say, tests are complicated. Understanding when and how to apply each test is complicated. The assumptions are complicated, the formulas are complicated, and communicating the results in an understandable manner is complicated. If you dig into the papers describing these tests, you’ll find them full of jargon, long derivations, scary-looking probability distributions, proofs that the tests are valid, etc.
One would be tempted to think that it is therefore very difficult to create new tests. If your data don’t fit any named test well, you pick some aspect of them that does fit one, and just run that test.
However, I contend that in many cases, tests are complicated because the people that invented them didn’t have fast computers.2 The idea of a statistical test is very simple and, once you have Monte Carlo methods with huge sample sizes, you don’t need much more than that.
This is the CDF of Student’s t-distribution. William Gosset had to derive it and prove that the standardized difference of sample means followed it. Looks pretty hard!
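(The original figure is an image; as a reconstruction from the standard literature, the CDF for nu degrees of freedom can be written using the Gauss hypergeometric function:

$$F(t) \;=\; \frac{1}{2} \;+\; t\,\frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\;\Gamma\!\left(\frac{\nu}{2}\right)}\;{}_2F_1\!\left(\frac{1}{2},\,\frac{\nu+1}{2};\,\frac{3}{2};\,-\frac{t^2}{\nu}\right)$$

which should give a flavor of the kind of derivation involved.)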
What’s in a test?
The ingredients you need to craft a statistical test about some phenomenon are:
A null hypothesis about what is going on, which you will attempt to refute.
A measurable quantity that behaves differently under the null and alternative hypotheses.
A version of the null hypothesis that is concrete enough to run as a computer program.
To actually run the test, you:
Measure the quantity from your real data, and obtain some real measurement.
Simulate taking the same measurement from the null hypothesis many times, and obtain a simulated measurement for each simulation.
Apply the definition of p-value: the proportion of simulated measurements that are more extreme than the real measurement is your p-value.
That’s it! With the magic of Monte Carlo methods and a little CPU time, we have computed approximate p-values for your test. It’s even possible to convert this approximate p-value to an exact one using a little slack and Chernoff-Hoeffding bounds (see appendix). For a fixed amount of slack, using more simulations makes the probability that the bound fails decrease exponentially.
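As a sketch, the whole recipe fits in a few lines of Numpy. `monte_carlo_p_value`, `statistic`, and `simulate_null` are my own placeholder names, not from the article; you supply the latter two for your problem. The coin-flip usage example is also mine.

```python
import numpy as np

def monte_carlo_p_value(real_data, statistic, simulate_null,
                        n_sims=100_000, seed=0):
    """Proportion of simulated statistics at least as extreme as the real one."""
    rng = np.random.default_rng(seed)
    real_stat = statistic(real_data)
    sim_stats = np.array([statistic(simulate_null(rng)) for _ in range(n_sims)])
    # Convention: "as extreme or more" counts toward the p-value.
    return np.mean(sim_stats >= real_stat)

# Toy usage: we saw 60 heads in 100 flips; is the coin fair?
p = monte_carlo_p_value(
    real_data=60,
    statistic=lambda heads: heads,                     # the measurement itself
    simulate_null=lambda rng: rng.binomial(100, 0.5),  # a fair coin
)
print(p)  # about 0.028
```

Everything problem-specific lives in the two callables; the test procedure itself never changes.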
Example: the computational Student’s t-test
We want to know whether Welsh and Kiwi sheep have the same average height at the shoulder.
Let’s collect our ingredients:
The null hypothesis is that Height(Wales) = Height(New Zealand).
We will measure the average height of Welsh and Kiwi sheep and subtract them. That’s our quantity, which is large if the heights are different and close to 0 if they’re the same.
To simulate the null, we assume sheep height is normally distributed with standard deviation S(W) in Wales and S(NZ) in New Zealand3. For simplicity we assume that both are equal to 20cm.
We collect the height of N=30 sheep from each country. We measure 77.2cm for Wales and 82.1cm for New Zealand, for a difference of 4.9cm.
Then, we write a computer program that simulates measuring sheep under the null hypothesis. We draw 30 random normal variables with the same mean for each of the two countries. We average Kiwi and Welsh sheep separately and then subtract the results, like in the real experiment. We do this 1 million times and count the number of times the absolute difference was larger than 4.9cm. That happened 34.3% of the time, so our p-value is 0.343. Thus we cannot reject the null.
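That program might look like the following (my own reconstruction, not the article's actual code; I use 200,000 simulations instead of 1 million to keep memory modest):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 30               # sheep measured per country
SD = 20.0            # assumed standard deviation of sheep height, in cm
OBSERVED_DIFF = 4.9  # the real measured difference of averages, in cm
N_SIMS = 200_000     # the article uses 1 million

# Under the null both countries share the same mean height; its exact
# value cancels in the difference, so we can set it to zero.
wales = rng.normal(0.0, SD, size=(N_SIMS, N)).mean(axis=1)
kiwi = rng.normal(0.0, SD, size=(N_SIMS, N)).mean(axis=1)

# Fraction of simulated experiments whose absolute difference of
# averages exceeds the one we actually observed.
p_value = np.mean(np.abs(kiwi - wales) > OBSERVED_DIFF)
print(p_value)  # about 0.34
```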
Discussion
We created a rough equivalent to Student’s t-test using computational brute force. We did not have to invent the t-distribution, nor prove that the averages divided by the standard deviation would follow it. We just wrote a short Numpy program.
This really simple test would break if the real standard deviations of sheep height were not 20cm. We could deal with this by using a grid of variances and picking the ‘least convenient’ one every time (the one with highest p-value), but this could be brittle if the grid is too coarse.
Through experimentation, we might find that dividing the difference of means by the observed standard deviation makes the test more robust or powerful. (We can computationally measure how good a test is by computing the false acceptance and false rejection rates over various alternative hypotheses, including those that break the assumptions of the test.)
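That measurement can itself be done by simulation. A hypothetical sketch for the simple mean-difference test above (the 10cm alternative and all names are my illustrative choices, not from the article):

```python
import numpy as np

rng = np.random.default_rng(1)
N, SD, ALPHA, SIMS = 30, 20.0, 0.05, 20_000

def mean_diffs(shift):
    """Absolute differences of sample means over SIMS simulated experiments."""
    a = rng.normal(0.0, SD, (SIMS, N)).mean(axis=1)
    b = rng.normal(shift, SD, (SIMS, N)).mean(axis=1)
    return np.abs(b - a)

# Rejection threshold: the (1 - ALPHA) quantile under the null (no shift).
threshold = np.quantile(mean_diffs(0.0), 1 - ALPHA)

false_rejection = np.mean(mean_diffs(0.0) > threshold)  # should be near ALPHA
power = np.mean(mean_diffs(10.0) > threshold)           # true 10 cm difference
print(false_rejection, power)
```

Repeating this with alternatives that violate the test's assumptions (e.g. a different true standard deviation) measures its robustness in the same way.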
Conclusion
Many things become easy if you have computational power. The bitter lesson applies to statistics as well as AI.
Code for this article is available here.
Appendix: making the p-value real using Chernoff-Hoeffding bounds
Our Monte Carlo estimator of the p-value is the average of many Bernoulli variables X_i with unknown parameter p. Because we’re worried about the p-value being falsely too small, we’d like a bound on the probability that the mean undershoots the real Bernoulli parameter by more than some safety margin epsilon:

$$\Pr\left[\bar{X}_N \le p - \epsilon\right]$$

Fortunately, the Chernoff-Hoeffding theorem tells us the following:

$$\Pr\left[\bar{X}_N \le p - \epsilon\right] \;\le\; e^{-2N\epsilon^2} \;=\; \left(e^{-2\epsilon^2}\right)^N,$$

that is, the bound is a constant that depends on epsilon, raised to the power of N.
How do we incorporate this into a p-value? Well, the p-value is supposed to be the maximum allowed Type-I (false positive) error. Suppose we target a p-value, that is a total Type-I error probability, of alpha. Then we can check whether the average of the X_i is below alpha minus the safety margin minus the probability of the bound failing, that is:

$$\bar{X}_N \;\le\; \alpha - \epsilon - e^{-2N\epsilon^2}.$$
By increasing N and tuning epsilon we make this threshold easier to achieve. The optimal epsilon for a particular N can be found with a grid search; it decreases as N grows.
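A sketch of that grid search, assuming the quantity being minimized is the total slack f(eps) = eps + exp(-2*N*eps^2), i.e. the safety margin plus the probability of bound failure (this objective is my reading of the setup, though it does reproduce the table that follows):

```python
import numpy as np

def total_slack(eps, n):
    # Safety margin plus the probability that the Hoeffding bound fails.
    return eps + np.exp(-2 * n * eps**2)

for k in range(1, 9):
    n = 10**k
    grid = np.linspace(1e-6, 0.5, 1_000_000)  # fine grid of candidate eps
    f = total_slack(grid, n)
    best = np.argmin(f)
    print(f"N=10^{k} eps={grid[best]:.3e} f(eps)={f[best]:.6f}")
```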
N=10^1 eps=3.667e-01 f(eps)=0.434623
N=10^2 eps=1.421e-01 f(eps)=0.159724
N=10^3 eps=5.165e-02 f(eps)=0.056468
N=10^4 eps=1.814e-02 f(eps)=0.019526
N=10^5 eps=6.260e-03 f(eps)=0.006655
N=10^6 eps=2.126e-03 f(eps)=0.002245
N=10^7 eps=7.161e-04 f(eps)=0.000751
N=10^8 eps=2.393e-04 f(eps)=0.000250

1. Turns out, it’s similar in use to the Student t-test, but instead of comparing the values themselves it compares their ranking, which lets it make fewer assumptions.
2. This is of course not strictly true. If your null hypothesis is what’s called a composite hypothesis (that is, a set of probabilistic models, see e.g. this), then the brute-force computational approach I present is more difficult or impossible. This is because you need a way to pick the least convenient possible null at every point, and that can get annoying if your set has too many parameters. Some classic tests are even non-parametric, which means they work with huge abstract sets of null hypotheses. These too can be overcome with enough computation, but it can get impractical.
3. Yes, the real t-test doesn’t need to assume a variance. See the discussion.


https://allendowney.blogspot.com/2016/06/there-is-still-only-one-test.html “There is only one test” makes the same point: you need a test statistic, a null hypothesis, and a way to compute how unlikely any value of the test statistic is under the null. In the past, that last ingredient was what the classical statistical tests approximated analytically. Nice work!
This article reminds me of “Common statistical tests are linear models”. In general, the way hypothesis testing is taught is needlessly complicated, giving students the wrong impression that they need to remember what each test does in a given setting (even though it’s much easier to think of these tests as variations of a linear model, or in terms of simulation as you argue).
https://lindeloev.github.io/tests-as-linear/