Two rules for good research taste
Avoid getting nerd-sniped by methods with attractive equations
When you’re expanding the frontier of useful knowledge, the space of possible problems to work on and possible solutions is far too large to search exhaustively. Because our frontier has expanded enormously since the origins of the Baconian project, no matter how ambitious you are, you need to focus on some area of study. Within that area there is a deluge of books and papers and theories, and you need some way to filter them. How?
Working scientists have research taste: heuristics that we use to judge which problems are worth tackling and which papers are worth reading.1 Research taste is deeply personal: each scientist has their own preconceptions that shape their beliefs about what’s good and what isn’t. At the same time, taste is objective: ideally it picks out, precisely, which problems can be solved and which approaches will work. Taste is forged by repeated contact with reality, so it converges; but expanding the frontier requires novelty (or relentless optimization), so it diverges.
Taste is tacit knowledge that cannot be effectively transmitted without direct experience. However, it is possible to describe it in broad strokes. In this essay, I will transmit to you some patterns I have found that make some research problems taste good and others taste bad.
What does ‘research taste’ feel like?
To me, research taste manifests as reflexive skepticism or excitement when encountering a new idea or considering a new problem. In extreme cases, I get really agitated and my heart beats faster, or my throat clenches and I feel disgust. It’s entangled with, but distinct from, the tacit knowledge of how to write papers, run empirical projects effectively, or prove theorems. In machine learning at least, it does manifest in the day-to-day of empirical work: how should I think about this subproblem, what should I measure, when should I give up on this approach?
Research taste also changes over time, with experience. For example, as a young apprentice, I was pretty good at programming and wanted to do empirical work in machine learning. But I was fascinated by methods with strong theoretical justification, and by mathematics I could barely grasp. I believed that, precisely because it was so inaccessible, mastering it would give me access to truths few others could reach, which would allow me to make Real Progress. I regretted not studying mathematics earlier.
In reality, I was wrong and misguided. The experience was extremely depressing: I spent years climbing the mountain of Bayesian machine learning only to understand, upon reaching the summit, the reasons it would never work. I felt like I had been lied to, scammed; all these illustrious professors and experienced postdocs must have known (how could they not?) and had just never told me how useless it would be. But no: academia is full of nerd-sniping memes, psychofauna, ideas that for some reason latch onto people while failing to pay their rent in correct predictions, over and over.2
Anyways, here are my actual rules about taste.
Theory before validating experiments is putting the cart before the horse
If a research direction requires lots of math but offers ‘promise’ instead of impressive results, it is slop.
Central examples for me are Koopman operators and Bayesian machine learning (not structured Bayesian statistical models; those work).3 The actual empirical evidence that Bayesian deep learning specifically improves over its non-Bayesian counterparts is thin,4 and Gaussian processes were on their way out when I started working on them. The main reasons Bayes continues to attract people are the philosophical arguments that it is ‘optimal’ (Dutch-booking) and widespread, incorrect beliefs about how valuable it is to hold beliefs in the form of probability distributions.
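For readers who haven’t met the term, the Dutch-book argument in its simplest form (the numbers here are my own illustration) goes like this: suppose your degrees of belief are P(rain) = 0.6 and P(no rain) = 0.6, which sum to 1.2 and so violate the probability axioms. If you will pay P(E) for a ticket that pays 1 whenever E occurs, I can sell you one ticket on each outcome for 0.6 + 0.6 = 1.2; exactly one ticket pays out 1, so you lose 0.2 no matter what happens. Coherent, probabilistic beliefs are exactly the ones that cannot be exploited this way. Note that the argument says nothing about whether computing and carrying around a full posterior is worth it in practice, which is the question that actually matters.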
Other examples: neuro-symbolic AI, rigorous logics for knowledge representation, Kolmogorov-Arnold Networks, physics-informed neural networks, probabilistic programming languages, PAC-Bayes (a frequentist technique, despite the name), and generalization bounds in general. Singular learning theory also tastes really bad to me, though I am mollified by the fact that the people running Timaeus are going into it wary of being nerd-sniped.
Sometimes a theory requires complicated math to learn only because computation was expensive when it was developed. But if a modern theory is like this, beware: it has gained mind-share not by being useful, but by nerd-sniping people like you.
Very large numbers exist, but infinities do not
If something emerges only in the limit of unbounded numbers, instead of merely very large numbers, it’s usually slop.
For example: measure theory is the mathematical foundation of probability. But it is only necessary when you have infinite sets of possible events. In a finite sample space, every possible event corresponds to a measurable subset, so there is no need to develop the theory in the first place. If your clever probabilistic argument depends on defining things in a way that a very-high-resolution finite partition cannot approximate, it does not describe reality and you have to judge it as mathematical performance art.
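To spell out the finite case (a sketch in standard notation, nothing beyond the usual definitions):

$$\Omega = \{\omega_1, \dots, \omega_n\}, \qquad \mathcal{F} = 2^{\Omega}, \qquad P(A) = \sum_{\omega \in A} p(\omega) \ \text{ for any } A \subseteq \Omega,$$

with p(ω) ≥ 0 and ∑ p(ω) = 1. The power set 2^Ω is trivially a σ-algebra, every subset is an event, and countable additivity collapses into finite sums; none of the measure-theoretic machinery does any work here.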
Proofs of convergence in Markov chain Monte Carlo are like that as well. The ergodic theorem guarantees that, as the number of iterations N → ∞, the average of the samples converges (with probability one) to the true expectation you’re estimating. In reality, however, there is no reliable way to check that an MCMC chain has converged, and that you have explored enough of the distribution to trust that average. (Some diagnostics work if you have a lot of compute relative to the size of the problem, which is not the case for neural networks.)
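For reference, the guarantee in question is (informally, for a chain x_1, x_2, … with stationary distribution π and a π-integrable function f):

$$\frac{1}{N}\sum_{n=1}^{N} f(x_n) \;\longrightarrow\; \mathbb{E}_{\pi}[f(X)] \quad \text{as } N \to \infty, \text{ with probability one.}$$

The theorem is silent about how large N has to be for your particular chain and target, which is precisely where it stops constraining practice.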
Another example: limits that do not arise from the collective action of many individual parts, such as the Gaussian process limit of infinitely-wide neural networks or the Neural Tangent Kernel, probably fail to capture something important about their ostensible object of study, neural networks. However, it is true that the NTK was a useful tool for proving convergence (in finite but still asymptotic time) of SGD, and that it spawned the maximal update parameterization (µP); I’m picking on the single most successful deep learning theory here. Others are worse.
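For the record, the object in question, stated informally and in my own notation: the NTK of a network f_θ is the kernel

$$\Theta(x, x') = \nabla_\theta f_\theta(x) \cdot \nabla_\theta f_\theta(x'),$$

and the infinite-width results say that under gradient descent this kernel stays (approximately) frozen at its value at initialization, so the network trains like a linear model on fixed features. That frozen-features property is exactly what finite networks, which do learn features, fail to satisfy.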
‘Mathiness’ and various proofs of things in machine learning papers are also like this.
Build on successful technology
Absent strong empirical results, or a good intuition for why a new kind of approach to AI would work, take what has been established as clearly working and build on top of it. Use deep learning! Start with open-source repositories and extend them! It is a bad idea to try to reinvent the foundations of AI.
1. The earliest reference to ‘research taste’ I could quickly find is this 2009 note by Dah Ming Chiu, but I learned about it from Chris Olah’s essay.
2. Yes, I think I reached the summit. I understood and tried basically all the ideas in the field of Bayesian machine learning (variational inference, amortization, normalizing flows, Hamiltonian Monte Carlo, Gibbs sampling, Gaussian processes, stochastic-gradient MCMC, inducing points, Power Expectation Propagation, Stein gradient estimators) and a few more besides (the kookiest was Koopman operators). Nothing surprised me anymore. Like a deranged Markov chain model, I could combinatorially generate new ones too: I have a paper draft for a novel method for inducing-point Gaussian process inference with non-conjugate likelihoods that alternates Gibbs sampling and variational inference, much faster and more precise than gradient-descent methods for this narrow task. If you don’t understand all the jargon, don’t worry; it’s not actually important.
3. The other thing that gets me is that I got into Bayesian ML because, at the time I started my PhD, PILCO was the only successful model-based RL algorithm, and it succeeded because it estimated the state’s uncertainty. Unfortunately this was sort of a one-off, though my senior colleague was able to transfer it to ensembles of neural networks.
4. There is a tiny bit of evidence, for which I cannot find a citation, so you’ll have to believe me. I have witnessed that averaging the predictions of many Bayesian model checkpoints generated over the course of training outperforms any single Bayesian checkpoint, and also outperforms both single SGD checkpoints and the average of SGD checkpoints’ predictions.
