The Sparse Autoencoders bubble has popped, but they are still promising
Sparse Autoencoders are still a very exciting research direction to understand how neural networks work.

Mechanistic interpretability researchers were really excited about Sparse Autoencoders (SAEs) in 2023 and especially in 2024. Now I talk to people who tell me SAEs don’t work, implying I should work on something else. I think this is an overcorrection.
What are SAEs? Sparse autoencoders are a tool for interpretability research, useful for discovering the conceptual units that an LLM uses to think about things. They work by learning a dictionary of ‘concepts’, each of which is a direction in the LLM’s high-dimensional activation space. SAEs are trained to compress the LLM’s activation vector, at any point in the network, into a small set of active concepts from which the original activations can be reconstructed.
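To make the description above concrete, here is a minimal NumPy sketch of a basic ReLU SAE with an L1 sparsity penalty. The names (`sae_forward`, `sae_loss`, `W_enc`, `W_dec`, `l1_coeff`), the sizes, and the random untrained weights are all assumptions for illustration; real SAEs are trained, and they differ in details such as how the biases and the sparsity penalty are handled.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One forward pass of a basic ReLU sparse autoencoder (illustrative)."""
    # Latent activations: how strongly each dictionary concept fires.
    latents = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Reconstruction: a weighted sum of concept directions (rows of W_dec).
    x_hat = latents @ W_dec + b_dec
    return latents, x_hat

def sae_loss(x, latents, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(latents), axis=-1))
    return recon + l1_coeff * sparsity

# Toy usage with random, untrained weights (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, n_latents = 16, 64
x = rng.normal(size=(8, d_model))  # stand-in for 8 LLM activation vectors
W_enc = rng.normal(size=(d_model, n_latents)) * 0.1
W_dec = rng.normal(size=(n_latents, d_model)) * 0.1
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)

latents, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, latents, x_hat)
```

Note that the dictionary (`n_latents`) is larger than the activation dimension (`d_model`); the compression comes from only a few latents being active for any given input once the SAE is trained.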
Do SAEs work? SAEs kind of work now, and promise to work quite well in the future. I believe this, despite the fact that SAEs did not help me at all in my own interpretability research.[1] I didn’t follow the hype at its peak in 2024, so I have avoided disappointment.
I think SAEs could work really well. They are humanity’s best attempt to date at decomposing the concepts used by an LLM. And finding those concepts is the only way to make progress on fully understanding how the LLMs think. We should keep trying to get sparse autoencoders right.
The hype cycle until now
Who invented SAEs, and what for?
Chris Olah co-founded Anthropic, where his team tried to understand how LLMs make decisions. But they ran into a problem: unlike the networks Olah had previously worked with, transformer residual streams have no privileged directions that are likely to represent simple (as opposed to composite) concepts.
In 2023, two teams invented sparse autoencoders to try to figure out which directions represented simple concepts: Cunningham et al. at Conjecture, and Bricken et al. at Anthropic.
What happened when people got excited?
Sparse autoencoders quickly became a very popular topic in mechanistic interpretability. A lot of this was due to Neel Nanda’s charisma and the sheer number of excitable young people he was mentoring. Anthropic briefly released Golden Gate Claude, which was made by training an SAE and turning the “Golden Gate” concept up, so that the LLM talked about the Golden Gate constantly.
A bunch of innovations in SAE architectures and training practice followed. SAEs rely on sparsity, which is a difficult objective to optimize, so it’s not obvious how to train them. Here are some of the methods invented during this time.
‘Ghost grads’ to prevent the latents from dying due to the L1 penalty
Top-K activations, then Batch Top-K (now standard in open-source SAEs).
Crosscoders: SAEs trained on several models simultaneously, which can show which concepts the models have in common.
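As an illustration of the Top-K and Batch Top-K ideas in the list above, here is a small NumPy sketch of the two activation functions. The function names are hypothetical, and clamping the kept values at zero is an assumption; implementations differ on that detail. The difference is that Top-K enforces exactly `k` active latents per input, while Batch Top-K only enforces `k` per input on average across the batch.

```python
import numpy as np

def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations in each row; zero out the rest."""
    out = np.zeros_like(pre_acts)
    # Column indices of the k largest entries in each row (unordered).
    idx = np.argpartition(pre_acts, -k, axis=-1)[:, -k:]
    rows = np.arange(pre_acts.shape[0])[:, None]
    out[rows, idx] = np.maximum(pre_acts[rows, idx], 0.0)
    return out

def batch_topk_activation(pre_acts, k):
    """Keep the k * batch_size largest pre-activations across the whole batch."""
    n_keep = k * pre_acts.shape[0]
    flat = pre_acts.ravel()
    idx = np.argpartition(flat, -n_keep)[-n_keep:]
    out = np.zeros_like(flat)
    out[idx] = np.maximum(flat[idx], 0.0)
    return out.reshape(pre_acts.shape)

# Two inputs: one with several strong latents, one with mostly weak ones.
pre_acts = np.array([[5.0, 4.0, 3.0, 0.1],
                     [0.2, 0.3, 0.1, 0.05]])
per_row = topk_activation(pre_acts, k=2)
per_batch = batch_topk_activation(pre_acts, k=2)
```

In this toy example, Top-K keeps two latents for each input, whereas Batch Top-K spends three of its four slots on the first input and only one on the second.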
Growing discontent
SAEs don’t explain the residual activations very well. Despite having many latents, they make a lot of errors when trying to reconstruct the LLM’s activation vector. Even now, they only explain about 80% of the variance.
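For reference, here is one common way to compute a "fraction of variance explained" number for a reconstruction. This is a sketch under assumptions: the function name is hypothetical, and the text does not say exactly which definition lies behind the 80% figure.

```python
import numpy as np

def fraction_variance_explained(x, x_hat):
    """1 - (squared reconstruction error) / (total variance around the mean)."""
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - residual / total

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 16))  # stand-in for LLM activations

# A perfect reconstruction explains all of the variance.
fve_perfect = fraction_variance_explained(x, x)
# Always predicting the mean activation explains none of it.
fve_mean = fraction_variance_explained(x, np.broadcast_to(x.mean(axis=0), x.shape))
```

Under this definition, 80% means the squared reconstruction error is one fifth of what you would get by always predicting the mean activation.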
Various reasons accumulated to think maybe SAEs didn’t work:
Lee Sharkey’s group gave up on SAEs, for theoretical reasons about what they believe the nature of concepts in Transformers is. They started working on weight decomposition instead. This is similar to an SAE, but decomposes weights (that is, functionality) instead of concepts.
A paper by Heap et al. (2025) showed that the SAE + auto-interpretability pipeline seemed to work just as well on randomly initialized models as on trained models.
Not all the results from that paper are negative. The paper has a careful analysis that tries to find the reason why SAEs aren’t better on trained transformer models. One of the findings is that they are better on small neural networks for which the concepts present in the inputs are known.
To me, it seems more likely that these results show auto-interpretability is still pretty bad. I also think auto-interpretability has promise: it is bottlenecked by AI writing specific descriptions, and that will get better.
The inflection point seemed to be when the Google DeepMind interpretability team wrote that they were disappointed with SAEs, were stopping work on them, and were starting to work on probes instead.
(Probes are also linear features, just like SAE concepts. They’re trained with logistic regression to predict particular simple labels. They work incredibly well at e.g. detecting strategic lies.)
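As a rough illustration of what training a probe involves, here is a minimal logistic-regression probe fitted with plain gradient descent in NumPy. Everything here is an assumption for the sake of the example: the synthetic "activations", the planted label direction, the function name `train_probe`, and the hyperparameters. In practice one would use real model activations and a library implementation of logistic regression.

```python
import numpy as np

def train_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

# Synthetic data: the label shifts activations along one hidden direction.
rng = np.random.default_rng(0)
direction = rng.normal(size=32)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=400).astype(float)
acts = rng.normal(size=(400, 32)) + 3.0 * np.outer(labels - 0.5, direction)

w, b = train_probe(acts, labels)
preds = (acts @ w + b) > 0
accuracy = (preds == labels.astype(bool)).mean()
```

The learned `w` is a single direction in activation space, which is why a probe is comparable to one SAE concept. The difference is that a probe needs labels for the concept you care about, while an SAE tries to find its directions without supervision.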
SAEs work decently well
They work OK for finding concepts. Many things they find are uninterpretable, but a good portion of them are interpretable!
Thanks to Sam Marks’ work, it’s possible to investigate circuits based on SAE values, even though they do not explain the residual very well.
The trick is to add an “error” node that contains the difference between the model’s residual and the SAE’s reconstruction, so that the residual is exactly the sum of the two.
If your circuit is well described by what the SAE predicts, you’ll see that the error node doesn’t belong in the circuit. If your SAE isn’t capturing the relevant features very well, then you can see that the error node is very important and definitely belongs in the circuit.
This is what Anthropic used for the large number of circuits they investigated in “On the Biology of a Large Language Model”.
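The error-node idea can be sketched in a few lines of NumPy. The attribution used here (gradient times activation, a first-order approximation) is an assumption chosen to keep the example short, as are the function name and the random stand-in values; it is not meant as a description of the exact procedure used in the work mentioned above.

```python
import numpy as np

def attribute_with_error_node(x, latents, W_dec, b_dec, grad):
    """Split a first-order effect on some metric between SAE latents and the error node.

    x: (d_model,) residual activation; latents: (n_latents,) SAE activations;
    W_dec: (n_latents, d_model); grad: (d_model,) gradient of the metric w.r.t. x.
    """
    x_hat = latents @ W_dec + b_dec          # what the SAE reconstructs
    error = x - x_hat                        # what the SAE misses
    latent_attr = latents * (W_dec @ grad)   # one score per latent
    error_attr = error @ grad                # score for the error node
    return x_hat, error, latent_attr, error_attr

# Random stand-ins for a real model and SAE (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, n_latents = 16, 64
x = rng.normal(size=d_model)
latents = np.maximum(rng.normal(size=n_latents) - 1.0, 0.0)
W_dec = rng.normal(size=(n_latents, d_model)) * 0.1
b_dec = np.zeros(d_model)
grad = rng.normal(size=d_model)

x_hat, error, latent_attr, error_attr = attribute_with_error_node(
    x, latents, W_dec, b_dec, grad
)
```

Because `x_hat + error` equals `x` exactly, nothing is lost by analysing the circuit in terms of latents plus the error node. Comparing `error_attr` with the largest entries of `latent_attr` gives a rough sense of whether the SAE is capturing what matters for that metric.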
We started running “top-k prediction” evaluations (SAEBench), where all the SAE features are tested to see if they predict something simple, like whether a word starts with A.
Many of the simple concepts are well predicted by some SAE feature.
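A simplified version of this kind of evaluation, for the single-latent case, might look like the sketch below. The threshold rule, the function name `best_single_latent`, and the toy data (where latent 7 is planted to fire exactly when the label is true) are assumptions for illustration and are not taken from SAEBench itself.

```python
import numpy as np

def best_single_latent(latents, labels):
    """Return (index, accuracy) of the latent that best predicts a binary label."""
    labels = labels.astype(bool)
    pos_mean = latents[labels].mean(axis=0)
    neg_mean = latents[~labels].mean(axis=0)
    # Threshold each latent halfway between its two class means.
    thresholds = (pos_mean + neg_mean) / 2.0
    # Predict "positive" when the latent is on the positive-class side.
    preds = (latents > thresholds) == (pos_mean > neg_mean)
    accs = (preds == labels[:, None]).mean(axis=0)
    best = int(np.argmax(accs))
    return best, float(accs[best])

# Toy data: 50 sparse-ish latents, with latent 7 planted to track the label
# (think of the label as "the word starts with A").
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
latents = np.maximum(rng.normal(size=(500, 50)) - 1.0, 0.0)
latents[:, 7] = labels * 2.0

best, acc = best_single_latent(latents, labels)
```

On real SAEs the interesting quantity is how high this accuracy gets for each simple concept, and how it compares with a probe trained directly on the activations.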
There is hope for better SAEs
Here are some ways in which SAEs can improve.
SAEs rely on the linear + superposition model of concepts, posited by “Toy models of superposition”. Perhaps once we find an alternative model for concept representation in LLMs, they’ll just start working. One example is the Minkowski representation hypothesis: that features are intermediate points between archetypes, rather than linear directions.
Personally, I think we need to lock in on engineering and try lots of different things. For example:
Use previous tokens to decide what latents to choose
Perhaps the model uses concepts from previous tokens to decide what to represent in the current residual. We could use the attention weights from the model itself, and just train an adapter to do SAE latent selection.
Try various things to optimize the crazy array of latents. For example, reinforcement learning on choosing which latents to sample.
What if we kept adding latents and ruthlessly discarding the ones that don’t work? This doesn’t work as well as ghost grads when optimizing the latents directly with gradient descent, without RL. But it could work if we upweight things that work well and downweight things that work less well, as policy gradient methods do.
[1] This is because I worked with small recurrent neural networks, not large transformers, so we could do manual analysis (which is still better, but is too expensive at scale).
