There is a recurring fantasy in interpretability work, somewhere between a wish and an embarrassment. You stare at a residual stream activation (twelve thousand floats) and you want to ask it, in plain English: what are you thinking about? Sparse autoencoders give you a thousand sparse latents that you then label by inspecting top-activating examples. Attribution graphs give you sprawling diagrams a researcher spends an afternoon parsing. Probes give you a yes/no. All useful. None of them talk back. Anthropic's new paper, Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations, is the first interpretability artifact in a while where the activation talks back. Literally. You point an NLA at a token in a Claude Opus 4.6 transcript and it produces a few bullet points of English describing what the model is thinking. That's the deliverable. The paper is mostly an investigation of whether you should believe it.