Anthropic's circuit-tracing work frames interpretability around a hard standard: attribution graphs are useful only if they survive validation. In that setup, sparse features are not treated as ground truth. They are used to build prompt-specific mechanistic hypotheses, and the local replacement model is then checked by comparing its perturbation behavior to the underlying model.
My Mechanics of Meaning paper asks a narrower question at the same epistemic boundary. Instead of tracing a full circuit, I compare two intervention substrates at the same causal site: raw residual writeback and SAE-basis writeback. The question is not whether sparse features are intrinsically privileged. It is whether, under matched-site interventions, hard invariants, endpoint-native accounting, and disturbance accounting, a sparse basis actually buys cleaner causal control.
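To make the two substrates concrete, here is a minimal numpy sketch of a matched-site comparison. The SAE weights, the top-k feature-swap rule, and all function names are hypothetical stand-ins for illustration, not the paper's implementation: raw writeback overwrites the residual at the site with the donor residual, while SAE-basis writeback moves only a sparse set of feature activations and preserves the base reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 256

# Hypothetical stand-in: a random (untrained) SAE, for illustration only.
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
W_dec = rng.normal(0, 0.1, (d_sae, d_model))

def sae_encode(h):
    return np.maximum(h @ W_enc, 0.0)  # ReLU feature activations

def sae_decode(f):
    return f @ W_dec

def raw_patch(h_base, h_donor):
    # Raw residual writeback: replace the site wholesale with the donor
    # residual (h_base is unused; kept for a parallel signature).
    return h_donor.copy()

def sae_patch(h_base, h_donor, k=8):
    # SAE-basis writeback (illustrative rule): copy in the donor's top-k
    # feature activations, then add back the base reconstruction error so
    # only the sparse-basis component of the site changes.
    f_base, f_donor = sae_encode(h_base), sae_encode(h_donor)
    err = h_base - sae_decode(f_base)      # base reconstruction residual
    f_new = f_base.copy()
    top = np.argsort(f_donor)[-k:]         # donor's strongest features
    f_new[top] = f_donor[top]
    return sae_decode(f_new) + err

h_base = rng.normal(size=d_model)
h_donor = rng.normal(size=d_model)
h_raw = raw_patch(h_base, h_donor)
h_sae = sae_patch(h_base, h_donor)

# Disturbance accounting: how far each patch moves the site (RMS norm).
rms = lambda v: float(np.sqrt(np.mean(v ** 2)))
print("raw disturbance:", rms(h_raw - h_base))
print("sae disturbance:", rms(h_sae - h_base))
```

The point of the matched-site framing is visible even in this toy: both patches act at the same residual location, so any difference in downstream effect can be attributed to the basis of the writeback, not the site.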
The answer in MoM is deliberately limited. On Gemma 2 2B, I find an early, task-conditional sparse-precision regime on lexical disambiguation: at layer 4, SAE interventions exceed raw patching directionally on donor-directed effect and more clearly on RMS-based effect-efficiency. That paired layer-4 advantage is directional rather than conventionally decisive: by layer 8 the pattern reaches near parity, by layer 12 it reverses, and it does not generalize uniformly to the CF and COH tasks. Matched PCA, random-projection, and RECON/RESID controls narrow the space of simpler alternative explanations, but they do not license a universal sparse story.
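An RMS-based effect-efficiency can be sketched as effect per unit disturbance at the intervention site. This is a hypothetical formulation for illustration; the function name and the choice of a logit difference as the effect measure are assumptions, not the paper's exact definition:

```python
import numpy as np

def effect_efficiency(delta_logit, h_before, h_after):
    """Illustrative effect-efficiency: donor-directed effect (here, a
    logit movement) per unit RMS disturbance at the intervention site."""
    disturbance = float(np.sqrt(np.mean((h_after - h_before) ** 2)))
    return delta_logit / max(disturbance, 1e-9)

# Example: a patch that moves every residual coordinate by 1.0
# (RMS disturbance 1.0) and shifts the donor-directed logit by 2.0.
h_before = np.zeros(8)
h_after = np.ones(8)
print(effect_efficiency(2.0, h_before, h_after))  # → 2.0
```

Under a metric of this shape, a sparse-basis intervention can win either by producing a larger donor-directed effect or by achieving the same effect with a smaller footprint at the site, which is what distinguishes efficiency from raw effect size.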
MoM complements Anthropic's validation agenda. Circuit tracing asks whether a replacement model or graph is faithful enough to support mechanism claims; MoM asks a prior, narrower question: whether the SAE basis itself earns additional mechanistic trust as an intervention substrate, relative to matched baselines, before it is used to support richer circuit claims.
This connection is reinforced by Anthropic's July 2025 MOLT update, which suggests that transcoders can fracture computation into overly granular pieces and explores a potentially more faithful computational substrate. MoM resonates with that concern from a narrower angle: before treating any sparse decomposition as mechanistically informative, test whether interventions in that decomposition outperform matched baselines without introducing uncontrolled collateral effects.
The broader lesson is simple: explanation should follow validation. Readable features and compelling graphs matter, but the standard is whether a sparse basis earns additional mechanistic trust under explicit validation and matched intervention.