Language Model Circuits Are Sparse in the Neuron Basis

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

This research empirically demonstrates that multilayer perceptron (MLP) neurons in language models provide a feature basis as sparse and interpretable as the bases learned by sparse autoencoders (SAEs). It introduces an end-to-end pipeline for circuit tracing on the MLP neuron basis, using gradient-based attribution. On a standard subject-verb agreement benchmark, approximately 100 MLP neurons were sufficient to control model behavior. On a multi-hop city-to-state-to-capital task, specific neuron sets were found to encode latent reasoning steps and could be steered to alter model output. By working directly with MLP activations and using the efficient RelP attribution method, the approach advances automated interpretability for models such as Llama 3.1-8B-Instruct without the additional training cost of SAEs.

Key takeaway

AI scientists and machine learning engineers focused on LLM interpretability should reconsider whether training sparse autoencoders is necessary. This work demonstrates that directly analyzing MLP activations with an efficient attribution method such as RelP can yield circuits as sparse and faithful as SAE-based ones. Adopting this neuron-level approach can reduce computational overhead and avoid the reconstruction error inherent in learned bases, providing a more direct path to understanding and steering model behavior.

Key insights

MLP activations, combined with RelP attribution, offer a sparse and faithful basis for language model circuit tracing, comparable to SAEs.

Principles

MLP activations are a privileged basis for sparsity.
The choice of gradient-based attribution method significantly affects circuit sparsity.
Faithful circuits reveal unverbalized internal reasoning steps.

Method

An end-to-end pipeline traces circuits on the MLP neuron basis. It applies RelP, a gradient-based attribution method, to MLP activations to localize the neurons that causally drive a behavior.
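
To make the idea of neuron-level attribution concrete, here is a minimal sketch on a toy one-hidden-layer ReLU MLP. It is not RelP, whose propagation rules are not described in this summary; it uses plain gradient-times-activation-difference scoring as a stand-in. The weights, the inputs, and the helper names (`hidden_activations`, `output_metric`, `neuron_attributions`) are all hypothetical.

```python
# Toy illustration only: gradient x activation-difference attribution on a
# one-hidden-layer ReLU MLP. This is a generic stand-in, NOT the RelP method.

def relu(v):
    return [max(0.0, x) for x in v]

def hidden_activations(W_in, x):
    """Post-ReLU activation of each hidden neuron for input x."""
    return relu([sum(w * xi for w, xi in zip(row, x)) for row in W_in])

def output_metric(w_out, acts):
    """Scalar metric (e.g. a logit difference) read off the hidden neurons."""
    return sum(w * a for w, a in zip(w_out, acts))

def neuron_attributions(W_in, w_out, x_clean, x_corrupt):
    """Score each neuron by (clean - corrupt activation) * d(metric)/d(activation).

    In this toy model the metric is linear in the hidden activations, so the
    gradient with respect to neuron j is simply w_out[j] and the scores sum
    exactly to the clean-minus-corrupt change in the metric.
    """
    a_clean = hidden_activations(W_in, x_clean)
    a_corrupt = hidden_activations(W_in, x_corrupt)
    return [(ac - ak) * g for ac, ak, g in zip(a_clean, a_corrupt, w_out)]

# Hypothetical weights: 4 hidden neurons, 2 input features.
W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
w_out = [2.0, -1.0, 0.5, 3.0]
x_clean, x_corrupt = [1.0, 0.0], [0.0, 1.0]

scores = neuron_attributions(W_in, w_out, x_clean, x_corrupt)
# Rank neurons by absolute attribution; a circuit keeps only the top few.
ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
```

In a real model the gradient would come from a backward pass rather than a fixed weight vector, and the sparsity of the resulting circuit depends on how that backward pass is defined, which is where a method such as RelP differs from this sketch.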

In practice

Trace causal circuits in Llama 3.1-8B-Instruct.
Steer specific neuron clusters to alter model output.
Identify neurons encoding multi-hop reasoning steps.
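
As a rough illustration of what steering a neuron set can look like, the sketch below rescales the activations of chosen hidden neurons in a toy one-hidden-layer ReLU MLP and compares the output metric before and after. The model, the weights, the `forward` helper, and the multiplier values are assumptions made for this example and are not taken from the paper.

```python
# Toy illustration only: steer a chosen set of hidden neurons by rescaling
# their activations, then compare the output metric before and after.

def forward(W_in, w_out, x, steer=None):
    """Run a one-hidden-layer ReLU MLP; `steer` maps neuron index -> multiplier."""
    steer = steer or {}
    metric = 0.0
    for j, (row, w) in enumerate(zip(W_in, w_out)):
        act = max(0.0, sum(wi * xi for wi, xi in zip(row, x)))
        act *= steer.get(j, 1.0)  # 0.0 ablates the neuron, other values rescale it
        metric += w * act
    return metric

# Hypothetical weights: 4 hidden neurons, 2 input features.
W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
w_out = [2.0, -1.0, 0.5, 3.0]
x = [1.0, 0.0]

baseline = forward(W_in, w_out, x)                  # no intervention
ablated = forward(W_in, w_out, x, steer={0: 0.0})   # switch neuron 0 off
flipped = forward(W_in, w_out, x, steer={0: -1.0})  # negate neuron 0's activation
```

Here intervening on a single neuron is enough to shrink the metric or change its sign. In a full-size model the same kind of intervention would be applied to a cluster of neurons selected by attribution.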

Topics

Language Model Interpretability
Circuit Tracing
MLP Activations
Sparse Autoencoders
RelP Attribution
Causal Interpretability

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.