Language Model Circuits Are Sparse in the Neuron Basis
Summary
This research shows empirically that Multilayer Perceptron (MLP) neurons in language models provide a feature basis as sparse and interpretable as one derived from Sparse Autoencoders (SAEs). It introduces an end-to-end pipeline for circuit tracing on the MLP neuron basis using gradient-based attribution. On a standard subject-verb agreement benchmark, approximately 100 MLP neurons were sufficient to control model behavior. On a multi-hop city-to-state-to-capital task, specific sets of neurons were found to encode latent reasoning steps, and steering them altered the model's output. Because the pipeline works directly on MLP activations and uses the efficient RelP attribution method, it advances automated interpretability for models such as Llama 3.1-8B-Instruct without the additional training cost of SAEs.
Key takeaway
AI scientists and machine learning engineers working on LLM interpretability should reconsider whether training Sparse Autoencoders is necessary. This work demonstrates that directly analyzing MLP activations with an efficient attribution method such as RelP can yield circuits that are equally sparse and faithful. Adopting this neuron-level approach can reduce computational overhead and avoid the reconstruction errors inherent in learned bases, providing a more direct path to understanding and steering model behavior.
Key insights
MLP activations, combined with RelP attribution, offer a sparse and faithful basis for language model circuit tracing, comparable to SAEs.
Principles
- MLP activations are a privileged basis for sparsity.
- The choice of gradient-based attribution method significantly affects circuit sparsity.
- Faithful circuits reveal unverbalized internal reasoning steps.
Method
An end-to-end pipeline traces circuits on the MLP neuron basis. It applies RelP, a gradient-based attribution method, to MLP activations to localize the neurons that causally drive a given behavior.
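To make the idea of attribution on the neuron basis concrete, here is a minimal sketch on a toy one-layer MLP. It uses plain activation-times-gradient scoring as a stand-in and is not the RelP rule itself. The weights, the input, and every function name are hypothetical and chosen only for illustration.

```python
# Toy stand-in: one hidden ReLU layer in pure Python, no real model involved.
# The score is activation x gradient, NOT the RelP attribution rule.

def relu(x):
    return x if x > 0.0 else 0.0

def hidden_activations(x, w_in):
    """ReLU activation of each hidden neuron for the input vector x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]

def metric(acts, w_out):
    """Scalar behavior metric: a linear readout of the hidden activations
    (it plays the role of a logit difference between two candidate tokens)."""
    return sum(w * a for w, a in zip(w_out, acts))

def neuron_attributions(x, w_in, w_out):
    """Activation x gradient score for each hidden neuron.

    The metric is linear in the activations, so d(metric)/d(act_i) is
    w_out[i] and the score reduces to act_i * w_out[i].
    """
    acts = hidden_activations(x, w_in)
    return [a * g for a, g in zip(acts, w_out)]

def top_k_neurons(scores, k):
    """Indices of the k neurons with the largest absolute attribution."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return order[:k]

# Hypothetical weights and input, chosen only for illustration.
W_IN = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0], [0.5, 0.5]]
W_OUT = [2.0, -0.5, 3.0, 0.1]
X = [1.0, 2.0]

scores = neuron_attributions(X, W_IN, W_OUT)
circuit = top_k_neurons(scores, k=2)
print("attributions:", scores)
print("candidate circuit (neuron indices):", circuit)
```

In this toy setting the scores sum exactly to the metric, because the readout is linear. In a real transformer the gradient is only a local approximation, which is one reason the choice of attribution method matters.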
In practice
- Trace causal circuits in Llama 3.1-8B-Instruct.
- Steer specific neuron clusters to alter model output.
- Identify neurons encoding multi-hop reasoning steps.
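The steering step listed above can be illustrated with a small sketch that clamps selected neurons to new values and reads off the changed prediction. This is a toy model under stated assumptions, not the pipeline from the paper. The "Texas" and "California" neurons, the weights, and the Dallas example are all hypothetical.

```python
# Toy illustration only: hypothetical "state" neurons feeding capital logits.

CAPITALS = ["Austin", "Sacramento"]

# Each row is the contribution of one hidden neuron to each capital logit.
W_OUT = [
    [3.0, 0.0],  # neuron 0: stands in for a "Texas" feature
    [0.0, 3.0],  # neuron 1: stands in for a "California" feature
    [0.5, 0.5],  # neuron 2: stands in for a generic "capital city" feature
]

def logits(acts):
    """Linear readout from hidden activations to one logit per capital."""
    return [sum(a * row[j] for a, row in zip(acts, W_OUT))
            for j in range(len(CAPITALS))]

def predict(acts):
    """Name of the capital with the highest logit."""
    scores = logits(acts)
    return CAPITALS[scores.index(max(scores))]

def steer(acts, overrides):
    """Return a copy of acts with the selected neurons clamped to new values."""
    steered = list(acts)
    for idx, value in overrides.items():
        steered[idx] = value
    return steered

# Hypothetical activations for a prompt about Dallas.
dallas_acts = [1.0, 0.0, 1.0]

baseline = predict(dallas_acts)
# Suppress the "Texas" neuron and switch on the "California" neuron.
steered = predict(steer(dallas_acts, {0: 0.0, 1: 1.0}))
print(baseline, "->", steered)
```

Clamping returns a copy instead of editing the activations in place, so the baseline and the steered run can be compared side by side.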
Topics
- Language Model Interpretability
- Circuit Tracing
- MLP Activations
- Sparse Autoencoders
- RelP Attribution
- Causal Interpretability
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.