Language Model Circuits Are Sparse in the Neuron Basis

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

This research empirically demonstrates that multilayer perceptron (MLP) neurons in language models provide a feature basis as sparse and interpretable as the features learned by Sparse Autoencoders (SAEs). It introduces an end-to-end pipeline for circuit tracing in the MLP neuron basis, using gradient-based attribution. On a standard subject-verb agreement benchmark, approximately 100 MLP neurons were sufficient to control model behavior. On a multi-hop city-to-state-to-capital task, specific sets of neurons were found to encode the latent reasoning steps, and steering them altered the model's output. By working directly with MLP activations and the efficient RelP attribution method, the approach advances automated interpretability for models such as Llama 3.1-8B-Instruct without the additional training cost of SAEs.

Key takeaway

AI scientists and machine learning engineers working on LLM interpretability should reconsider whether training Sparse Autoencoders is necessary. This work demonstrates that analyzing MLP activations directly with an efficient attribution method such as RelP can yield circuits that are as sparse and faithful as SAE-based ones. A neuron-level approach can reduce computational overhead and avoids the reconstruction error inherent in learned bases, giving a more direct path to understanding and steering model behavior.

Key insights

MLP activations, combined with RelP attribution, offer a basis for language model circuit tracing whose sparsity and faithfulness are comparable to those of SAE features.

Method

An end-to-end pipeline traces circuits in the MLP neuron basis: RelP, a gradient-based attribution method, is applied to MLP activations to localize the neurons that causally drive a given behavior.
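As a rough illustration of attribution in the neuron basis, the sketch below scores each hidden neuron of a toy one-layer MLP by gradient times activation, keeps the highest-scoring neuron as a one-neuron "circuit", and ablates it to check its effect. This is plain gradient-times-activation, not RelP, whose propagation rules the summary does not describe. The weights, the input, and every function name here are made-up assumptions for illustration, and the linear readout stands in for whatever output metric a real pipeline would use.

```python
# Toy gradient-times-activation attribution on one MLP layer.
# NOTE: this is NOT RelP; it is a simpler stand-in technique.
# All weights, inputs, and helper names are hypothetical.

def relu(values):
    return [max(0.0, v) for v in values]

def hidden_activations(W_in, x):
    """Post-ReLU neuron activations h = relu(W_in @ x)."""
    return relu([sum(w * xi for w, xi in zip(row, x)) for row in W_in])

def readout(w_out, h):
    """Scalar output metric y = w_out . h (a stand-in for e.g. a logit difference)."""
    return sum(w * a for w, a in zip(w_out, h))

def neuron_attributions(w_out, h):
    """Gradient x activation: dy/dh_i = w_out[i], so attr_i = h_i * w_out[i]."""
    return [a * w for a, w in zip(h, w_out)]

def top_k(attr, k):
    """Indices of the k neurons with the largest absolute attribution."""
    return sorted(range(len(attr)), key=lambda i: abs(attr[i]), reverse=True)[:k]

def ablate(h, indices):
    """Zero out the selected neurons, leaving the others unchanged."""
    drop = set(indices)
    return [0.0 if i in drop else a for i, a in enumerate(h)]

# Made-up toy layer: 2 inputs, 4 hidden neurons, scalar readout.
W_in = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0], [0.5, 0.5]]
w_out = [3.0, -0.1, 0.2, 0.05]
x = [2.0, 1.0]

h = hidden_activations(W_in, x)
y = readout(w_out, h)
attr = neuron_attributions(w_out, h)

circuit = top_k(attr, k=1)                  # the sparse "circuit"
y_ablated = readout(w_out, ablate(h, circuit))

print("attributions:", attr)
print("circuit neurons:", circuit)
print("output before / after ablation:", y, y_ablated)
```

Because the readout in this toy is linear in the hidden activations, the attributions sum exactly to the output, and ablating a neuron changes the output by exactly that neuron's attribution. In a real transformer neither property holds exactly, which is why attribution scores are normally validated with interventions such as ablation or steering.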

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.