Demystifying Variance in Circuit Discovery of LLMs
Summary
A new study investigates the substantial variability in circuit discovery methods for Large Language Models (LLMs), a key technique in mechanistic interpretability. EAP-IG, a method widely used for its performance on unfaithfulness metrics, suffers from three kinds of variance: resampling variance, where the discovered circuit changes with new data batches; rephrasing variance, where the circuit shifts when prompts are rephrased; and sample-wise variance, where unfaithfulness fluctuates widely across individual samples. The paper introduces CEAP, a new circuit discovery method with theoretical guarantees, which significantly reduces resampling variance. It shows that rephrasing variance arises because different prompt templates activate distinct circuits, which suggests that steering LLMs with comprehensive task circuits is inherently difficult. The study also argues that sample-wise variance is largely benign, often stemming from the definition of unfaithfulness and selective contribution scaling rather than from circuit defects. Sparsity, often claimed to create more compact circuits, does not solve the rephrasing problem.
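One way to make resampling and rephrasing variance concrete is to compare the edge sets of circuits discovered under different conditions. The sketch below is illustrative only: it uses Jaccard similarity as an assumed overlap measure (the summary does not state which metric the paper uses), and the edge names and toy circuits are hypothetical.

```python
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity of two edge sets (1.0 means identical circuits)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(circuits):
    """Average Jaccard similarity over all pairs of discovered circuits."""
    pairs = list(combinations(circuits, 2))
    if not pairs:
        raise ValueError("need at least two circuits to compare")
    return mean(jaccard(a, b) for a, b in pairs)


# Hypothetical circuits: each is a set of (source, target) edges.
# Same task and template, three different data batches (resampling).
resampled = [
    {("a0.h1", "m2"), ("m2", "a5.h3"), ("a5.h3", "logits")},
    {("a0.h1", "m2"), ("m2", "a5.h3"), ("a5.h3", "logits")},
    {("a0.h1", "m2"), ("m2", "a5.h3"), ("m4", "logits")},
]

# Same task, two different prompt templates (rephrasing).
template_a = {("a0.h1", "m2"), ("m2", "a5.h3"), ("a5.h3", "logits")}
template_b = {("a1.h0", "m3"), ("m3", "a5.h3"), ("a5.h3", "logits")}

resampling_overlap = mean_pairwise_overlap(resampled)
rephrasing_overlap = jaccard(template_a, template_b)
print(f"resampling overlap: {resampling_overlap:.3f}")
print(f"rephrasing overlap: {rephrasing_overlap:.3f}")
```

A low overlap across batches would indicate resampling variance, while a low overlap across templates would indicate rephrasing variance. In practice the edge sets would come from whichever circuit discovery method is being evaluated.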
Key takeaway
AI Scientists and NLP Engineers interpreting LLM behavior should expect circuit discovery results to vary significantly. Interpretability studies should account for rephrasing variance: different prompts activate distinct circuits, which makes comprehensive steering difficult. CEAP can reduce resampling variance, but fluctuating sample-wise unfaithfulness scores may stem from how unfaithfulness is defined rather than from circuit defects. In practice, this means validating discovered circuits across diverse prompt templates.
Key insights
LLM circuit discovery faces significant variance; CEAP reduces resampling variance, but rephrasing variance implies that steering LLMs with comprehensive task circuits is inherently difficult.
Principles
- Different prompt templates activate distinct LLM circuits.
- Steering LLMs with comprehensive task circuits is inherently difficult.
- Sample-wise unfaithfulness variance is largely benign.
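To see how sample-wise variance might be inspected, the following sketch computes a per-sample unfaithfulness score and summarizes its spread. The definition used here (absolute gap between full-model and circuit scores, normalized by the full-model score) is an assumption for illustration, not necessarily the paper's definition, and the scores are made-up toy values.

```python
from statistics import mean, pstdev


def unfaithfulness(full_score, circuit_score):
    """Assumed definition: absolute gap relative to the full model's score."""
    if full_score == 0:
        raise ValueError("full-model score must be non-zero to normalize")
    return abs(full_score - circuit_score) / abs(full_score)


def summarize(full_scores, circuit_scores):
    """Return mean, population standard deviation, and max across samples."""
    if len(full_scores) != len(circuit_scores):
        raise ValueError("score lists must have the same length")
    per_sample = [
        unfaithfulness(f, c) for f, c in zip(full_scores, circuit_scores)
    ]
    return {
        "mean": mean(per_sample),
        "std": pstdev(per_sample),
        "max": max(per_sample),
    }


# Hypothetical task metric (e.g. a logit difference) for four samples.
full_scores = [2.0, 4.0, 1.0, 5.0]
circuit_scores = [1.0, 4.0, 1.5, 5.0]

stats = summarize(full_scores, circuit_scores)
print(stats)
```

Under this assumed definition, a sample with a small full-model score can produce a large unfaithfulness value from a modest absolute gap, which is one way a score's spread can reflect the definition itself rather than a defect in the circuit.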
Method
CEAP is a new circuit discovery method with theoretical guarantees. It substantially reduces the resampling variance observed with EAP-IG in LLM mechanistic interpretability.
Topics
- LLM Interpretability
- Circuit Discovery
- Mechanistic Interpretability
- Model Variance
- Prompt Engineering
- CEAP Method
- EAP-IG
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.