Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes
Summary
A new method, "closure-validated circuit discovery," proposes functional groups of attention heads in large language models by clustering co-activation statistics, then validates each proposed group through causal ablation. The approach, adapted from sparse autoencoder clustering, was applied to two dense 1B-scale models, Pythia 1B and OLMo 1B, and to one Mixture-of-Experts model, OLMoE-1B-7B. In Pythia 1B and OLMo 1B, the discovered attention head communities passed the closure test: ablating them caused predictable per-example damage. In OLMoE-1B-7B, route-conditional clustering yielded a signal that failed the closure test, because ablating the proposed groups improved loss instead of degrading it. The research concludes that co-activation signals are only circuit proposals, and that causal ablation via a closure test is essential for confirming actual functional circuits.
Key takeaway
If you are an AI scientist or NLP engineer developing or interpreting large language models, integrate causal ablation and closure tests into your circuit discovery workflow. Co-activation statistics alone are not enough to identify functional attention head circuits, because they only provide proposals. Confirm each proposal with rigorous ablation, especially in complex architectures such as Mixture-of-Experts models, where ablation can yield counter-intuitive results such as improved loss.
Key insights
Co-activation proposes circuits, but causal ablation via closure confirms them in attention heads.
Principles
- Co-activation is a circuit proposal.
- Causal ablation validates circuit function.
- Closure separates proposals from confirmed circuits.
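The proposal step can be illustrated with a small sketch. It is not the original method: it assumes a hypothetical matrix of per-example head activation summaries (for instance, one output norm per head per example), uses Pearson correlation as the co-activation statistic, and swaps in a simple threshold-plus-connected-components grouping in place of whatever clustering the authors used. The function name, the threshold, and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def propose_communities(head_acts, threshold=0.5):
    """Group heads whose activations correlate above `threshold`.

    head_acts: (n_examples, n_heads) array of per-example activation
    summaries (hypothetical input; the source does not specify one).
    Returns connected components of the thresholded correlation graph,
    keeping only groups with at least two heads.
    """
    corr = np.corrcoef(head_acts, rowvar=False)  # (n_heads, n_heads)
    adj = corr > threshold
    np.fill_diagonal(adj, False)  # ignore self-correlation

    seen, communities = set(), []
    for start in range(adj.shape[0]):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first walk over one component
            head = stack.pop()
            component.append(head)
            for neighbor in np.flatnonzero(adj[head]):
                neighbor = int(neighbor)
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        if len(component) > 1:
            communities.append(sorted(component))
    return communities


# Synthetic data: heads 0-2 share one latent signal, heads 3-5 another,
# heads 6-7 are independent noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
acts = rng.normal(scale=0.3, size=(500, 8))
acts[:, 0:3] += latent[:, [0]]
acts[:, 3:6] += latent[:, [1]]

communities = propose_communities(acts)
print(communities)
```

The output of this step is only a list of candidate groups. Nothing in it shows that a group does anything functionally, which is why a causal check is still needed.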
Method
Cluster attention heads using co-activation statistics, then perform a closure test by causally ablating the discovered community and comparing per-example damage to matched-random controls.
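One way to read the closure test described above is sketched below, under stated assumptions. The `loss_fn` callable is hypothetical: it stands in for running the model with a given set of heads ablated and returning per-example loss. Controls are matched by group size only, and the pass criterion (positive mean damage that exceeds most random controls) is an illustrative choice, since the source does not give the exact statistic. The toy loss is synthetic and is not data from the original article.

```python
import numpy as np


def closure_test(loss_fn, community, n_heads, n_controls=100, seed=0):
    """Compare ablation damage of `community` to size-matched random sets.

    loss_fn(ablated_heads) -> per-example loss array (hypothetical hook).
    """
    rng = np.random.default_rng(seed)
    clean = loss_fn([])
    damage = loss_fn(community) - clean  # per-example damage

    excluded = set(community)
    others = [h for h in range(n_heads) if h not in excluded]
    control_means = np.empty(n_controls)
    for i in range(n_controls):
        control = rng.choice(others, size=len(community), replace=False)
        control_means[i] = (loss_fn(list(control)) - clean).mean()

    # Empirical p-value: how often a random set hurts at least as much.
    p_value = (1 + np.sum(control_means >= damage.mean())) / (1 + n_controls)
    return {
        "damage": damage,
        "mean_damage": float(damage.mean()),
        "control_mean": float(control_means.mean()),
        "p_value": float(p_value),
        "passed": bool(damage.mean() > 0 and p_value < 0.05),
    }


def make_toy_loss(weights, n_examples=200, seed=1):
    """Synthetic stand-in for a model: ablating head h adds
    weights[h] * usage[example, h] to that example's loss."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(1.0, 2.0, size=n_examples)
    usage = rng.uniform(0.0, 1.0, size=(n_examples, len(weights)))

    def loss_fn(ablated):
        idx = np.asarray(list(ablated), dtype=int)
        return base + usage[:, idx] @ weights[idx]

    return loss_fn


community = [0, 1, 2]

# Case 1: the community matters, so ablating it raises loss.
helpful = np.full(12, 0.05)
helpful[community] = 1.0
res_pass = closure_test(make_toy_loss(helpful), community, n_heads=12)

# Case 2: ablating the community lowers loss, so closure fails.
harmful = np.full(12, 0.05)
harmful[community] = -0.3
res_fail = closure_test(make_toy_loss(harmful), community, n_heads=12)

print(res_pass["passed"], res_fail["passed"])
```

The second case loosely mirrors the reported OLMoE-1B-7B outcome, where a co-activation signal existed but ablation improved loss. In this sketch such a group is rejected even though the proposal step would have surfaced it.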
In practice
- Use closure tests for circuit validation.
- Distinguish proposals from confirmed circuits.
- Run the closure test on both dense and MoE models.
Topics
- Attention Heads
- Circuit Discovery
- Causal Ablation
- Model Interpretability
- Large Language Models
- Mixture-of-Experts
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.