Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink
Summary
A study of Mamba-2 shows that a common assumption in mechanistic interpretability can fail systematically: that a probe which identifies a representational signature also identifies the circuit that executes the behavior. The researchers found that single-bucket probes for the Mamba-2 state sink, the analogue of the attention sink, recover only a small execution layer and miss a much larger detection layer that carries the same representational signature. The state sink decomposes into two functional head sets. BOS-specialist heads (about 5% of heads at 2.7B) causally support BOS-context and newline-target predictions, while dual heads (27-35% of heads) show stronger representational similarity but weaker causal effects. Ablating the BOS-specialist heads collapsed RULER NIAH retrieval accuracy from 1.00 to 0.00 at a context length of 1024 in both Mamba-1 2.8B and Mamba-2 2.7B, confirming their functional importance. The distinction implicates Mamba-2's head-shared Delta projection and shows that separating execution circuits from detection circuits requires class-conditional ablation.
Key takeaway
Machine learning engineers interpreting Mamba-2's internal mechanisms should recognize that single-bucket probes may identify detection layers that have no corresponding execution circuit. Pair probing with class-conditional ablation to differentiate functional head sets, because representational similarity alone does not guarantee a causal effect. The distinction is critical for understanding and modifying model behavior accurately, especially on tasks such as RULER NIAH retrieval, where BOS-specialist heads are crucial.
Key insights
Representational similarity in Mamba-2 does not imply functional equivalence for mechanistic interpretability probes.
Principles
- Probes can identify detection without execution.
- Ablation is key to distinguishing functional circuits.
- The Mamba-2 state sink has two functional head sets.
Method
Distinguish detection from execution circuits by using class-conditional ablation, rather than just class-conditional cosine similarity, especially when probes recover both at coarse granularity.
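The contrast between the two measurements can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the arrays are synthetic rather than Mamba-2 states, the function names class_conditional_cosine and class_conditional_ablation are hypothetical, heads are ablated by zeroing, and the effect is scored with a squared-error loss on a linear readout. The study's actual procedure may differ on each point.

```python
import numpy as np

def class_conditional_cosine(states, labels, reference):
    """Mean cosine similarity of each head's state to `reference`, per class.

    states: (n_samples, n_heads, d); labels: (n_samples,) integer classes.
    Returns {class: (n_heads,) array}.
    """
    ref = reference / np.linalg.norm(reference)
    norms = np.clip(np.linalg.norm(states, axis=-1), 1e-12, None)
    cos = (states @ ref) / norms                      # (n_samples, n_heads)
    return {int(c): cos[labels == c].mean(axis=0) for c in np.unique(labels)}

def class_conditional_ablation(states, labels, readout, targets):
    """Mean increase in squared error, per class, when each head is zeroed."""
    def per_sample_loss(s):
        pred = np.einsum("nhd,hd->n", s, readout)     # toy linear readout
        return (pred - targets) ** 2

    base = per_sample_loss(states)
    n_heads = states.shape[1]
    effects = {int(c): np.zeros(n_heads) for c in np.unique(labels)}
    for h in range(n_heads):
        ablated = states.copy()
        ablated[:, h, :] = 0.0                        # zero-ablate head h
        delta = per_sample_loss(ablated) - base
        for c in effects:
            effects[c][h] = delta[labels == c].mean()
    return effects

# Synthetic data (assumed): class 1 stands in for a "BOS context" bucket.
rng = np.random.default_rng(0)
n, n_heads, d = 200, 3, 4
labels = np.repeat([0, 1], n // 2)
reference = np.array([1.0, 0.0, 0.0, 0.0])            # assumed sink direction

states = 0.1 * rng.normal(size=(n, n_heads, d))
states[labels == 1, 0] += 1.0 * reference             # head 0: read downstream
states[labels == 1, 1] += 3.0 * reference             # head 1: stronger signature

readout = np.zeros((n_heads, d))
readout[0] = reference                                # only head 0 drives output
targets = np.einsum("nhd,hd->n", states, readout)     # intact model fits exactly

cos = class_conditional_cosine(states, labels, reference)
effects = class_conditional_ablation(states, labels, readout, targets)

for h in range(n_heads):
    print(f"head {h}: cosine={cos[1][h]:.3f}  ablation effect={effects[1][h]:.3f}")
```

In this toy setup head 1 carries the strongest class-conditional signature, yet zeroing it leaves the loss unchanged because nothing downstream reads it, while head 0 has a weaker signature and a large ablation effect. That is the detection-versus-execution pattern in miniature: ranking heads by cosine similarity and ranking them by causal effect can disagree.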
In practice
- Use class-conditional ablation for Mamba-2 interpretability.
- Evaluate probe findings with causal effect tests.
- Do not rely on representational similarity alone.
Topics
- Mechanistic Interpretability
- Mamba-2
- State Sink
- Neural Network Probes
- Causal Ablation
- Large Language Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.