When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals
Summary
Block Attention Residuals (Block AttnRes) replace fixed additive residuals with a learned softmax over earlier depth-source representations, which makes cross-layer routing an inspectable tensor. This study asks whether that architectural exposure is enough for mechanistic interpretation by probing two same-scale 0.6B Qwen3 checkpoints: a vanilla Qwen3 wrapped with a deterministic recency-bias schedule, and a Block AttnRes Qwen3 trained from scratch. The wrapped baseline's routing weights were content-independent and reproduced the schedule's analytic prediction. In contrast, the trained AttnRes checkpoint revealed three localized routing motifs: an embedding-source pathway, a current-state pathway, and an older-history pathway. Crucially, average routing mass and causal importance dissociated sharply: the slice with the largest mass was not the largest causal contributor. Architectural exposure of routing is therefore necessary but not sufficient for mechanistic interpretation: structured depth routing requires routing to be part of training, and descriptive summaries must be validated with causal interventions.
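To make the routing idea concrete, here is a minimal sketch of a softmax mix over earlier depth sources. It is not the paper's implementation: the dot-product scoring, the `query` vector, and the function names are assumptions chosen for illustration, and the summary does not specify how Block AttnRes actually parameterizes its scores.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_over_depth(sources, query):
    """Mix earlier depth-source vectors with softmax weights.

    sources: earlier representations, e.g. [embedding, older_block, current_state]
    query:   hypothetical learned routing vector (an assumption of this sketch)
    Returns (mixed_vector, weights); `weights` is the inspectable routing row.
    """
    scores = [sum(q * x for q, x in zip(query, src)) for src in sources]
    weights = softmax(scores)
    dim = len(sources[0])
    mixed = [sum(w * src[i] for w, src in zip(weights, sources)) for i in range(dim)]
    return mixed, weights

def fixed_additive_residual(sources):
    """Simplified contrast case: every source is added with a fixed weight of 1."""
    dim = len(sources[0])
    return [sum(src[i] for src in sources) for i in range(dim)]

# Toy inputs, invented for illustration only.
sources = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
mixed, weights = route_over_depth(sources, query)
```

In the fixed additive case there is no weight vector to inspect. In the routed case, `weights` is a per-layer distribution over sources that can be logged and averaged, and that is the quantity the study probes.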
Key takeaway
For AI scientists and NLP engineers working on model interpretability, merely exposing internal routing mechanisms such as Block AttnRes is insufficient. Routing must be part of the model's training for structured, causally meaningful depth routing to emerge. Validate descriptive routing summaries with causal interventions, because high routing mass does not guarantee significant causal impact.
Key insights
Architectural routing exposure is necessary but insufficient for mechanistic interpretation.
Principles
- Structured depth routing emerges only when routing is part of training.
- Descriptive routing summaries are hypotheses, not mechanism evidence.
- Largest routing mass does not equate to largest causal contribution.
Method
Causal probes and routing-ablation interventions are used to test routing hypotheses.
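A routing-ablation check can be sketched as follows, under loud assumptions: a toy linear readout stands in for the model, ablation means zeroing one source's routing weight without renormalizing, and every number is invented for illustration. None of this reproduces the study's protocol or results. It only shows how routing mass and causal effect can diverge.

```python
def mix(weights, sources):
    """Weighted sum of source vectors under the given routing weights."""
    dim = len(sources[0])
    return [sum(w * s[i] for w, s in zip(weights, sources)) for i in range(dim)]

def readout(vec, readout_weights):
    """Hypothetical downstream readout: a single linear projection."""
    return sum(r * v for r, v in zip(readout_weights, vec))

def ablate_source(weights, k):
    """Zero the routing weight on source k (no renormalization, by assumption)."""
    return [0.0 if i == k else w for i, w in enumerate(weights)]

def causal_effects(weights, sources, readout_weights):
    """Absolute change in the readout when each source is ablated in turn."""
    base = readout(mix(weights, sources), readout_weights)
    effects = []
    for k in range(len(sources)):
        ablated = readout(mix(ablate_source(weights, k), sources), readout_weights)
        effects.append(abs(base - ablated))
    return effects

# Toy numbers, not results from the study.
weights = [0.6, 0.3, 0.1]                 # average routing mass per source
sources = [[1.0, 0.0], [1.0, 0.2], [0.0, 5.0]]
readout_weights = [0.0, 1.0]              # readout only uses the second feature

effects = causal_effects(weights, sources, readout_weights)
largest_mass = max(range(len(weights)), key=weights.__getitem__)
largest_effect = max(range(len(effects)), key=effects.__getitem__)
```

In this toy setup the source with the most routing mass (index 0) has no effect on the readout, while the source with the least mass (index 2) has the largest effect. Other ablation choices, such as renormalizing the remaining weights or substituting a mean activation, would give different numbers and should be reported alongside the one used.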
In practice
- Test descriptive routing summaries with causal interventions.
- Integrate routing into training for structured depth routing.
Topics
- Block Attention Residuals
- Model Interpretability
- Causal Probing
- Neural Network Routing
- Qwen3
- Mechanistic Interpretability
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.