Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns
Summary
The "Bag of Dims" framework is a training-free, architecture-general method for mechanistic interpretability in transformers. It rests on the claim that the standard basis of the hidden states already encodes semantic content: each dimension behaves as an independent binary register whose sign carries the semantics and whose magnitude carries confidence. The framework was validated on Qwen 3.5-4B, Gemma 3-4B, and Mistral 7B. Sign patterns alone prove highly predictive, reaching 72-93% top-5 next-token accuracy even when every magnitude is replaced with unity, and pure Hamming scoring without a decoder reaches 80-90% top-4096 accuracy. With zero training, the method discovered 175 semantic categories from 50 anchors at a mean AUC of 0.80; a trained probe added only +0.018 AUC, which is consistent with minimal cross-dimension structure. The same analysis extends to attention mechanisms and FFNs, where 20% of features can be linked to individual writer neurons. Unsupervised discovery yielded 1500 features with 99% sparsity and low inter-dimension coupling (0.0014 bits of pairwise mutual information), at a cost of a single forward pass per vocabulary token.
Key takeaway
For AI scientists and machine learning engineers working on model interpretability, this research offers a different way to study transformer internals: semantic feature encoding and neuron contributions can be inspected without extensive training or GPU-days of compute. That makes analysis of model behavior faster and cheaper, which helps when debugging or validating transformer architectures. Consider adding dimension-level sign pattern analysis to your interpretability toolkit.
Key insights
The standard basis of transformer hidden states serves as a training-free, dimension-level feature basis in which the sign of each dimension encodes semantics.
Principles
- Transformer dimensions act as independent binary semantic registers.
- Sign patterns alone carry significant predictive content.
- Cross-dimension structure in transformer hidden states appears minimal: a trained probe added only +0.018 AUC over the training-free result.
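One way to make the independence principle concrete is to measure the mutual information between the sign bits of two dimensions, the quantity behind the reported 0.0014 bits of pairwise coupling. The following is a minimal sketch, not the authors' code: the function name `sign_mutual_information`, the convention that zero counts as positive, and the toy inputs are all assumptions made for illustration.

```python
import numpy as np

def sign_mutual_information(a, b):
    """Mutual information (in bits) between the sign bits of two dimensions.

    a, b: 1-D arrays holding the activations of two hidden-state dimensions
    over the same set of samples. Zero is treated as positive (an assumption).
    """
    x = (np.asarray(a) >= 0).astype(int)
    y = (np.asarray(b) >= 0).astype(int)

    # 2x2 joint histogram of the sign bits, normalised to a probability table.
    joint = np.zeros((2, 2))
    np.add.at(joint, (x, y), 1)
    joint /= joint.sum()

    # Marginals; their outer product is the joint under full independence.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    independent = px @ py

    nz = joint > 0  # skip empty cells, where p * log(p) is taken as 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / independent[nz])))

# Toy check: identical sign patterns share one full bit, unrelated ones share none.
same = sign_mutual_information([1.0, -2.0, 0.5, -0.1], [3.0, -1.0, 0.2, -4.0])
unrelated = sign_mutual_information([1.0, 1.0, -1.0, -1.0], [1.0, -1.0, 1.0, -1.0])
```

On real hidden states, values close to zero for most dimension pairs would be in line with the low coupling the summary reports.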
Method
The "Bag of Dims" framework identifies semantic features by analyzing sign consistency across individual dimensions of transformer hidden states, requiring only a single forward pass per vocabulary token for discovery.
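The sign-consistency idea can be sketched in a few lines of NumPy. This is an illustrative reading of the method, not the published procedure: the summary does not state the exact selection rule, so the agreement threshold `min_agreement`, the function name `consistent_sign_dims`, and the toy anchor states below are assumptions.

```python
import numpy as np

def consistent_sign_dims(anchor_states, min_agreement=1.0):
    """Find dimensions whose sign agrees across a category's anchor tokens.

    anchor_states: array of shape (n_anchors, d_model), one hidden state per
        anchor token of a single semantic category.
    min_agreement: fraction of anchors that must share the majority sign
        (hypothetical parameter; 1.0 means unanimous).
    Returns (dims, signs): the selected dimension indices and their shared sign.
    """
    signs = np.where(anchor_states >= 0, 1, -1)   # discard magnitudes
    mean_sign = signs.mean(axis=0)                # in [-1, 1] per dimension
    agreement = (1 + np.abs(mean_sign)) / 2       # share of the majority sign
    dims = np.flatnonzero(agreement >= min_agreement)
    return dims, np.where(mean_sign[dims] >= 0, 1, -1)

# Toy example: three anchors, four dimensions. Dimension 1 flips sign, so it
# is not a consistent register for this category.
anchors = np.array([
    [0.9, -0.2, 0.4, -1.1],
    [0.3,  0.5, 0.1, -0.7],
    [1.2, -0.6, 0.8, -0.2],
])
dims, signs = consistent_sign_dims(anchors)
```

Because only signs are compared, no optimization is involved; the cost is the forward passes needed to collect the anchor states.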
In practice
- Interpret transformer hidden states without additional training or optimization.
- Discover semantic features using sign patterns and Hamming scoring.
- Analyze FFN neuron contributions to specific features.
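Hamming scoring, mentioned in the list above, can be illustrated as ranking vocabulary tokens by how many sign bits they share with a hidden state. The sketch below assumes a precomputed table of per-token sign signatures (for instance from one forward pass per vocabulary token); the names `sign_bits`, `hamming_top_k`, and `token_signatures`, and the toy numbers, are hypothetical.

```python
import numpy as np

def sign_bits(x):
    """Binarise activations: True where the value is non-negative."""
    return np.asarray(x) >= 0

def hamming_top_k(query_state, token_signatures, k):
    """Rank tokens by Hamming distance between sign patterns.

    query_state: 1-D array of length d_model (a hidden state).
    token_signatures: boolean array of shape (vocab_size, d_model).
    Returns (token_ids, distances) for the k closest tokens.
    """
    q = sign_bits(query_state)
    distances = np.count_nonzero(token_signatures != q, axis=1)
    order = np.argsort(distances, kind="stable")[:k]
    return order, distances[order]

# Toy vocabulary of three tokens in a four-dimensional hidden space.
token_states = np.array([
    [ 0.5, -0.3,  0.8, -0.1],   # token 0
    [-0.4, -0.9,  0.2,  0.6],   # token 1
    [ 0.7,  0.1, -0.5, -0.2],   # token 2
])
token_signatures = sign_bits(token_states)

# The query has very different magnitudes but the same signs as token 0.
query = np.array([2.0, -0.01, 0.3, -4.0])
top, dist = hamming_top_k(query, token_signatures, k=2)
```

Since magnitudes are discarded before comparison, rescaling any dimension of the query leaves the ranking unchanged, which mirrors the summary's point that signs alone carry much of the predictive content.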
Topics
- Mechanistic Interpretability
- Transformer Models
- Feature Extraction
- Hidden States
- Large Language Models
- Training-Free Methods
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.