Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns
Summary
The Bag of Dims framework is a training-free approach to mechanistic interpretability. It reports that transformer hidden-state dimensions function as independent binary registers, encoding semantic content in their signs (±1) and confidence in their magnitudes. The framework was validated on Qwen 3.5-4B, Gemma 3-4B, and Mistral 7B through four progressive experiments. Sign patterns alone achieved 72–93% top-5 next-token accuracy and 80–90% on top-4096 pure Hamming prediction, without any decoder. Using a single-token type cache and 50 anchors, the method discovered 175 semantic categories with a mean AUC of 0.80 and zero training. These features persist in the K and V attention projections, and FFN neurons write them in an axis-aligned manner. Unsupervised discovery produced 1,500 features at 100% yield and 99% sparsity, and inter-dimension coupling was low (0.0014 bits of mutual information).
Key takeaway
For AI Scientists and MLOps Engineers seeking to understand or debug large language models, this research suggests you can interpret internal states without costly training. You should explore "Bag of Dims" to directly read semantic features from standard basis dimensions, leveraging sign patterns for insights into model computation. This approach offers a fast, training-free alternative to traditional interpretability methods, requiring only a single forward pass per vocabulary token.
Key insights
Transformer hidden state dimensions act as independent binary registers, encoding semantic features via signs without training.
Principles
- Transformer dimensions encode content via sign and confidence via magnitude.
- Cross-dimension structure offers no practical benefit for feature reading.
- FFN neurons write axis-aligned sign patterns via down_proj majority vote.
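
The summary does not define "majority vote", so the following is only one plausible reading, sketched in plain Python: each active FFN neuron contributes `w_down[dim][n] * acts[n]` to a hidden dimension, and the dimension's written sign is taken to be the sign that most active neurons agree on. The names `w_down`, `acts`, `neuron_votes`, and `majority_sign` are hypothetical, the weights are toy values, and the `[hidden_dim][n_neurons]` layout is an assumption.

```python
def neuron_votes(w_down, acts, dim, eps=1e-6):
    """Return the sign (+1/-1) of each active neuron's contribution to `dim`.

    w_down: toy down-projection weights, indexed [hidden_dim][neuron] (assumed layout).
    acts:   toy post-activation values, one per FFN neuron.
    Neurons whose contribution is (near) zero cast no vote.
    """
    votes = []
    for n, a in enumerate(acts):
        contribution = w_down[dim][n] * a
        if abs(contribution) > eps:
            votes.append(1 if contribution > 0 else -1)
    return votes


def majority_sign(votes):
    """Sign favoured by most voters; 0 when the vote is tied or empty."""
    total = sum(votes)
    if total == 0:
        return 0
    return 1 if total > 0 else -1


# Toy example: 2 hidden dimensions, 4 FFN neurons (the last one is inactive).
w_down = [
    [0.5, 0.2, -0.1, 0.3],    # weights writing to dimension 0
    [-0.4, -0.3, 0.2, -0.6],  # weights writing to dimension 1
]
acts = [1.0, 0.5, 2.0, 0.0]

sign_dim0 = majority_sign(neuron_votes(w_down, acts, 0))
sign_dim1 = majority_sign(neuron_votes(w_down, acts, 1))
print(sign_dim0, sign_dim1)
```

Counting votes rather than summing contributions ignores magnitude on purpose, which mirrors the claim that sign carries the content; whether the paper measures it this way is not stated in this summary.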
Method
The Bag of Dims method first builds a single-token type cache. It then discovers features by computing, for each dimension, the AUC that separates a set of anchor tokens from the full vocabulary, and uses the result to build sign prototypes.
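
A minimal sketch of this discovery step, using only the Python standard library and a toy six-token cache. Several details are assumptions rather than the paper's procedure: the cache is a plain dict from token id to hidden vector, anchors are compared against all non-anchor tokens, a dimension is kept when its AUC is at least `threshold` away from chance in either direction, and the prototype sign is the majority sign of the anchors on that dimension. The names `auc`, `discover_feature`, and `threshold` are hypothetical.

```python
def auc(pos, neg):
    """Probability that a random value from `pos` exceeds one from `neg` (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def discover_feature(cache, anchor_ids, threshold=0.8):
    """Build a sign prototype {dimension: +1 or -1} from anchor tokens.

    cache: dict mapping token id -> list of floats (one cached hidden state per token).
    A dimension is registered when its AUC is >= threshold or <= 1 - threshold.
    """
    anchors = set(anchor_ids)
    n_dims = len(next(iter(cache.values())))
    prototype = {}
    for d in range(n_dims):
        pos = [vec[d] for tok, vec in cache.items() if tok in anchors]
        neg = [vec[d] for tok, vec in cache.items() if tok not in anchors]
        a = auc(pos, neg)
        if max(a, 1.0 - a) >= threshold:
            # Assumed rule: prototype sign = majority sign of the anchors.
            vote = sum(1 if v >= 0 else -1 for v in pos)
            prototype[d] = 1 if vote >= 0 else -1
    return prototype


# Toy cache: 6 tokens, 3 dimensions; tokens 0 and 1 are the anchors.
cache = {
    0: [2.0, -1.5, 0.3],
    1: [1.0, -0.5, -0.2],
    2: [-1.0, 0.7, 0.4],
    3: [-2.0, 1.2, -0.1],
    4: [-0.5, 0.9, 0.1],
    5: [-1.5, 0.4, -0.3],
}
prototype = discover_feature(cache, anchor_ids=[0, 1])
print(prototype)  # dimensions 0 and 1 are registered; dimension 2 is at chance
```

The pairwise AUC loop is quadratic and only suitable for toy data; a rank-based computation would be the usual choice at vocabulary scale.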
In practice
- Use a single forward pass per vocabulary token to build a type cache.
- Score new tokens by counting sign matches with registered dimension subsets.
- Inspect FFN down_proj weights to link neurons to axis-aligned features.
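
The sign-match scoring in the second bullet can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: a feature is represented as a hypothetical dict mapping registered dimensions to expected signs, zero is treated as positive, and the score is the fraction of registered dimensions that agree (one minus the normalised Hamming distance on that subset). All values are toy data.

```python
def sign(x):
    """Map a real value to +1 or -1 (zero is treated as +1 by assumption)."""
    return 1 if x >= 0 else -1


def sign_match_score(hidden, prototype):
    """Fraction of registered dimensions whose sign matches the prototype.

    hidden:    list of floats, one hidden state for the token being scored.
    prototype: dict {dimension index: expected sign}, i.e. the registered subset.
    Magnitudes are ignored; only the signs are compared.
    """
    matches = sum(1 for d, s in prototype.items() if sign(hidden[d]) == s)
    return matches / len(prototype)


# Hypothetical feature registered on 4 of 8 dimensions.
prototype = {0: 1, 3: -1, 5: 1, 7: -1}

token_a = [0.9, 0.1, -0.2, -1.3, 0.0, 2.1, 0.4, -0.6]
token_b = [-0.4, 0.3, 0.8, 0.5, -0.1, 1.0, -0.2, -0.9]

score_a = sign_match_score(token_a, prototype)
score_b = sign_match_score(token_b, prototype)
print(score_a, score_b)  # 1.0 0.5
```

Unregistered dimensions never enter the score, so a feature that uses a few dozen dimensions costs only that many sign comparisons per token.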
Topics
- Mechanistic Interpretability
- Transformer Hidden States
- Feature Discovery
- Language Model Analysis
- FFN Neuron Circuits
- Training-Free Methods
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.