Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The "Bag of Dims" framework is a training-free, architecture-general method for mechanistic interpretability in transformers. Its central claim is that the standard basis of the hidden states already encodes semantic content: each dimension acts as an independent binary register whose sign carries meaning and whose magnitude carries confidence. The framework was validated on Qwen 3.5-4B, Gemma 3-4B, and Mistral 7B. Sign patterns alone are highly predictive, reaching 72-93% top-5 next-token accuracy even when every magnitude is replaced with unity, and pure Hamming scoring without a decoder reaches 80-90% top-4096 accuracy. From 50 anchors and with zero training, the method discovered 175 semantic categories with a mean AUC of 0.80. A trained probe added only +0.018 AUC, indicating that there is little cross-dimension structure left to exploit. The same interpretability extends to attention mechanisms and FFNs, with 20% of features linked to individual writer neurons. Unsupervised discovery yielded 1500 features with 99% sparsity and low inter-dimension coupling (0.0014 bits of pairwise mutual information), and it requires only a single forward pass per vocabulary token.
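To make the sign-only idea concrete, the following is a minimal sketch of Hamming-style scoring on toy data. It is not the authors' implementation: the function names (`sign_bits`, `hamming_agreement`, `top_k_by_sign`), the random "vocabulary" matrix, and the choice to score by counting agreeing signs are all assumptions made for illustration. No real model is loaded.

```python
import numpy as np

def sign_bits(hidden):
    """Binarize hidden states: True where a dimension is positive."""
    return np.asarray(hidden) > 0

def hamming_agreement(query_bits, vocab_bits):
    """For each vocabulary row, count the dimensions whose sign matches the query."""
    return (vocab_bits == query_bits).sum(axis=-1)

def top_k_by_sign(query_hidden, vocab_hidden, k):
    """Rank vocabulary rows by sign agreement with the query; magnitudes are ignored."""
    scores = hamming_agreement(sign_bits(query_hidden), sign_bits(vocab_hidden))
    order = np.argsort(-scores, kind="stable")
    return order[:k], scores

# Toy stand-in for per-token hidden states (hypothetical data, not model output).
rng = np.random.default_rng(0)
vocab_hidden = rng.standard_normal((1000, 256))

# Build a query that shares token 42's signs but has unrelated magnitudes,
# then flip the sign of 25 randomly chosen dimensions as noise.
query = np.abs(rng.standard_normal(256)) * np.sign(vocab_hidden[42])
flipped = rng.choice(256, size=25, replace=False)
query[flipped] *= -1

top, scores = top_k_by_sign(query, vocab_hidden, k=5)
print(top[0], scores[42])
```

In this toy setup the query agrees with token 42 on 231 of 256 dimensions, while unrelated random rows agree on roughly half, so the sign pattern alone is enough to retrieve the token even though the magnitudes were discarded.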

Key takeaway

For AI scientists and machine learning engineers working on model interpretability, this research offers a way to study transformer internals without training any auxiliary model. You can inspect how semantic features are encoded, and which neurons write them, from sign patterns alone rather than spending GPU-days on training. That makes analysis, debugging, and validation of transformer behavior faster and cheaper. Consider adding dimension-level sign pattern analysis to your interpretability toolkit.

Key insights

The standard basis of transformer hidden states already serves as a training-free, dimension-level feature basis: the sign of each dimension encodes semantic content.

Method

The "Bag of Dims" framework identifies semantic features by analyzing sign consistency across individual dimensions of transformer hidden states, requiring only a single forward pass per vocabulary token for discovery.
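One plausible reading of "sign consistency" is sketched below: given hidden states for a few anchor tokens of one category, keep the dimensions where the anchors agree in sign, then score other tokens by how often they match those signs. The helper names (`consistent_dims`, `category_score`), the threshold, and the hand-built toy anchors are assumptions for illustration. The source does not specify the exact procedure.

```python
import numpy as np

def consistent_dims(anchor_hidden, threshold=1.0):
    """Find dimensions whose sign is consistent across anchors.

    Consistency is |mean of signs|: 1.0 means every anchor agrees,
    0.0 means an even split. Returns the selected dimension indices
    and the majority sign (+1 or -1) on each of them.
    """
    signs = np.sign(anchor_hidden)        # shape (n_anchors, d)
    mean_sign = signs.mean(axis=0)        # values in [-1, 1]
    dims = np.flatnonzero(np.abs(mean_sign) >= threshold)
    return dims, np.sign(mean_sign[dims])

def category_score(hidden, dims, majority_sign):
    """Fraction of the selected dimensions on which `hidden` has the category's sign."""
    return (np.sign(hidden[..., dims]) == majority_sign).mean(axis=-1)

# Hand-built toy anchors (hypothetical): signs alternate by row on most
# dimensions, but dimension 1 is always positive and dimension 4 always negative.
rng = np.random.default_rng(0)
n_anchors, d = 6, 8
signs = np.where(np.arange(n_anchors)[:, None] % 2 == 0, 1.0, -1.0) * np.ones((1, d))
signs[:, 1] = 1.0
signs[:, 4] = -1.0
anchors = signs * (0.1 + rng.random((n_anchors, d)))  # magnitudes vary, signs do not

dims, majority = consistent_dims(anchors)

member = np.ones(d)
member[4] = -2.0          # matches the category on both selected dimensions
non_member = -member      # opposite sign on both selected dimensions

print(dims, majority)
print(category_score(member, dims, majority), category_score(non_member, dims, majority))
```

A threshold of 1.0 demands perfect agreement among the anchors. A lower value would tolerate a few exceptions, at the cost of admitting dimensions that agree by chance when the anchor set is small.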

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.