Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Bag of Dims is a training-free approach to mechanistic interpretability. Its central claim is that individual transformer hidden-state dimensions behave as independent binary registers: the sign (±1) of each dimension encodes semantic content, and the magnitude encodes confidence. The authors validate the framework on Qwen 3.5-4B, Gemma 3-4B, and Mistral 7B through four progressive experiments. Sign patterns alone achieved 72–93% top-5 next-token accuracy, and pure Hamming prediction over a top-4096 set reached 80–90% without any decoder. Using a single-token type cache and 50 anchors, with no training, the method discovered 175 semantic categories with a mean AUC of 0.80. The same features persist in the K and V attention projections, and FFN neurons write them in an axis-aligned manner. Unsupervised discovery produced 1,500 features with 100% yield and 99% sparsity, and measured inter-dimension coupling was low (0.0014 bits of mutual information).

Key takeaway

For AI scientists and MLOps engineers who need to understand or debug large language models, this research suggests that internal states can be interpreted without costly training. Bag of Dims reads semantic features directly from standard-basis dimensions, using their sign patterns as the signal. The result is a fast, training-free alternative to traditional interpretability methods that needs only a single forward pass per vocabulary token.

Key insights

Transformer hidden state dimensions act as independent binary registers, encoding semantic features via signs without training.

Principles

Transformer dimensions encode content via sign and confidence via magnitude.
Cross-dimension structure offers no practical benefit for feature reading.
FFN neurons write axis-aligned sign patterns via down_proj majority vote.
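
The first principle can be made concrete with a small sketch. It is illustrative only, not the authors' code: the helper names `sign_pattern` and `hamming` and the toy vectors are assumptions, as is the choice to treat an exact zero as positive.

```python
def sign_pattern(hidden):
    """Split a hidden state into per-dimension signs (+1/-1) and magnitudes.

    Assumption: an exact zero is treated as +1.
    """
    signs = [1 if x >= 0 else -1 for x in hidden]
    magnitudes = [abs(x) for x in hidden]
    return signs, magnitudes


def hamming(a, b):
    """Count the dimensions on which two sign patterns disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Toy 4-dimensional hidden states (made-up values).
h1 = [0.8, -1.2, 0.1, -0.4]
h2 = [0.5, -0.3, -0.9, -2.0]

s1, m1 = sign_pattern(h1)
s2, m2 = sign_pattern(h2)
distance = hamming(s1, s2)  # the two states differ only in dimension 2
```

Comparing states by Hamming distance over signs ignores magnitudes entirely, which is the sense in which sign alone is said to carry the content.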

Method

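The Bag of Dims method has two steps. First, build a single-token type cache by running each vocabulary token through the model once. Second, discover features: for a set of anchor tokens, compute a per-dimension AUC against the full vocabulary, and use the discriminative dimensions to build a sign prototype.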
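
A minimal sketch of the discovery step follows, under stated assumptions. The type cache is modelled as a plain dict from token to vector, the `min_auc` threshold and the function names are hypothetical, and the prototype sign is taken from the direction of the AUC; the paper may select dimensions and signs differently.

```python
def auc(pos, neg):
    """Probability that a value from pos exceeds a value from neg (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def discover_feature(cache, anchors, min_auc=0.9):
    """Return {dimension: sign} for dimensions that separate anchors from the rest.

    min_auc is an illustrative threshold, not a value from the source.
    """
    anchor_set = set(anchors)
    rest = [tok for tok in cache if tok not in anchor_set]
    n_dims = len(next(iter(cache.values())))
    prototype = {}
    for d in range(n_dims):
        a = auc([cache[t][d] for t in anchors], [cache[t][d] for t in rest])
        if a >= min_auc:
            prototype[d] = 1       # anchors sit on the positive side
        elif a <= 1 - min_auc:
            prototype[d] = -1      # anchors sit on the negative side
    return prototype


# Toy type cache: one made-up 3-dimensional vector per token.
cache = {
    "cat": [0.9, -0.7, 0.2],
    "dog": [1.1, -0.5, -0.3],
    "run": [-0.8, 0.6, 0.1],
    "the": [-0.2, 0.4, -0.1],
}
prototype = discover_feature(cache, ["cat", "dog"])
```

In this toy cache, dimensions 0 and 1 separate the anchors perfectly and enter the prototype, while dimension 2 has an AUC of 0.5 and is left out.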

In practice

Use a single forward pass per vocabulary token to build a type cache.
Score new tokens by counting sign matches with registered dimension subsets.
Inspect FFN down_proj weights to link neurons to axis-aligned features.
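
The last two steps above can be sketched as follows. This is a hedged illustration: the function name, the toy prototype, and the toy weights are assumptions, and the matrix is laid out with one row per hidden dimension and one column per neuron, so a column is what a single neuron writes. Comparing a neuron's column signs with a prototype is one simple way to link neurons to features, not necessarily the paper's procedure.

```python
def sign_match_score(vector, prototype):
    """Fraction of prototype dimensions whose sign in vector matches the prototype."""
    matches = sum(
        1 for d, s in prototype.items() if (vector[d] >= 0) == (s > 0)
    )
    return matches / len(prototype)


# Hypothetical feature: registered dimensions and their expected signs.
prototype = {0: 1, 1: -1, 3: 1}

# Score a new token's hidden state (made-up values); dimension 2 is ignored
# because it is not registered for this feature.
hidden = [0.4, -0.2, 5.0, -0.1]
score = sign_match_score(hidden, prototype)

# Toy down_proj weights: rows are hidden dimensions, columns are FFN neurons.
down_proj = [
    [0.5, -0.1],
    [-0.3, 0.2],
    [0.0, 0.9],
    [0.7, -0.4],
]
n_neurons = len(down_proj[0])
columns = [[row[j] for row in down_proj] for j in range(n_neurons)]

# Alignment of each neuron's write direction with the feature prototype.
alignments = [sign_match_score(col, prototype) for col in columns]
```

Here neuron 0 writes the prototype's sign on every registered dimension, while neuron 1 writes the opposite sign on all of them, so the first is a candidate writer of the feature and the second a candidate suppressor.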

Topics

Mechanistic Interpretability
Transformer Hidden States
Feature Discovery
Language Model Analysis
FFN Neuron Circuits
Training-Free Methods

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.