“Features” aren’t always the true computational primitives of a model, but that might be fine anyways

2026-02-02 · Source: AI Alignment Forum · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, medium

Summary

The term "feature" in mechanistic interpretability is contested, and three definitions are in common use: computational primitives of the model, salient properties of the input, and features posited to exist for the sake of theoretical discussion. Computational primitives are hard to reverse-engineer, as the modular addition network illustrates: understanding of the MLP's function progressed in stages, from sin/cos factors to a trigonometric integral. Salient properties are the practical definition: characteristics of the input that the model represents and that can be manipulated to change its behavior, as with the features found by sparse autoencoders (SAEs). The article proposes that "features" lie on a spectrum running from pure memorization, through case analysis (equivalence partitioning), to true computational primitives. It demonstrates the spectrum with a toy vision-language model that classifies "bleggs vs rubes": the model could represent two properties (redness, cubeness), four object types, or individual memorized data points, and each is a valid "feature" interpretation depending on context.
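The spectrum in the bleggs-vs-rubes example can be made concrete with a small sketch. This is not the article's model: the dataset, the labeling rule (a "rube" is a red cube), and every function name below are assumptions chosen for illustration. The sketch describes one classifier in three ways, using 2 property features, 4 object-type features, or one memorized feature per data point, and checks that all three agree on the data.

```python
import random

random.seed(0)

# Hypothetical dataset: 20 objects, each described by (is_red, is_cube).
objects = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(20)]
# Assumed labeling rule, for illustration only: a "rube" (1) is a red cube.
labels = [int(r and c) for r, c in objects]

# 1) Two property features: redness and cubeness, combined by a rule.
def by_properties(obj):
    is_red, is_cube = obj
    return is_red * is_cube

# 2) Four object-type features: one case per (color, shape) combination.
TYPE_TABLE = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

def by_object_type(obj):
    return TYPE_TABLE[obj]

# 3) Pure memorization: one "feature" per training example, keyed by index.
MEMORIZED = dict(enumerate(labels))

def by_memorization(index):
    return MEMORIZED[index]

# Number of features each description needs.
feature_counts = {
    "properties": 2,
    "object_types": len(TYPE_TABLE),
    "memorized": len(MEMORIZED),
}

# All three descriptions reproduce the same labels on the data.
agree = all(
    by_properties(o) == by_object_type(o) == by_memorization(i) == y
    for i, (o, y) in enumerate(zip(objects, labels))
)
print(feature_counts, agree)
```

All three descriptions fit the data equally well. They differ in how compact they are and in how they extend to new inputs: the memorized table has no entry for an unseen object, while the two-property rule does.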

Key takeaway

For research scientists grappling with feature definitions in mechanistic interpretability: "features" are not monolithic but lie on a spectrum from memorized data points to abstract computational primitives. Let this guide your interpretability work by choosing the feature interpretation that fits your analytical goal, whether that is debugging, formal verification, or understanding model generalization. Avoid rigid definitions; focus instead on what computation a neural network implements and how each component contributes to a compact description of that computation.

Key insights

Features in mechanistic interpretability exist on a spectrum from memorization to computational primitives.

Principles

No single "true" feature definition exists.
Features are useful if they reduce loss.
Interpretability benefits from case analysis.

In practice

Consider the feature spectrum when interpreting models.
Use case analysis for debugging model behavior.
Recognize memorized facts as valid features.

Topics

Mechanistic Interpretability
Neural Network Features
Sparse Autoencoders
Feature Spectrum
Model Behavior Analysis

Best for: Research Scientist, AI Researcher, AI Scientist, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.