Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

2026-05-30 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Query Lens is a method for interpreting sparse features learned by sparse autoencoders, which are often hard to characterize reliably. It extends the Logit Lens technique by jointly analyzing encoder-side key features and decoder-side value features, which identifies both the inputs that activate a feature and the outputs it promotes. Query Lens also accounts for indirect, module-mediated effects that arise when a feature is processed by subsequent modules, whereas Logit Lens captures only direct effects. According to the reported experiments, Query Lens produces coherent token signatures for features that Logit Lens could not interpret. The paper also introduces the Subspace Channel Hypothesis, which proposes that downstream modules access features via layer-specific subspaces.

Key takeaway

For machine learning engineers using sparse autoencoders to study model internals, Query Lens offers a more complete interpretation framework than Logit Lens alone. Consider applying it to features that stay opaque under Logit Lens, especially when indirect, module-mediated effects are suspected. The resulting token signatures can be more coherent and can show how a feature influences downstream processing, which may help when debugging or refining models.

Key insights

Query Lens improves the interpretability of sparse autoencoder features by analyzing both their direct effects and their indirect, module-mediated effects.

Principles

Jointly consider key and value features.
Account for module-mediated indirect effects.
Subspace Channel Hypothesis: downstream modules access features via layer-specific subspaces.

Method

Query Lens extends Logit Lens by jointly considering encoder-side key and decoder-side value features, identifying activation inputs and promoted outputs, and incorporating indirect, module-mediated effects from downstream processing.
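
The difference between a direct readout and a module-mediated one can be illustrated with a toy example. The sketch below is not the paper's implementation: the matrices, the tiny vocabulary, the names (W_E, W_U, key_dir, value_dir, M), and the choice to model the downstream module as a single linear map are all assumptions made for illustration. It shows a feature whose decoder direction is invisible to a direct unembedding readout but yields a clear token once a downstream module has processed it.

```python
# Toy sketch only: every matrix, name, and the linear "downstream module"
# below is a hypothetical stand-in, not the method from the paper.

def vecmat(v, m):
    """Multiply row vector v (length n) by matrix m (n x k); returns a length-k list."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def top_token(scores, vocab):
    """Return the vocabulary entry with the highest score."""
    best = max(range(len(scores)), key=lambda j: scores[j])
    return vocab[best]

VOCAB = ["cat", "dog", "Paris"]

# Hypothetical token embeddings (vocab x d_model), with d_model = 4.
W_E = [
    [1.0, 0.0, 0.0, 0.0],  # "cat"
    [0.0, 1.0, 0.0, 0.0],  # "dog"
    [0.0, 0.0, 1.0, 0.0],  # "Paris"
]

# Hypothetical unembedding (d_model x vocab). Row 3 is all zeros, so anything
# written only to residual dimension 3 is invisible to a direct readout.
W_U = [
    [0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

# One sparse feature: encoder-side (key) and decoder-side (value) directions.
key_dir = [0.0, 1.0, 0.0, 0.0]    # reads residual dimension 1
value_dir = [0.0, 0.0, 0.0, 1.0]  # writes residual dimension 3

# Hypothetical downstream module, linearized as a d_model x d_model map
# that moves residual dimension 3 onto dimension 0.
M = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

# Key side: which token embeddings align with the encoder direction?
key_scores = [sum(e * k for e, k in zip(emb, key_dir)) for emb in W_E]

# Value side, direct effect (Logit Lens style): unembed the decoder direction.
direct_logits = vecmat(value_dir, W_U)

# Value side, indirect effect: pass through the downstream module, then unembed.
indirect_logits = vecmat(vecmat(value_dir, M), W_U)

print("activating token:", top_token(key_scores, VOCAB))
print("direct logits:", direct_logits)
print("indirect top token:", top_token(indirect_logits, VOCAB))
```

In this toy setup the direct logits are all zero, so a direct readout alone says nothing about the feature, while the module-mediated readout singles out one token. Real downstream modules are nonlinear, so treating one as a fixed matrix is a simplification chosen to keep the example short.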

In practice

Characterize previously uninterpretable features.
Gain deeper insights into sparse autoencoder behavior.

Topics

Sparse Autoencoders
Feature Interpretability
Query Lens
Logit Lens
Model Understanding
Subspace Channel Hypothesis

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.