Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

The paper introduces a field-theoretic framework for mechanistic interpretability of Transformers that treats the residual stream as a field over depth and token position. Activation patching is formulated as the insertion of a localized source, and patch effects are predicted from sensitivity fields and Green-function responses. Empirical tests on GPT-2-style autoregressive Transformers confirm a bounded local linear regime in which first-order sensitivities predict patch effects across residual sites. The framework also measures structured, anisotropic propagation and shows that prompt-induced residual displacements can transfer answer behavior. Together, these results establish response objects such as sensitivities, propagated fields, and Green-operator slices as a practical language for organizing patching experiments and as a mathematical basis for patch-site inference and cross-scale transfer.

Key takeaway

For AI Scientists and Machine Learning Engineers focused on Transformer interpretability, this field-theoretic framework offers a principled shift from enumerative patching to predictive, operator-based analysis. You should explore using autograd sensitivities to efficiently predict patch effects and identify critical intervention sites. This approach provides a mathematical basis for inferring optimal patch locations and understanding how model behavior scales across different Transformer sizes, streamlining mechanistic interpretability efforts.

Key insights

A field-theoretic framework unifies Transformer patching, prediction, and interpretability through response functions.

Principles

Patching can be predicted by response quantities.
Localized perturbations propagate anisotropically across depth and tokens.
The norm of the sensitivity field identifies high-influence sites.

Method

Formulate the residual stream as a depth-token field, model patching as localized source insertion, predict effects via sensitivity fields and Green-function responses, and use an adjoint variational problem for patch selection.
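
The method can be written out as a short first-order derivation. The notation below is assumed for illustration and is not taken from the paper: h_{l,t} is the residual vector at layer l and token t, M is a scalar behavioral metric such as a logit difference, and J, S, and G are hypothetical symbols for the source, the sensitivity field, and a Green-operator slice.

```latex
% Notation is assumed for illustration; the summary does not give the paper's symbols.
\begin{align}
J_{\ell,t} &= h^{\mathrm{patched}}_{\ell,t} - h^{\mathrm{base}}_{\ell,t}
  && \text{(localized source at layer } \ell,\ \text{token } t) \\
S_{\ell,t} &= \frac{\partial M}{\partial h_{\ell,t}}
  && \text{(sensitivity field)} \\
\Delta M &\approx \langle S_{\ell,t},\, J_{\ell,t} \rangle
  && \text{(first-order patch prediction)} \\
G_{(\ell',t') \leftarrow (\ell,t)} &= \frac{\partial h_{\ell',t'}}{\partial h_{\ell,t}},
\qquad
\delta h_{\ell',t'} \approx G_{(\ell',t') \leftarrow (\ell,t)}\, J_{\ell,t}
  && \text{(Green-operator slice and propagated field)}
\end{align}
```

Under these assumptions the chain rule links the two objects: if M reads only the residual at a single downstream site (l', t'), then S_{l,t} is the transpose of the Green-operator slice applied to the gradient of M at that site. In a causal autoregressive model the slice vanishes whenever l' < l or t' < t, so a source can only influence later layers and later token positions.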

In practice

Use autograd sensitivities to predict patch effects efficiently.
Identify high-sensitivity sites for targeted interventions.
Analyze sliced Green operators for local residual-to-residual geometry.
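
The first two recommendations above can be sketched on a toy model. This is a minimal sketch, not the paper's setup: the three-dimensional residual stack, the weights, the readout, and every function name are hypothetical, and a hand-written adjoint pass stands in for autograd so the example needs only the standard library.

```python
import math

# Hypothetical toy residual stack, NOT the paper's GPT-2 setup:
#   h_{l+1} = h_l + W_l @ tanh(h_l),   metric M = READOUT . h_final
WEIGHTS = [
    [[0.2, -0.1, 0.0], [0.1, 0.3, -0.2], [0.0, 0.1, 0.2]],
    [[-0.1, 0.2, 0.1], [0.0, -0.2, 0.3], [0.2, 0.0, -0.1]],
]
READOUT = [1.0, -0.5, 0.25]
BASE = [0.1, 0.2, -0.4]    # input of the run being patched (assumed values)
DONOR = [0.5, -0.3, 0.8]   # input of the run supplying activations (assumed values)


def layer(h, W):
    """One residual update: h + W @ tanh(h)."""
    a = [math.tanh(x) for x in h]
    n = len(h)
    return [h[i] + sum(W[i][j] * a[j] for j in range(n)) for i in range(n)]


def forward(h, start=0):
    """Return the residual states from site `start` through the final site."""
    states = [list(h)]
    for W in WEIGHTS[start:]:
        states.append(layer(states[-1], W))
    return states


def metric(h_final):
    return sum(r * x for r, x in zip(READOUT, h_final))


def sensitivity(states, site):
    """Adjoint pass: gradient of the metric w.r.t. the residual at `site`.

    `states` must be the full list returned by forward(x) with start=0.
    """
    g = list(READOUT)
    for l in range(len(WEIGHTS) - 1, site - 1, -1):
        h, W = states[l], WEIGHTS[l]
        n = len(h)
        # Layer Jacobian is I + W diag(1 - tanh(h)^2); apply its transpose to g.
        g = [
            g[j] + (1.0 - math.tanh(h[j]) ** 2) * sum(W[i][j] * g[i] for i in range(n))
            for j in range(n)
        ]
    return g


def patch_effect(base_in, donor_in, site, alpha=1.0):
    """Return (first-order predicted, actually measured) change in the metric.

    The source is alpha * (donor - base) inserted at `site`; alpha < 1 keeps
    the perturbation inside the local linear regime.
    """
    base = forward(base_in)
    donor = forward(donor_in)
    source = [alpha * (d - b) for d, b in zip(donor[site], base[site])]
    g = sensitivity(base, site)
    predicted = sum(gi * si for gi, si in zip(g, source))
    patched = [b + s for b, s in zip(base[site], source)]
    actual = metric(forward(patched, start=site)[-1]) - metric(base[-1])
    return predicted, actual


def rank_sites(base_in):
    """Order sites by the Euclidean norm of their sensitivity, largest first."""
    base = forward(base_in)
    norms = {
        s: math.sqrt(sum(x * x for x in sensitivity(base, s)))
        for s in range(len(base))
    }
    return sorted(norms, key=norms.get, reverse=True)


if __name__ == "__main__":
    for alpha in (0.01, 0.1, 1.0):
        predicted, actual = patch_effect(BASE, DONOR, site=0, alpha=alpha)
        print(f"alpha={alpha}: predicted={predicted:.6f} actual={actual:.6f}")
    print("sites by sensitivity norm:", rank_sites(BASE))
```

Comparing the predicted and measured values as alpha grows gives a rough picture of where the linear regime stops holding for this toy. On a real Transformer the adjoint pass would presumably be replaced by a framework's reverse-mode autograd, and sites would be indexed by both layer and token position.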

Topics

Mechanistic Interpretability
Transformer Architectures
Activation Patching
Green Functions
Linear Response Theory
GPT-2

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.