Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability
Summary
This paper introduces a field-theoretic framework for mechanistic interpretability of Transformers that treats the residual stream as a field over depth and token position. Activation patching is formulated as localized source insertion, and patch effects are predicted from sensitivity fields and Green-function responses. Empirical tests on GPT-2-style autoregressive Transformers confirm a bounded local linear regime in which first-order sensitivities predict patch effects across residual sites. The experiments also measure structured, anisotropic propagation of perturbations and show that prompt-induced residual displacements can transfer answer behavior. Together, these results establish response objects such as sensitivities, propagated fields, and Green-operator slices as a practical language for organizing patching experiments and as a mathematical basis for patch-site inference and cross-scale transfer.
Key takeaway
For AI scientists and machine learning engineers working on Transformer interpretability, this framework offers a principled shift from enumerative patching to predictive, operator-based analysis. Consider using autograd sensitivities to predict patch effects efficiently and to identify critical intervention sites. The framework also gives a mathematical basis for patch-site inference and for cross-scale transfer.
Key insights
A field-theoretic framework unifies Transformer patching, prediction, and interpretability through response functions.
Principles
- Patch effects can be predicted from response quantities.
- Localized perturbations propagate anisotropically across depth and tokens.
- The norm of the sensitivity field identifies high-influence sites.
Method
Formulate the residual stream as a depth-token field; model patching as localized source insertion; predict patch effects via sensitivity fields and Green-function responses; use an adjoint variational problem for patch selection.
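The method above can be written out to first order. The notation here is assumed for illustration and is not taken from the paper: $h_{\ell,t}$ is the residual vector at depth $\ell$ and token $t$, $y$ is a scalar behavior metric (for example a logit difference), and $\delta h_{\ell,t}$ is the displacement a patch inserts at one site.

```latex
\begin{align}
  s_{\ell,t} &= \frac{\partial y}{\partial h_{\ell,t}}
    && \text{(sensitivity field)} \\
  \Delta y &\approx s_{\ell,t}^{\top}\,\delta h_{\ell,t}
    && \text{(first-order patch effect)} \\
  G_{(\ell',t'),(\ell,t)} &= \frac{\partial h_{\ell',t'}}{\partial h_{\ell,t}},
  \qquad
  \delta h_{\ell',t'} \approx G_{(\ell',t'),(\ell,t)}\,\delta h_{\ell,t}
    && \text{(Green-operator slice)} \\
  s_{\ell,t} &= \sum_{t'} G_{(\ell',t'),(\ell,t)}^{\top}\, s_{\ell',t'}
    \quad \text{for any fixed } \ell' > \ell
    && \text{(adjoint relation)}
\end{align}
```

In a causal autoregressive model, $G_{(\ell',t'),(\ell,t)} = 0$ whenever $t' < t$ or $\ell' < \ell$, which is one structural reason propagation cannot be isotropic across depth and tokens. The linear approximations are only expected to hold for sufficiently small $\delta h_{\ell,t}$, which is what a bounded local linear regime refers to.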
In practice
- Use autograd sensitivities to predict patch effects efficiently.
- Identify high-sensitivity sites for targeted interventions.
- Analyze sliced Green operators for local residual-to-residual geometry.
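The first two practices above can be tried on a toy model before touching a real Transformer. The sketch below is a hedged illustration, not the paper's code: the tiny causal residual network, the names `step`, `run`, `sensitivity`, `compare`, and `rank_sites`, and all sizes are hypothetical. Central finite differences stand in for autograd so the example needs only the standard library; in a real framework a single backward pass would return the sensitivity at every site.

```python
import math
import random

def step(h, W):
    """One toy residual layer: each token adds tanh(W @ causal mean of the stream)."""
    T, d = len(h), len(h[0])
    out = []
    for t in range(T):
        # Causal context: mean over tokens 0..t only.
        ctx = [sum(h[u][i] for u in range(t + 1)) / (t + 1) for i in range(d)]
        upd = [math.tanh(sum(W[i][j] * ctx[j] for j in range(d))) for i in range(d)]
        out.append([h[t][i] + upd[i] for i in range(d)])
    return out

def run(h0, Ws, readout, patch=None):
    """Scalar metric read from the last token. patch = (layer, token, delta)
    adds delta to the residual entering that layer at that token."""
    h = [row[:] for row in h0]
    for l, W in enumerate(Ws):
        if patch is not None and patch[0] == l:
            _, t, delta = patch
            h[t] = [a + b for a, b in zip(h[t], delta)]
        h = step(h, W)
    return sum(r * x for r, x in zip(readout, h[-1]))

def sensitivity(h0, Ws, readout, layer, token, eps=1e-5):
    """Gradient of the metric w.r.t. one residual site (finite-difference stand-in for autograd)."""
    d = len(h0[0])
    grad = []
    for i in range(d):
        e = [0.0] * d
        e[i] = eps
        plus = run(h0, Ws, readout, (layer, token, e))
        e[i] = -eps
        minus = run(h0, Ws, readout, (layer, token, e))
        grad.append((plus - minus) / (2 * eps))
    return grad

# Hypothetical toy sizes: T tokens, d residual dimensions, L layers.
random.seed(0)
T, d, L = 4, 3, 3
h0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
Ws = [[[random.gauss(0, 0.5) for _ in range(d)] for _ in range(d)] for _ in range(L)]
readout = [random.gauss(0, 1) for _ in range(d)]
base = run(h0, Ws, readout)

def compare(layer, token, delta):
    """Return (first-order predicted effect, actual patched effect) for one site."""
    s = sensitivity(h0, Ws, readout, layer, token)
    predicted = sum(a * b for a, b in zip(s, delta))
    actual = run(h0, Ws, readout, (layer, token, delta)) - base
    return predicted, actual

def rank_sites():
    """Return (sensitivity norm, layer, token) for every site, largest norm first."""
    sites = []
    for layer in range(L):
        for token in range(T):
            s = sensitivity(h0, Ws, readout, layer, token)
            sites.append((math.sqrt(sum(x * x for x in s)), layer, token))
    return sorted(sites, reverse=True)

print(compare(1, 2, [1e-4, -1e-4, 1e-4]))
print(rank_sites()[0])
```

Calling `compare` with progressively larger `delta` values is a simple way to see where the first-order prediction starts to drift from the actual patch effect in this toy setting.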
Topics
- Mechanistic Interpretability
- Transformer Architectures
- Activation Patching
- Green Functions
- Linear Response Theory
- GPT-2
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.