Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs
Summary
Hessian-Enhanced Token Attribution (HETA) is a framework for interpreting autoregressive, decoder-only Large Language Models (LLMs) by quantifying how much each input token contributes to the generated output. Existing attribution methods are often based on linear approximations or were designed for encoder architectures, and they struggle with the causal and semantic complexities of autoregressive generation. HETA integrates three components: a semantic transition vector that captures token-to-token influence, Hessian-based sensitivity scores that capture second-order effects, and KL divergence that measures the information lost when a token is masked. Together these aim to produce context-aware, causally faithful, and semantically grounded attributions. The authors also introduce a curated benchmark dataset for evaluating attribution quality in generative settings. In empirical evaluations across multiple models and datasets, HETA outperforms baseline methods in attribution faithfulness and alignment with human annotations.
Key takeaway
For AI Engineers and Research Scientists developing or deploying decoder-only LLMs, HETA offers a method for understanding token contributions. Because it captures non-linear interactions and causal paths, it provides more faithful and stable attributions than traditional gradient-based or attention-based methods. Consider integrating HETA in critical applications that require high interpretability, especially when debugging model behavior or ensuring ethical compliance, since it consistently outperforms baselines in faithfulness and human alignment.
Key insights
HETA improves LLM interpretability by combining semantic flow, Hessian-based sensitivity, and KL divergence for causally faithful token attribution.
Principles
- Attention weights alone are insufficient for causal attribution.
- First-order gradients miss non-linear token influence.
- Second-order effects are crucial for capturing full sensitivity.
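The second and third principles can be illustrated with a toy function rather than a real LLM. The sketch below is not HETA itself: `toy_score`, the finite-difference helpers, and the row-sum aggregation are all assumptions made for illustration. It shows a case where the gradient is zero for both inputs, so a first-order method would report no influence, while the Hessian exposes the interaction between them.

```python
def toy_score(x1, x2):
    """Hypothetical scalar output that depends on its inputs only through their product."""
    return x1 * x2


def grad(f, x, eps=1e-4):
    """Central-difference gradient of f at point x (a list of floats)."""
    g = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        g.append((f(*hi) - f(*lo)) / (2 * eps))
    return g


def hessian(f, x, eps=1e-3):
    """Central-difference Hessian of f at point x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                p = list(x)
                p[i] += di
                p[j] += dj
                return f(*p)
            H[i][j] = (
                shifted(eps, eps) - shifted(eps, -eps)
                - shifted(-eps, eps) + shifted(-eps, -eps)
            ) / (4 * eps * eps)
    return H


point = [0.0, 0.0]
first_order = grad(toy_score, point)      # both entries vanish at the origin
second_order = hessian(toy_score, point)  # off-diagonal entries reveal the interaction

# Assumed aggregation for illustration only: sum of absolute Hessian entries per input.
sensitivity = [sum(abs(v) for v in row) for row in second_order]
print(first_order, second_order, sensitivity)
```

At this point a gradient-only attribution would rank both inputs as irrelevant, whereas the second-order sensitivity assigns each a non-zero score. On a real model the Hessian would be taken with respect to token embeddings and computed with automatic differentiation rather than finite differences.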
Method
HETA combines a semantic transition vector, Hessian-based sensitivity scores, and KL divergence to measure information loss, producing context-aware, causally faithful, and semantically grounded attributions for decoder-only LLMs.
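As a rough illustration of the KL divergence component only, the sketch below masks each input token in turn and measures how far the next-token distribution moves. Everything here is an assumption for illustration: `toy_next_token_probs` is a hand-written stand-in for a model, the `[MASK]` placeholder and the direction of the divergence are choices made for this sketch, and the summary does not specify how HETA combines this term with the other two.

```python
import math

VOCAB = ["paris", "london", "rome"]
MASK = "[MASK]"


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_next_token_probs(tokens):
    """Hypothetical stand-in for an LLM: hand-written logits over VOCAB."""
    logits = [0.0, 0.0, 0.0]
    if "France" in tokens:
        logits[0] += 3.0  # strong evidence for "paris"
    if "capital" in tokens:
        logits[0] += 0.5  # weak extra evidence for "paris"
    return softmax(logits)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def masking_kl_scores(tokens):
    """Score each token by the KL divergence caused by masking it."""
    full = toy_next_token_probs(tokens)
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        scores.append(kl_divergence(full, toy_next_token_probs(masked)))
    return scores


prompt = ["The", "capital", "of", "France", "is"]
scores = masking_kl_scores(prompt)
for token, score in zip(prompt, scores):
    print(f"{token:>8s}  {score:.4f}")
```

In this toy setup, masking "France" shifts the distribution the most, masking "capital" shifts it slightly, and masking the function words leaves it unchanged, which is the kind of information-loss signal the KL term is described as measuring.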
In practice
- Use HETA for more reliable LLM interpretability.
- Evaluate attribution methods with the NarrativeQA ⊕ SciQ dataset.
- Consider low-rank or windowed HETA for efficiency.
Topics
- Hessian-Enhanced Token Attribution
- Autoregressive LLMs
- Model Interpretability
- Attribution Methods
- Decoder-Only Architectures
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.