Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

Hessian-Enhanced Token Attribution (HETA) is a framework for interpreting autoregressive, decoder-only large language models (LLMs) by quantifying how much each input token contributes to a generated output. Existing attribution methods often rely on linear approximations or were designed for encoder architectures, so they struggle with the causal and semantic structure of autoregressive generation. HETA integrates three components: a semantic transition vector that captures token-to-token influence, Hessian-based sensitivity scores that capture second-order effects, and KL divergence that measures the information lost when a token is masked. The combined score is intended to be context-aware, causally faithful, and semantically grounded. The authors also introduce a curated benchmark dataset for evaluating attribution quality in generative settings. In their evaluations across multiple models and datasets, HETA outperforms the baselines on attribution faithfulness and on alignment with human annotations.

Key takeaway

For AI engineers and research scientists developing or deploying decoder-only LLMs, HETA offers a way to measure how much each input token contributes to a generated output. Because it accounts for non-linear interactions and causal paths, the authors report more faithful and stable attributions than gradient-based or attention-based methods. Consider it for applications where interpretability matters, such as debugging model behavior or checking ethical compliance, since the reported results show it outperforming baselines on faithfulness and human alignment.
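To see why second-order terms can matter for attribution, consider a generic second-order Taylor expansion. This is a textbook illustration, not HETA's own formula; the function f, the perturbation δ, and the toy example are assumptions chosen for clarity.

```latex
% Second-order Taylor expansion of a scalar function f around x
f(x + \delta) \approx f(x) + \nabla f(x)^{\top} \delta
              + \tfrac{1}{2}\, \delta^{\top} H(x)\, \delta

% Toy example: f(x_1, x_2) = x_1 x_2, expanded at x = (0, 0)
\nabla f(0,0) = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad
H(0,0) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
f(\delta) = \tfrac{1}{2}\, \delta^{\top} H(0,0)\, \delta = \delta_1 \delta_2
```

In this toy case the gradient vanishes, so a purely first-order attribution assigns zero to both inputs even though the output depends on their product. The off-diagonal Hessian entries recover that interaction.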

Key insights

HETA improves LLM interpretability by combining semantic flow, Hessian-based sensitivity, and KL divergence for causally faithful token attribution.

Principles

Method

HETA combines a semantic transition vector, Hessian-based sensitivity scores, and a KL-divergence measure of information loss to produce context-aware, causally faithful, and semantically grounded attributions for decoder-only LLMs.
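The three ingredients named above can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_probs` is a hypothetical stand-in for an LLM, the Hessian is estimated by finite differences, the `transition` weights are supplied by hand rather than derived from the model, and the combination rule with weights `beta` and `gamma` is an assumption, since the summary does not say how the components are merged.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def toy_probs(emb, W):
    """Hypothetical stand-in for an LLM: per-token tanh, summed, then vocab logits."""
    h = np.tanh(emb).sum(axis=0)
    return softmax(W @ h)

def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def masking_kl(emb, W):
    """KL(p_full || p_masked) per token, where masking zeroes that token's embedding."""
    p_full = toy_probs(emb, W)
    scores = np.zeros(emb.shape[0])
    for i in range(emb.shape[0]):
        masked = emb.copy()
        masked[i] = 0.0
        scores[i] = kl_divergence(p_full, toy_probs(masked, W))
    return scores

def hessian_sensitivity(emb, W, target, h=1e-3):
    """Frobenius norm of a finite-difference Hessian of log p(target) per token embedding."""
    n, d = emb.shape

    def log_prob(e):
        return np.log(toy_probs(e, W)[target])

    scores = np.zeros(n)
    for i in range(n):
        H = np.zeros((d, d))
        for a in range(d):
            for b in range(d):
                def shifted(sa, sb):
                    e = emb.copy()
                    e[i, a] += sa * h
                    e[i, b] += sb * h
                    return log_prob(e)
                # Central-difference estimate of the mixed second derivative.
                H[a, b] = (shifted(1, 1) - shifted(1, -1)
                           - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)
        scores[i] = np.linalg.norm(H)
    return scores

def normalize(v):
    total = v.sum()
    return v / total if total > 0 else v

def toy_attribution(emb, W, target, transition, beta=0.5, gamma=0.5):
    """Assumed combination rule: transition weight times a weighted sum of both scores."""
    s = normalize(hessian_sensitivity(emb, W, target))
    k = normalize(np.clip(masking_kl(emb, W), 0.0, None))
    return normalize(transition * (beta * s + gamma * k))

# Usage on random toy data: 4 tokens, embedding size 3, vocabulary size 5.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 3))
W = rng.normal(size=(5, 3))
transition = np.full(4, 0.25)  # hand-set, uniform token-to-target influence
attr = toy_attribution(emb, W, target=2, transition=transition)
print(attr)
```

For a real decoder-only model, the finite-difference loop would be far too slow. An automatic-differentiation Hessian-vector product is the usual substitute, and the transition weights would need to come from the model itself rather than being set by hand.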

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.