Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

2026-04-14 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

Hessian-Enhanced Token Attribution (HETA) is a framework for interpreting predictions from decoder-only autoregressive large language models (LLMs). Existing attribution methods, many of them designed for encoder-based architectures, rely on linear approximations and so struggle to capture the causal and semantic complexity of autoregressive generation. HETA combines three components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that capture second-order effects, and a KL divergence term that quantifies the information lost when a token is masked. Together these are intended to produce context-aware, causally faithful, and semantically grounded attributions. The authors also introduce a benchmark dataset for evaluating attribution quality in generative settings, and report that HETA outperforms existing methods on faithfulness and on alignment with human annotations.

Key takeaway

For research scientists developing or deploying autoregressive LLMs, HETA is reported to attribute predictions more faithfully than earlier techniques. Consider adding it to your interpretability toolkit, especially where causal faithfulness and semantic grounding matter. Together with its benchmark dataset, it gives you a way to evaluate and explain generative model behavior, which may help with debugging and with building trust in model outputs.

Key insights

HETA is a unified framework for interpreting autoregressive LLMs that combines a semantic transition vector, Hessian-based sensitivity scores, and a KL divergence measure of information loss.

Principles

Linear approximations are insufficient for autoregressive LLM attribution.
Second-order effects are crucial for causal faithfulness.
Context-awareness improves attribution quality.
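
To make the first two principles concrete, here is a minimal sketch on a toy function, not on an LLM and not taken from the paper. The function f, the evaluation point, and the helper names grad and hessian are all illustrative assumptions. It shows a case where the gradient assigns zero sensitivity to an input whose interaction with another input only appears in the Hessian.

```python
# Toy scalar function standing in for a model output (assumption: NOT an LLM).
# It is a pure interaction term, so its effect is invisible to some gradients.
def f(x):
    return x[0] * x[1]

def grad(fn, x, eps=1e-4):
    """Central-difference gradient of fn at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((fn(xp) - fn(xm)) / (2 * eps))
    return g

def hessian(fn, x, eps=1e-3):
    """Central-difference Hessian of fn at x."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            xpp, xpm, xmp, xmm = list(x), list(x), list(x), list(x)
            xpp[i] += eps; xpp[j] += eps
            xpm[i] += eps; xpm[j] -= eps
            xmp[i] -= eps; xmp[j] += eps
            xmm[i] -= eps; xmm[j] -= eps
            H[i][j] = (fn(xpp) - fn(xpm) - fn(xmp) + fn(xmm)) / (4 * eps * eps)
    return H

x = [0.0, 2.0]
g = grad(f, x)      # first-order view: sensitivity to x[1] is zero here
H = hessian(f, x)   # second-order view: the off-diagonal entry is non-zero
print("gradient:", g)
print("hessian:", H)
```

At this point a purely linear attribution would treat the second input as irrelevant, while the off-diagonal Hessian entry records that it scales the effect of the first input.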

Method

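HETA combines three signals to attribute predictions in decoder-only LLMs: a semantic transition vector (token-to-token influence across layers), Hessian-based sensitivity scores (second-order effects), and KL divergence (information loss when a token is masked). The combined attribution is intended to be context-aware, causally faithful, and semantically grounded.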
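
The masking component can be illustrated with a small sketch. This is not the authors' implementation: toy_next_token_probs is a hypothetical stand-in for an LLM forward pass, the masking scheme (dropping a token's contribution) is an assumption, and so is the direction of the KL divergence, which the summary does not specify. Each token is scored by how far the next-token distribution moves when that token is masked.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def toy_next_token_probs(token_ids, vocab_size=5):
    """Hypothetical stand-in for a model: later tokens add more to the logits.
    A masked position is represented by None and contributes nothing."""
    logits = [0.0] * vocab_size
    for pos, tok in enumerate(token_ids):
        if tok is None:
            continue
        logits[tok % vocab_size] += 1.0 + 0.5 * pos
    return softmax(logits)

def masking_kl_scores(token_ids):
    """Score each input token by KL(original || masked) of the output distribution."""
    base = toy_next_token_probs(token_ids)
    scores = []
    for i in range(len(token_ids)):
        masked = list(token_ids)
        masked[i] = None  # assumption: masking simply removes the token's contribution
        scores.append(kl_divergence(base, toy_next_token_probs(masked)))
    return scores

tokens = [3, 1, 4]
base = toy_next_token_probs(tokens)
scores = masking_kl_scores(tokens)
print(scores)
```

With a real model, the toy function would be replaced by a forward pass that returns the next-token distribution, and the masking scheme would follow whatever the paper specifies.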

In practice

Use HETA for interpreting decoder-only LLM predictions.
Evaluate attribution methods using the new benchmark dataset.
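
As a rough illustration of what checking alignment with human annotations can look like, the sketch below computes precision at k between the top-scored tokens and a set of human-marked token positions. This metric is an assumption chosen for illustration; the summary does not say which metrics the benchmark uses, and the scores and annotations here are made-up placeholder values.

```python
def top_k_indices(scores, k):
    """Indices of the k highest attribution scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def precision_at_k(scores, human_indices, k):
    """Fraction of the top-k attributed tokens that humans also marked as important."""
    top = top_k_indices(scores, k)
    return len(top & set(human_indices)) / k

# Placeholder values for illustration only, not data from the benchmark.
attribution = [0.05, 0.40, 0.10, 0.35, 0.10]
human = {1, 3}
score = precision_at_k(attribution, human, k=2)
print(score)
```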

Topics

Hessian-Enhanced Token Attribution
Autoregressive LLMs
Token Attribution
Model Interpretability
Decoder-Only Models

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.