Detecting Translation Hallucinations with Attention Misalignment

2026-04-08 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, long

Summary

The article proposes an interpretable, token-level quality estimation (QE) method for Neural Machine Translation (NMT), aimed at the tendency of NMT models to hallucinate, especially in low-resource or rare language pairs. Unlike black-box signals such as output probability entropy, Semantic Entropy, or xCOMET, the approach reuses existing forward (src→tgt) and backward (tgt→src) NMT models. The backward model is run with teacher forcing, and the cross-attention maps of the two models are then compared after transposing one so that both share the same orientation. From this comparison the method extracts 75 attention-alignment features, grouped into Focus, Reciprocity, and Sink, and feeds them to a lightweight MLP classifier. Experiments on ZH→EN and FR→EN, using a dataset of 15k translations annotated via "LLM-as-a-judge," show that combining these attention features with output entropy improves QE performance over entropy alone, reaching ROC-AUCs of 0.750 and 0.849 respectively.
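
To make the map comparison concrete, here is a minimal sketch in plain Python. It assumes the forward map has one row per target token (weights over source tokens) and the backward map has one row per source token (weights over target tokens). The function names and the overlap score are illustrative assumptions, not the article's actual Reciprocity features.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def normalize(row):
    """Rescale a row to sum to 1 (all zeros if the row has no mass)."""
    total = sum(row)
    return [x / total for x in row] if total > 0 else [0.0] * len(row)


def reciprocity_scores(fwd_attn, bwd_attn):
    """Per-target-token agreement between forward and backward attention.

    fwd_attn: [n_target][n_source] cross-attention of the src->tgt model.
    bwd_attn: [n_source][n_target] cross-attention of the tgt->src model.
    The score is the histogram overlap of the two distributions over
    source tokens (an assumed, illustrative measure): 1.0 means full
    agreement, 0.0 means the models look at disjoint source tokens.
    """
    # Transposing the backward map puts target tokens on the rows.
    bwd_as_fwd = [normalize(row) for row in transpose(bwd_attn)]
    return [
        sum(min(f, b) for f, b in zip(f_row, b_row))
        for f_row, b_row in zip(fwd_attn, bwd_as_fwd)
    ]
```

Under these assumptions, a target token whose score is close to 0 is one where the two models disagree about which source tokens support it, which is the kind of misalignment the method treats as a warning sign.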

Key takeaway

Machine Learning Engineers building or deploying NMT systems should consider integrating this bidirectional attention-based quality estimation method. It provides interpretable, token-level uncertainty signals without retraining the core NMT model, so you can direct resources to difficult translations or flag potential hallucinations before deployment. The reported results show a significant improvement over entropy-only methods, especially for typologically distant language pairs.

Key insights

Bidirectional attention map comparison offers an interpretable and efficient way to detect NMT hallucinations.

Principles

Uncertainty does not always mean error.
Attention patterns reveal model grounding.
Combine signals for robust error detection.
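
The entropy signal that these principles refer to can be sketched as follows. This is a minimal illustration in plain Python; `token_entropy` and `combine_signals` are hypothetical helper names, and the feature layout is an assumption rather than the article's format.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one decoding step's output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def combine_signals(entropy, attention_features):
    """Build one per-token feature vector: entropy first, then attention features."""
    return [entropy] + list(attention_features)
```

A high entropy value only says that probability mass is spread out, for instance across several acceptable alternatives, so it is not by itself evidence of an error. That is the motivation for passing it to the classifier alongside the attention features instead of thresholding it alone.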

Method

Obtain forward and backward NMT models (training them if none exist), generate translations, extract 75 attention-based features (Focus, Reciprocity, Sink) per token, and train a lightweight MLP classifier on these features while the NMT weights stay frozen.
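
As a rough illustration of the per-token feature extraction step, the sketch below computes two Focus-style values and one Sink-style value from a single attention row. The specific features, the function names, and the `sink_positions` argument (indices of tokens such as end-of-sequence markers that tend to absorb attention) are assumptions for illustration; the article's 75 features are not reproduced here.

```python
import math


def focus_features(attn_row):
    """Return (peak weight, entropy) of one token's attention distribution.

    A sharp peak with low entropy suggests the token is grounded in a few
    source tokens; a flat row suggests diffuse, unfocused attention.
    """
    entropy = -sum(w * math.log(w) for w in attn_row if w > 0)
    return max(attn_row), entropy


def sink_mass(attn_row, sink_positions):
    """Total attention placed on assumed sink positions (e.g. an EOS token)."""
    return sum(attn_row[i] for i in sink_positions)


def sentence_features(attn, sink_positions):
    """One [peak, entropy, sink_mass] vector per row of an attention map."""
    rows = []
    for attn_row in attn:
        peak, entropy = focus_features(attn_row)
        rows.append([peak, entropy, sink_mass(attn_row, sink_positions)])
    return rows
```

Vectors like these, one per translated token, are the kind of input a small classifier could be trained on while the NMT models themselves remain untouched.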

In practice

Use existing forward/backward NMT models.
Train a small classifier on attention features.
Apply to RAG or summarization for grounding.

Topics

Neural Machine Translation
Translation Hallucinations
Attention Misalignment
Quality Estimation
Cross-Attention Maps

Best for: AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.