Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

2026-06-15 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study titled "Speaking in Self-Assessing Tongues" investigates how reliable large language models (LLMs) are when they verbalize confidence in their own machine translation outputs. Traditional unsupervised methods rely on internal signals such as predicted probabilities, which can be misleading and require access to model internals. The researchers devised five novel verbalized methods that extract per-token confidence from LLMs without needing internal signals. These methods were compared against the models' internal certainty signals, using fine-grained error detection and calibration as alignment metrics. The findings indicate that internal and verbalized methods are similarly reliable, though performance varies across LLM architectures. Notably, the study found little to no correlation between the confidence levels produced by the internal and the verbalized techniques.

Key takeaway

If you are an NLP engineer assessing the reliability of LLM output in a machine translation system, consider integrating verbalized confidence extraction methods. These techniques offer a viable alternative when internal signals are not accessible, and they perform comparably in error detection and calibration. Evaluate each LLM architecture separately, since reliability varies, and do not assume that verbalized and internal confidence agree. This makes quality assessment possible even without access to model internals.

Key insights

LLMs' verbalized confidence in translation outputs can be as reliable as internal signals, despite little to no correlation between the two.

Principles

LLM confidence can be extracted verbally.
Internal and verbalized signals show little to no correlation.
Reliability varies by LLM architecture.
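
One way to check the second principle on your own data is a rank correlation between the two confidence signals for the same tokens. The sketch below is illustrative only: the summary does not say which correlation measure the study used, so Spearman's rank correlation is an assumption, and the function names and example scores are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def spearman(internal: Sequence[float], verbalized: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    if len(internal) != len(verbalized) or len(internal) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    return pearson(average_ranks(internal), average_ranks(verbalized))


# Hypothetical per-token scores for one translation (not data from the study).
internal_conf = [0.91, 0.35, 0.78, 0.66]
verbalized_conf = [0.60, 0.90, 0.70, 0.80]
rho = spearman(internal_conf, verbalized_conf)
```

A value of `rho` near zero would be consistent with the reported lack of correlation, while values near 1 or -1 would indicate that the two signals rank tokens similarly or inversely.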

Method

Researchers devised five verbalized methods to extract per-token confidence from LLMs in machine translation. These were evaluated against internal signals using fine-grained error detection and calibration.
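
To make the calibration part of this evaluation concrete, here is a minimal sketch of expected calibration error (ECE) over per-token confidences. It is not the study's evaluation code: the summary does not state which calibration metric was used, so ECE with equal-width bins is an assumption, and the function name, bin count, and example values are hypothetical.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean gap between average confidence and accuracy per bin.

    confidences: per-token confidence scores in [0, 1].
    correct: True where the token is error-free according to reference labels.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each token to an equal-width bin; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical example: four tokens rated 0.9, one of which is an error.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

Because the function only needs a confidence score and a correctness label per token, the same code can score internal and verbalized confidence side by side.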

In practice

Use verbalized confidence for MT.
Consider LLM-specific reliability.
Explore per-token confidence.
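
As a rough illustration of how per-token verbalized confidence might be consumed downstream, the sketch below parses a model response and flags tokens for review. The `token|score` response format, the function names, and the 0.5 threshold are all assumptions made for this example; the summary does not describe the prompts or output formats of the five methods.

```python
import re
from typing import List, Optional, Tuple

# Assumed format: whitespace-separated items such as "Haus|0.40".
_PAIR = re.compile(r"^(?P<token>.+)\|(?P<conf>\d+(?:\.\d+)?)$")


def parse_verbalized(response: str) -> List[Tuple[str, Optional[float]]]:
    """Split a response into (token, confidence) pairs.

    Confidence is None when the item has no score or the score is outside [0, 1].
    """
    pairs: List[Tuple[str, Optional[float]]] = []
    for item in response.split():
        match = _PAIR.match(item)
        if match is None:
            pairs.append((item, None))
            continue
        conf = float(match.group("conf"))
        pairs.append((match.group("token"), conf if 0.0 <= conf <= 1.0 else None))
    return pairs


def flag_low_confidence(
    pairs: List[Tuple[str, Optional[float]]], threshold: float = 0.5
) -> List[str]:
    """Return tokens whose confidence is missing or below the threshold."""
    return [tok for tok, conf in pairs if conf is None or conf < threshold]


# Hypothetical model response for a German translation.
pairs = parse_verbalized("Das|0.95 Haus|0.40 ist|0.9 grün")
flagged = flag_low_confidence(pairs)
```

Treating unparseable items as low confidence is a conservative design choice: a model that fails to follow the format is routed to review instead of being silently trusted.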

Topics

Large Language Models
Machine Translation
LLM Confidence
Verbalized Assessment
Error Detection
Model Calibration

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.