Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation
Summary
A study titled "Speaking in Self-Assessing Tongues" investigates how reliable large language models' (LLMs) verbalized confidence is for their machine translation outputs. Traditional unsupervised confidence estimates rely on internal signals such as predicted probabilities, which can be misleading and require access to the model's internals. The researchers devised five novel verbalized methods that extract per-token confidence from LLMs without needing those internal signals. They compared these methods against the models' internal certainty signals, measuring reliability through fine-grained error detection and calibration. The findings indicate that internal and verbalized methods are similarly reliable, though performance varies across LLM architectures. Notably, the study found little to no correlation between the confidence levels produced by the internal and the verbalized techniques.
Key takeaway
NLP engineers who build machine translation systems and need to assess the reliability of LLM output should consider verbalized confidence extraction. These techniques are a viable alternative to internal signals, performing comparably in error detection and calibration. Evaluate each LLM architecture separately, because reliability varies, and do not assume that verbalized and internal confidence agree. Because verbalized methods need no internal signals, they allow quality assessment even when those signals are unavailable.
Key insights
LLMs' verbalized confidence in translation outputs can be as reliable as internal signals, even though the two show little to no correlation with each other.
Principles
- LLM confidence can be extracted verbally.
- Internal and verbalized confidence show little to no correlation.
- Reliability varies by LLM architecture.
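One way to check the second principle on your own data is to correlate the two confidence signals token by token. The sketch below is only an illustration: the summary does not say which correlation measure the study used, so Pearson correlation is an assumption here, and the function name `pearson_correlation` and the sample values are hypothetical.

```python
import math
from typing import Sequence


def pearson_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long lists of per-token confidences.

    Assumption: Pearson is used for illustration only; the study's actual
    correlation measure is not stated in this summary.
    """
    if len(xs) != len(ys):
        raise ValueError("both signals must cover the same tokens")
    if len(xs) < 2:
        raise ValueError("need at least two tokens")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-token confidences for one translated sentence.
internal = [0.91, 0.42, 0.77, 0.95, 0.60]    # e.g. predicted token probabilities
verbalized = [0.80, 0.90, 0.70, 0.85, 0.90]  # e.g. scores the model states in text
r = pearson_correlation(internal, verbalized)
```

A value of `r` near zero on a large sample would match the study's observation that the two signals disagree on which tokens are uncertain, even if each is similarly reliable on its own.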
Method
Researchers devised five verbalized methods to extract per-token confidence from LLMs in machine translation. These were evaluated against internal signals using fine-grained error detection and calibration.
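Calibration, one of the two evaluation criteria named above, is commonly summarized with expected calibration error (ECE). The summary does not state which calibration metric the study used, so treat the following as a minimal sketch under that assumption; the function name, the ten equal-width bins, and the sample data are all hypothetical choices.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    is_correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE over per-token confidences in [0, 1] with equal-width bins.

    Assumption: ECE is a standard calibration measure chosen for
    illustration; it is not confirmed to be the study's metric.
    """
    if len(confidences) != len(is_correct):
        raise ValueError("need one correctness label per confidence")
    if not confidences:
        raise ValueError("need at least one token")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Group tokens by confidence bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, is_correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of tokens.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical example: four tokens stated at 0.9 confidence, three translated correctly.
score = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

In this toy example the model claims 90% confidence but is right 75% of the time, so the gap is about 0.15. Lower values mean the stated confidence tracks the observed token-level accuracy more closely.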
In practice
- Use verbalized confidence for MT.
- Consider LLM-specific reliability.
- Explore per-token confidence.
Topics
- Large Language Models
- Machine Translation
- LLM Confidence
- Verbalized Assessment
- Error Detection
- Model Calibration
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.