When are likely answers right? On Sequence Probability and Correctness in LLMs
Summary
A new study quantifies the relationship between sequence probability and correctness in large language models (LLMs) across decoding methods, hyperparameters, and prompt-answer pairs. The researchers found that higher sequence probability often predicts correctness when comparing different prompt-answer pairs within a fixed dataset. However, this correlation does not reliably extend to decoding decisions: increasing sequence probability by changing hyperparameters or methods does not consistently improve accuracy. Sequence probability is also an unreliable indicator of correctness among repeated responses generated from the same prompt. These findings clarify when decoding strategies can actually improve LLM correctness and offer practical guidance for self-consistency and verifier-free self-improvement techniques.
Key takeaway
ML engineers and AI scientists optimizing LLM outputs should reconsider relying solely on raising sequence probability through decoding methods or hyperparameters to boost accuracy. Sequence probability can indicate correctness across prompt-answer pairs within a dataset, but that signal does not reliably carry over to decoding changes or to repeated generations from the same prompt. Rather than assuming that higher probability always means a better answer, focus on strategies that address how well model likelihood aligns with correctness.
Key insights
Higher sequence probability often predicts correctness within datasets, but not across decoding decisions or repeated responses.
Principles
- Sequence probability predicts correctness across prompt-answer pairs within fixed datasets.
- Increasing sequence probability via decoding changes does not reliably improve accuracy.
- Sequence probability is not a good correctness indicator for repeated responses to the same prompt.
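To make the quantity in these principles concrete, here is a minimal sketch of how sequence probability is commonly computed: as the product of per-token probabilities, or equivalently the exponential of the summed token log-probabilities. The function names and the toy log-probabilities are hypothetical, chosen only for illustration; the study's own scoring details are not given in this summary.

```python
import math
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Log-probability of a whole sequence: the sum of its token log-probs."""
    return math.fsum(token_logprobs)


def sequence_prob(token_logprobs: Sequence[float]) -> float:
    """Probability of a whole sequence: exp of the summed token log-probs."""
    return math.exp(sequence_logprob(token_logprobs))


# Hypothetical per-token log-probs for two answers (not data from the study).
answer_a = [math.log(0.9), math.log(0.8)]
answer_b = [math.log(0.5), math.log(0.5)]

# answer_a is the more probable sequence; whether it is also the correct one
# is exactly the question the study examines.
more_probable = "a" if sequence_logprob(answer_a) > sequence_logprob(answer_b) else "b"
```

Working in log space avoids numerical underflow for long sequences, which is why the sum of log-probabilities is usually preferred over the raw product.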
In practice
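- Re-evaluate decoding strategies: do not assume that a method or hyperparameter setting that yields higher-probability sequences also yields higher accuracy.
- Inform self-consistency approaches: treat sequence probability with caution when choosing among repeated responses to the same prompt.
- Guide verifier-free self-improvement: apply the same caution when sequence probability stands in for an external check on correctness.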
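As an illustration of why this matters for self-consistency, the sketch below contrasts two ways of choosing a final answer from repeated samples of the same prompt: majority voting over answers (the usual form of self-consistency) and picking the single highest-probability sample. The sample data and function names are hypothetical and are not results from the study; the point is only that the two rules can disagree.

```python
from collections import Counter
from typing import List, Tuple

# Each sample is (final_answer, sequence_log_probability).
Sample = Tuple[str, float]


def majority_vote(samples: List[Sample]) -> str:
    """Return the answer that appears most often among the samples."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]


def most_probable(samples: List[Sample]) -> str:
    """Return the answer of the sample with the highest log-probability."""
    return max(samples, key=lambda s: s[1])[0]


# Hypothetical repeated responses to one prompt (illustrative values only).
samples = [("42", -3.1), ("42", -2.9), ("41", -1.5), ("42", -3.4)]

voted = majority_vote(samples)     # frequency-based choice
likeliest = most_probable(samples)  # probability-based choice
```

In this toy case the two rules pick different answers. Which rule is right for a given model and task is an empirical question, so it is worth measuring both against labeled data rather than assuming the most probable sample wins.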
Topics
- Large Language Models
- Decoding Methods
- Sequence Probability
- Model Correctness
- Self-consistency
- Verifier-free Self-improvement
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.