When are likely answers right? On Sequence Probability and Correctness in LLMs
Summary
The paper investigates how well sequence probability aligns with correctness in large language models (LLMs). The authors quantify this relationship across 8 decoding methods, 14 models (from the Qwen2.5, Qwen3, and Olmo3 families), and 6 benchmark datasets. The key finding is that higher sequence probability often predicts correctness for prompt-answer pairs within a fixed dataset, but this relationship does not reliably extend to decoding decisions. Specifically, increasing sequence probability by adjusting hyperparameters or changing decoding methods does not consistently improve accuracy. Sequence probability is also not a reliable indicator of correctness among repeated responses to the same prompt. Finally, the within-dataset correlation between probability and correctness generally increases with the model's overall accuracy on the task.
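As a minimal sketch of the quantity under discussion, the snippet below computes a sequence log-probability as the sum of per-token log-probabilities, which is the standard definition. The function name and the token probabilities are hypothetical, and whether the paper applies any length normalization is not stated in this summary.

```python
import math

def sequence_logprob(token_probs):
    """Return the log of the sequence probability.

    The sequence probability is the product of the conditional token
    probabilities, so its log is the sum of the per-token logs.
    """
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for two candidate answers.
answer_a = [0.9, 0.8, 0.95]
answer_b = [0.6, 0.7, 0.5]

logp_a = sequence_logprob(answer_a)
logp_b = sequence_logprob(answer_b)
```

Here `answer_a` is the more likely sequence, but per the findings above that alone does not establish that it is the correct one.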
Key takeaway
For AI scientists and machine learning engineers optimizing LLM inference: raising sequence probability by switching decoding methods or tuning hyperparameters does not reliably improve answer correctness. Log-probability does correlate strongly with correctness across prompt-answer pairs within a dataset, so use it as a signal at that level. When implementing self-consistency, prefer majority voting over probability weighting, because within-sample correlations are often weak. Make sure the model has sufficient task accuracy before attempting probability-based, verifier-free self-improvement.
Key insights
Sequence probability reliably predicts correctness only within a fixed dataset, not across decoding methods or hyperparameters.
Principles
- Within-dataset log-probability correlates with correctness.
- Raising sequence probability by tuning decoding hyperparameters does not reliably improve accuracy.
- Verifier-free self-improvement needs sufficient base accuracy.
Method
The study quantifies probability-correctness alignment across 8 decoding methods, 14 LLMs, and 6 benchmarks, analyzing correlations at the within-dataset, within-method, across-method, and within-sample levels.
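To make two of these levels concrete, here is a rough sketch that contrasts a within-dataset correlation (all prompt-answer pairs pooled) with a within-sample correlation (repeated responses to one prompt, which is how this summary uses the term). The Pearson coefficient, the function names, and the toy records are assumptions for illustration; the summary does not say which correlation statistic the paper uses.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation; returns None when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)

# Hypothetical records: (prompt_id, sequence log-probability, correct?).
records = [
    ("q1", -1.0, 1), ("q1", -1.2, 1), ("q1", -0.9, 1),
    ("q2", -5.0, 0), ("q2", -4.5, 1), ("q2", -4.8, 0),
    ("q3", -9.0, 0), ("q3", -8.5, 0), ("q3", -9.5, 0),
]

def within_dataset(records):
    """Correlate log-probability with correctness over all pooled pairs."""
    return pearson([lp for _, lp, _ in records],
                   [float(c) for _, _, c in records])

def within_sample(records):
    """Correlate log-probability with correctness separately per prompt."""
    by_prompt = defaultdict(list)
    for pid, lp, c in records:
        by_prompt[pid].append((lp, float(c)))
    return {pid: pearson([lp for lp, _ in rows], [c for _, c in rows])
            for pid, rows in by_prompt.items()}

dataset_r = within_dataset(records)
sample_r = within_sample(records)
```

In this toy data the pooled correlation is driven largely by differences between prompts, while prompts whose responses are all correct or all incorrect yield no within-sample correlation at all (`None`).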
In practice
- Prioritize majority voting over probability weighting for self-consistency.
- Tune decoding hyperparameters per method, model, and dataset.
- Evaluate verifier-free self-improvement only with high baseline accuracy.
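The first recommendation above can be sketched as follows: a minimal comparison of majority voting and probability-weighted voting over sampled answers to a single prompt. The function names and the sample values are hypothetical, and the weighting scheme (summing sequence probabilities per answer) is one common choice, not necessarily the variant the paper evaluates.

```python
import math
from collections import Counter, defaultdict

def majority_vote(samples):
    """Pick the answer that appears most often among the samples."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]

def probability_weighted_vote(samples):
    """Pick the answer with the largest summed sequence probability."""
    weights = defaultdict(float)
    for answer, logprob in samples:
        weights[answer] += math.exp(logprob)
    return max(weights, key=weights.get)

# Hypothetical samples for one prompt: (final answer, sequence log-probability).
samples = [("42", -6.0), ("42", -5.5), ("42", -6.2), ("17", -1.0)]

by_majority = majority_vote(samples)            # "42": three of four samples
by_weight = probability_weighted_vote(samples)  # "17": one very likely sample
```

The two schemes disagree here because a single high-probability response outweighs three lower-probability ones. When within-sample correlation between probability and correctness is weak, that extra weight carries little information about which answer is right.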
Topics
- Large Language Models
- Decoding Methods
- Sequence Probability
- Model Correctness
- Self-Consistency
- Verifier-Free Self-Improvement
- Qwen3
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.