When are likely answers right? On Sequence Probability and Correctness in LLMs

· Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The paper investigates the alignment between sequence probability and correctness in large language models (LLMs) across various decoding methods, models, and benchmarks. Researchers quantified this relationship using 8 decoding methods, 14 models (from Qwen2.5, Qwen3, and Olmo3 families), and 6 benchmark datasets. Key findings indicate that while higher sequence probability often predicts correctness for prompt-answer pairs within a fixed dataset, this relationship does not reliably extend to decoding decisions. Specifically, increasing sequence probability by adjusting hyperparameters or changing methods does not consistently improve accuracy. Furthermore, sequence probability is not a reliable indicator of correctness for repeated responses to the same prompt. The study also notes that within-dataset correlation between probability and correctness generally increases with the model's overall accuracy on the task.
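To make "sequence probability" concrete, here is a minimal sketch, assuming the usual autoregressive definition in which a response's log-probability is the sum of its per-token log-probabilities. The function names and the toy token values are illustrative assumptions, not taken from the paper.

```python
import math


def sequence_logprob(token_logprobs):
    """Log-probability of a whole response: the sum of its token log-probs."""
    return math.fsum(token_logprobs)


def sequence_prob(token_logprobs):
    """Probability of the whole response (can underflow for long outputs)."""
    return math.exp(sequence_logprob(token_logprobs))


# Toy per-token log-probs for two hypothetical responses to the same prompt.
high_prob_response = [-0.10, -0.20, -0.05]
low_prob_response = [-0.90, -1.20, -0.40]

higher = sequence_logprob(high_prob_response) > sequence_logprob(low_prob_response)
```

Comparing the two scores says which response the model considers more likely. Per the summary above, that ranking alone does not establish which response is correct.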

Key takeaway

For AI scientists and machine learning engineers optimizing LLM inference: raising sequence probability by switching decoding methods or tuning hyperparameters does not reliably improve answer correctness. Log-probability is most informative when comparing different prompt-answer pairs within a dataset, where it correlates strongly with correctness. When implementing self-consistency, prefer majority voting over probability weighting, because within-sample correlations are often weak. Before attempting probability-based, verifier-free self-improvement, check that the model already has sufficient accuracy on the task.
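The difference between the two self-consistency aggregation rules can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it assumes each sampled response has already been reduced to an `(answer, sequence_logprob)` pair, and the sample values are made up to show how the rules can disagree.

```python
import math
from collections import Counter, defaultdict


def majority_vote(samples):
    """Pick the answer that appears most often among the samples."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]


def probability_weighted_vote(samples):
    """Pick the answer whose samples carry the most total probability mass."""
    weights = defaultdict(float)
    for answer, logprob in samples:
        weights[answer] += math.exp(logprob)
    return max(weights, key=weights.get)


# Hypothetical samples for one prompt: (final answer, sequence log-prob).
samples = [("42", -6.0), ("42", -5.5), ("42", -6.2), ("41", -1.0)]

by_count = majority_vote(samples)
by_mass = probability_weighted_vote(samples)
```

Here the majority rule returns "42", while a single high-probability sample tips the weighted rule to "41". If probability tracks correctness only weakly among repeated responses to the same prompt, that kind of override has little justification.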

Key insights

Sequence probability reliably predicts correctness only within a fixed dataset, not across decoding methods or hyperparameters.

Principles

Method

The study quantifies probability-correctness alignment across 8 decoding methods, 14 LLMs, and 6 benchmarks, analyzing correlations at the within-dataset, within-method, across-method, and within-sample levels.
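Two of these levels can be illustrated with a small sketch. It assumes that correlation means the Pearson correlation between sequence log-probability and a 0/1 correctness label, and that "within-sample" means grouping repeated responses by prompt. Both are readings of this summary, not confirmed details of the paper, and the records are toy data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns None when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def within_dataset_correlation(records):
    """Correlate log-prob with correctness over all responses pooled."""
    return pearson([lp for _, lp, _ in records],
                   [float(ok) for _, _, ok in records])


def within_sample_correlations(records):
    """Correlate log-prob with correctness separately for each prompt."""
    by_prompt = {}
    for prompt_id, lp, ok in records:
        by_prompt.setdefault(prompt_id, []).append((lp, float(ok)))
    return {pid: pearson([lp for lp, _ in rows], [ok for _, ok in rows])
            for pid, rows in by_prompt.items()}


# Toy records: (prompt_id, sequence log-prob, is_correct).
records = [("easy", -1.0, True), ("easy", -1.2, True),
           ("hard", -5.0, False), ("hard", -5.5, True)]

pooled = within_dataset_correlation(records)
per_prompt = within_sample_correlations(records)
```

In this toy data the pooled correlation is positive because the easy prompt has both higher probability and higher accuracy. Within the hard prompt the sign flips, and the easy prompt has no variance in correctness to correlate with. This shows how a dataset-level signal can coexist with a weak or absent per-prompt signal.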

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.