When are likely answers right? On Sequence Probability and Correctness in LLMs

2026-06-25 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new study quantifies the relationship between sequence probability and correctness in large language models (LLMs) across decoding methods, hyperparameters, and prompt-answer pairs. The researchers found that higher sequence probability often predicts correctness when comparing different prompt-answer pairs within a fixed dataset. That correlation does not reliably carry over to decoding decisions: raising sequence probability by changing hyperparameters or decoding methods does not consistently improve accuracy. Sequence probability is also an unreliable indicator of correctness among repeated responses generated from the same prompt. The findings clarify when decoding strategies can actually improve LLM correctness and offer practical guidance for self-consistency and verifier-free self-improvement techniques.
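For readers who want the quantity pinned down: a sequence's probability under an autoregressive model is the product of its per-token conditional probabilities, which is usually computed as a sum of log-probabilities. The sketch below assumes you already have per-token log-probabilities from your model; the function names are illustrative, and this summary does not say whether the study length-normalizes, so the normalized variant is shown only as a common alternative.

```python
import math

def sequence_logprob(token_logprobs):
    """Sum of per-token log-probabilities = log of the sequence probability."""
    return sum(token_logprobs)

def sequence_prob(token_logprobs):
    """Sequence probability; can underflow to 0.0 for long sequences."""
    return math.exp(sequence_logprob(token_logprobs))

def mean_token_logprob(token_logprobs):
    """Length-normalized variant (an assumption, not necessarily the study's choice)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return sequence_logprob(token_logprobs) / len(token_logprobs)

# Hypothetical two-token answer where each token has probability 0.5.
example = [math.log(0.5), math.log(0.5)]
print(sequence_prob(example))
```

Working in log space avoids numerical underflow, which is why comparisons between answers are normally made on log-probabilities rather than raw probabilities.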

Key takeaway

ML engineers and AI scientists optimizing LLM outputs should reconsider relying solely on decoding methods or hyperparameters that raise sequence probability as a way to boost accuracy. Sequence probability can indicate correctness across prompt-answer pairs within a dataset, but that signal does not reliably translate into better performance when decoding settings change. Focus on strategies that address how well model likelihood aligns with truth rather than assuming higher probability always means a better answer, especially across repeated generations from the same prompt.

Key insights

Higher sequence probability often predicts correctness within datasets, but not across decoding decisions or repeated responses.

Principles

Sequence probability predicts correctness across prompt-answer pairs within fixed datasets.
Increasing sequence probability via decoding changes does not reliably improve accuracy.
Sequence probability is not a good correctness indicator for repeated responses to the same prompt.
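One way to check the first principle on your own data is to measure how well sequence log-probability ranks correct answers above incorrect ones across a dataset. The sketch below uses AUROC for this; that metric is an assumption chosen for illustration, not necessarily what the study reports, and the scores and labels are made-up placeholders.

```python
def auroc(scores, labels):
    """Probability that a random correct answer outscores a random incorrect one.

    scores: sequence log-probabilities, one per prompt-answer pair.
    labels: truthy if the answer is correct, falsy otherwise.
    Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical dataset: log-probabilities and correctness for four answers.
logprobs = [-1.2, -2.0, -4.5, -6.1]
correct = [1, 1, 0, 0]
print(auroc(logprobs, correct))
```

A value near 1.0 means probability separates correct from incorrect answers well, and 0.5 means it carries no ranking signal. Running the same check only over repeated samples from a single prompt is a simple way to probe the third principle.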

In practice

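Re-evaluate decoding strategies: a setting that raises sequence probability may not raise accuracy.
Inform self-consistency approaches: sequence probability is an unreliable guide among repeated samples from the same prompt.
Guide verifier-free self-improvement: treat sequence probability as an unreliable stand-in for correctness when selecting among a model's own outputs.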
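To make the self-consistency point concrete, the sketch below contrasts two ways of choosing a final answer from repeated samples of one prompt: majority vote over answers, and picking the single highest-probability sample. The sample data and function names are hypothetical; the example only shows that the two rules can disagree and makes no claim about which answer is correct.

```python
from collections import Counter

def majority_vote(samples):
    """Return the most frequent answer among (answer, logprob) samples."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]

def highest_logprob(samples):
    """Return the answer of the single sample with the highest log-probability."""
    return max(samples, key=lambda sample: sample[1])[0]

# Hypothetical repeated generations for one prompt: (final answer, sequence log-prob).
samples = [("42", -3.1), ("42", -2.9), ("17", -1.2), ("42", -3.5)]

print(majority_vote(samples))    # most common answer
print(highest_logprob(samples))  # answer of the most probable sample
```

Here the vote selects "42" while the most probable sample says "17". Logging both selections alongside ground truth on a held-out set is one way to see which rule serves a given model and task.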

Topics

Large Language Models
Decoding Methods
Sequence Probability
Model Correctness
Self-consistency
Verifier-free Self-improvement

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.