Reliable Chain-of-Thought via Prefix Consistency
Summary
A technique called "prefix consistency" improves the accuracy and token efficiency of large language models (LLMs) on reasoning tasks, particularly on difficult problems. Introduced by Iwase et al., it extends standard majority voting (self-consistency) by exploiting a "reproduction-rate asymmetry": when a Chain-of-Thought (CoT) trace is truncated and its continuation regenerated, correct traces are more likely than incorrect ones to reproduce their original answer. Prefix consistency-weighted majority voting (PC-WMV) reweights candidate answers by this reliability signal and requires neither token log-probabilities nor self-rating prompts. Across five reasoning models and four math and science benchmarks, PC-WMV consistently outperforms existing weighted majority voting baselines as a correctness predictor, and it reaches the plateau accuracy of standard majority voting (MV) with up to 21x fewer tokens (median 4.6x). The technique is most effective when there is a large "discrimination gap", $D=r_C-r_W$, between the reproduction rates of correct ($r_C$) and wrong ($r_W$) answers.
Key takeaway
AI engineers and research scientists optimizing LLM reasoning should consider adding prefix consistency to their weighted majority voting strategies. The method reaches the accuracy of standard majority voting with up to 21x fewer tokens and improves accuracy on challenging math and science benchmarks where traditional self-consistency struggles. It works by reweighting votes according to how reliably a truncated CoT trace reproduces its original answer, so the gains are largest when the model shows a clear "discrimination gap" between the reproduction rates of correct and incorrect traces.
Key insights
Truncating and regenerating LLM Chain-of-Thought traces reveals a reliability signal for improved answer aggregation.
Principles
- Correct reasoning traces are more reproducible under regeneration.
- Reliability signals should separate correct from wrong traces, especially on difficult problems.
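These two principles can be checked on a labeled validation set by estimating the discrimination gap $D=r_C-r_W$ directly. The sketch below is illustrative only: the function name `discrimination_gap` and the `(is_correct, reproduced)` record layout are assumptions made for this example, not part of the authors' code.

```python
from statistics import mean

def discrimination_gap(records):
    """Estimate r_C, r_W and D = r_C - r_W.

    records: iterable of (is_correct, reproduced) boolean pairs, one per trace.
    `reproduced` is True when the regenerated continuation gave the same
    final answer as the original trace. (Hypothetical data layout.)
    """
    correct = [float(rep) for ok, rep in records if ok]
    wrong = [float(rep) for ok, rep in records if not ok]
    if not correct or not wrong:
        raise ValueError("need at least one correct and one wrong trace")
    r_c = mean(correct)  # reproduction rate of correct traces
    r_w = mean(wrong)    # reproduction rate of wrong traces
    return r_c, r_w, r_c - r_w

# Toy data: 4 correct traces (3 reproduced), 4 wrong traces (1 reproduced).
records = [(True, True), (True, True), (True, True), (True, False),
           (False, True), (False, False), (False, False), (False, False)]
r_c, r_w, gap = discrimination_gap(records)
print(r_c, r_w, gap)
```

A gap near zero would suggest that reproducibility carries little information about correctness for that model and benchmark.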
Method
Truncate a Chain-of-Thought trace, regenerate its continuation, and score candidate answers by their reproducibility. Use these scores to weight votes in a majority voting scheme.
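The scoring step described above might look roughly like the following. This is a minimal sketch under stated assumptions: `regenerate` is a hypothetical callable that takes a trace prefix and returns the final answer of a freshly sampled continuation, and word-level truncation, `keep_fraction`, and `n_regen` are illustrative choices rather than settings taken from the paper.

```python
from typing import Callable

def prefix_consistency(trace: str, answer: str,
                       regenerate: Callable[[str], str],
                       keep_fraction: float = 0.5,
                       n_regen: int = 4) -> float:
    """Fraction of regenerated continuations that reproduce `answer`."""
    words = trace.split()
    # Keep the first part of the trace (assumption: word-level truncation).
    cut = max(1, int(len(words) * keep_fraction))
    prefix = " ".join(words[:cut])
    # Regenerate n_regen times and count matches with the original answer.
    hits = sum(regenerate(prefix) == answer for _ in range(n_regen))
    return hits / n_regen

def fake_regenerate(prefix: str) -> str:
    # Stand-in for a real model call, used only to make the example runnable:
    # it answers "42" when the prefix already contains the key step.
    return "42" if "6 * 7" in prefix else "41"

score_good = prefix_consistency(
    "compute 6 * 7 first then report the product", "42", fake_regenerate)
score_bad = prefix_consistency(
    "guess a number near forty and then report it", "40", fake_regenerate)
print(score_good, score_bad)
```

In a real setup, `regenerate` would wrap a sampling call to the model plus an answer extractor, and each call adds to the token budget, so `n_regen` trades signal quality against cost.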
In practice
- Implement PC-WMV with $w(c)=c^3$ for optimal performance on difficult problems.
- Focus on problems with a large discrimination gap ($D$) for maximum token efficiency gains.
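To make the cubic weighting concrete, here is a small sketch of a weighted vote in which each candidate contributes $w(c)=c^3$ to its answer's total. It assumes the votes are simply summed per answer; the function name and the `(answer, score)` input format are hypothetical, and the scores are made-up toy values.

```python
from collections import defaultdict

def weighted_majority_vote(candidates, weight=lambda c: c ** 3):
    """Return the answer with the largest summed weight.

    candidates: iterable of (answer, consistency_score) pairs, score in [0, 1].
    Ties resolve to the answer seen first (an arbitrary choice for this sketch).
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += weight(score)
    return max(totals, key=totals.get)

# Toy example: "41" has more votes, but its traces reproduce poorly.
candidates = [("41", 0.25), ("41", 0.25), ("41", 0.5),
              ("42", 1.0), ("42", 0.75)]

pc_winner = weighted_majority_vote(candidates)                         # cubic weights
mv_winner = weighted_majority_vote(candidates, weight=lambda c: 1.0)   # plain majority vote
print(pc_winner, mv_winner)
```

Cubing the score sharply discounts low-consistency traces: a score of 0.5 contributes only 0.125, which is why the less frequent but more reproducible answer wins in the toy data.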
Topics
- Prefix Consistency
- Chain-of-Thought
- Weighted Majority Voting
- Large Language Models
- Reasoning Benchmarks
Code references
- naoto-iwase/prefix-consistency
- facebookresearch/deepconf
- google-research/google-research
- Pranjal2041/AdaptiveConsistency
- Yiwei98/ESC
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.