Reliable Chain-of-Thought via Prefix Consistency

· Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

Prefix consistency, a technique introduced by Iwase et al., improves the accuracy and token efficiency of Large Language Models (LLMs) on reasoning tasks, particularly on difficult problems. It extends standard majority voting (self-consistency) by exploiting a "reproduction-rate asymmetry": when a Chain-of-Thought (CoT) trace is truncated and its continuation regenerated, a correct trace is more likely than an incorrect one to reproduce its original answer. Prefix consistency-weighted majority voting (PC-WMV) reweights candidate answers by this reliability signal and requires neither token log-probabilities nor self-rating prompts. Across five reasoning models and four math and science benchmarks, PC-WMV consistently outperforms existing weighted majority voting baselines as a correctness predictor, and it reaches the plateau accuracy of standard majority voting (Standard MV) with up to 21x fewer tokens (median 4.6x). The technique is most effective when there is a large "discrimination gap" ($D = r_C - r_W$) between the reproduction rates of correct answers ($r_C$) and wrong answers ($r_W$).
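
As a rough illustration of the discrimination gap, the sketch below estimates $r_C$, $r_W$, and $D$ from labeled traces. The function name `discrimination_gap` and the `(is_correct, reproduced)` record layout are hypothetical and chosen for this example; the paper's own estimation procedure may differ.

```python
from typing import Iterable, Tuple


def discrimination_gap(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Estimate (r_C, r_W, D) from labeled traces.

    Each record is (is_correct, reproduced): whether the trace's original
    answer was correct, and whether a regenerated continuation of its
    truncated prefix reproduced that answer. This layout is an assumption.
    """
    n_correct = n_wrong = hits_correct = hits_wrong = 0
    for is_correct, reproduced in records:
        if is_correct:
            n_correct += 1
            hits_correct += int(reproduced)
        else:
            n_wrong += 1
            hits_wrong += int(reproduced)
    if n_correct == 0 or n_wrong == 0:
        raise ValueError("need at least one correct and one wrong trace")
    r_c = hits_correct / n_correct  # reproduction rate of correct answers
    r_w = hits_wrong / n_wrong      # reproduction rate of wrong answers
    return r_c, r_w, r_c - r_w      # D = r_C - r_W
```

A value of D near zero would suggest that reproducibility carries little information about correctness for that model and benchmark, in which case reweighting is unlikely to help much.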

Key takeaway

For AI engineers and research scientists optimizing LLM reasoning: consider weighting majority votes by prefix consistency. The method reaches the plateau accuracy of standard majority voting with up to 21x fewer tokens and improves accuracy, particularly on challenging math and science benchmarks where plain self-consistency struggles. Votes are reweighted by how reliably a truncated CoT trace reproduces its original answer, so the benefit is greatest when the model shows a clear "discrimination gap" between the reproduction rates of correct and incorrect traces.

Key insights

Truncating and regenerating LLM Chain-of-Thought traces reveals a reliability signal for improved answer aggregation.

Principles

Method

Truncate a Chain-of-Thought trace, regenerate its continuation, and score candidate answers by their reproducibility. Use these scores to weight votes in a majority voting scheme.
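
The method described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `prefix_consistency_vote`, the `regenerate` callback (standing in for an LLM call that continues a prefix and returns a final answer), character-level truncation, and the use of the raw reproduction rate as the vote weight are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def prefix_consistency_vote(
    traces: List[Tuple[str, str]],
    regenerate: Callable[[str], str],
    truncate_frac: float = 0.5,
    n_regen: int = 4,
) -> Tuple[str, Dict[str, float]]:
    """Weighted majority vote over (cot_text, answer) pairs.

    Each trace votes for its own answer with a weight equal to the
    fraction of regenerations from its truncated prefix that reproduce
    that answer. `regenerate` is a hypothetical stand-in for a model call.
    """
    if not traces:
        raise ValueError("need at least one trace")
    scores: Dict[str, float] = defaultdict(float)
    for cot, answer in traces:
        # Assumption: truncate by characters; a real system would likely
        # truncate by tokens or reasoning steps.
        prefix = cot[: int(len(cot) * truncate_frac)]
        hits = sum(regenerate(prefix) == answer for _ in range(n_regen))
        scores[answer] += hits / n_regen
    winner = max(scores, key=scores.get)
    return winner, dict(scores)
```

Only sampled text is needed here, which matches the summary's point that the signal requires no token log-probabilities or self-rating prompts. The extra regenerations do cost tokens, so `n_regen` and `truncate_frac` are the knobs one would tune in such a sketch.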

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.