More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models
Summary
A study of reasoning-capable language models, including DeepSeek-R1 (671B), R1-distilled 7-8B models, and base models prompted with Chain-of-Thought (CoT), finds that position bias in multiple-choice QA scales with the length of the reasoning trajectory. On MMLU, ARC-Challenge, and GPQA, 12 of 13 reasoning-mode configurations show a positive partial correlation between reasoning length and Position Bias Score (PBS), ranging from 0.11 to 0.41 ($p<0.05$). A truncation intervention provides causal evidence: continuations resumed from later points in a trajectory are increasingly likely to shift toward position-preferred options (for R1-Qwen-7B, the rate rises from 16% to 32%). Aggregate PBS collapses to 0.019 at 671B, but the length effect persists in the longest quartile of trajectories (PBS = 0.071), which suggests that accuracy gates the expression of this bias rather than eliminating its underlying mechanism. Direct-answer position bias is identified as a distinct phenomenon that is uncorrelated with length; CoT reasoning replaces it with length-accumulated bias.
Key takeaway
When evaluating LLMs, do not assume that longer Chain-of-Thought (CoT) outputs are more order-invariant. Make permutation averaging standard practice for reasoning models used in judging or grading pipelines, and run length-controlled ablations when comparing reasoning and non-reasoning baselines, so that length-driven position bias does not distort the comparison.
Key insights
Longer reasoning trajectories in LLMs increase position bias, a critical factor for evaluation reliability.
Principles
- Position bias scales with reasoning trajectory length.
- Accuracy gates bias expression, not its mechanism.
- Direct-answer bias is distinct from length-accumulated bias.
Method
The study used matched model pairs and cyclic permutation of answer options, measured bias with the Position Bias Score (PBS) and Commitment Change Point (CCP) metrics, and applied a truncation intervention to establish causality.
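Cyclic permutation of answer options can be sketched as follows. This is a minimal illustration, not the study's code: the helper names are hypothetical, and `position_preference` is a simple stand-in proxy (the largest excess of any slot's selection rate over uniform), not the paper's PBS, whose exact definition is not given in this summary.

```python
from collections import Counter


def cyclic_permutations(options):
    """Return every rotation of the option list, paired with its shift."""
    n = len(options)
    return [(k, options[k:] + options[:k]) for k in range(n)]


def to_original_index(chosen_position, shift, n_options):
    """Map a position chosen in a rotated list back to the original index."""
    return (chosen_position + shift) % n_options


def position_preference(chosen_positions, n_options):
    """Illustrative proxy only (NOT the paper's PBS).

    Returns how far the most-selected slot's selection rate exceeds the
    uniform rate 1 / n_options; 0.0 means no slot is favoured.
    """
    counts = Counter(chosen_positions)
    total = len(chosen_positions)
    top_rate = max(counts.get(i, 0) / total for i in range(n_options))
    return top_rate - 1 / n_options
```

With four options, a shift of 1 presents the list as B, C, D, A, so a model that picks the last slot has selected the original option A. Tracking which slot is chosen across all rotations separates a preference for content from a preference for position.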
In practice
- Use permutation averaging for reasoning models.
- Perform length-controlled ablations in evaluations.
- Apply diagnostic tools (PBS, CCP) before deployment.
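Permutation averaging, the first recommendation above, might look like the sketch below. It assumes a hypothetical callable `ask_model(question, options)` that returns the 0-based position the model picked in the list it was shown; the majority vote over cyclic rotations is one simple aggregation choice, not necessarily the one used in the study.

```python
from collections import Counter


def permutation_averaged_answer(question, options, ask_model):
    """Query once per cyclic rotation and majority-vote over original indices.

    ask_model is a hypothetical callable (an assumption of this sketch):
    it receives the question and a rotated option list, and returns the
    0-based position it selected in that rotated list.
    """
    n = len(options)
    votes = Counter()
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        picked = ask_model(question, rotated)
        # Translate the picked slot back to the original option index.
        votes[(picked + shift) % n] += 1
    winner, count = votes.most_common(1)[0]
    # The agreement rate doubles as a cheap order-sensitivity diagnostic.
    return winner, count / n
```

An agreement rate near 1.0 indicates the answer follows the content regardless of ordering, while a rate near 1 / n indicates the model is tracking a slot. The cost is n model calls per question, which matters for long reasoning traces.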
Topics
- Position Bias
- Reasoning Trajectory Length
- Chain-of-Thought
- DeepSeek-R1
- Multiple-Choice QA
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.