More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A study of reasoning-capable language models, covering DeepSeek-R1 (671B), R1-distilled 7-8B models, and base models prompted with Chain-of-Thought (CoT), finds that position bias in multiple-choice QA grows with the length of the reasoning trajectory. In 12 of 13 reasoning-mode configurations across the MMLU, ARC-Challenge, and GPQA benchmarks, the partial correlation between reasoning length and Position Bias Score (PBS) is positive, ranging from 0.11 to 0.41 ($p<0.05$). A truncation intervention supports a causal reading: continuations resumed from later points in a trajectory are increasingly likely to shift toward position-preferred options (e.g., from 16% to 32% for R1-Qwen-7B). Although aggregate PBS collapses to 0.019 at 671B, the length effect persists in the longest quartile (PBS = 0.071), suggesting that accuracy gates the expression of this bias rather than eliminating its underlying mechanism. Direct-answer position bias is identified as a distinct phenomenon that is uncorrelated with length; CoT reasoning replaces it with length-accumulated bias.
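To make the "partial correlation" in the summary concrete, here is a minimal sketch of a first-order partial correlation in plain Python. The summary does not say which covariates the study controlled for, so the single control variable `z` (for example, a question-difficulty proxy) is an assumption, as are the function names.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(
    x: Sequence[float], y: Sequence[float], z: Sequence[float]
) -> float:
    """Correlation of x and y after removing the linear effect of z.

    x: reasoning length per item, y: bias measure per item,
    z: hypothetical control variable (an assumption, not from the study).
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))
```

When `z` is uncorrelated with both variables, the partial correlation reduces to the ordinary Pearson correlation; it differs only to the extent that the control variable explains shared variance.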

Key takeaway

For AI engineers and research scientists evaluating LLMs: do not assume that longer Chain-of-Thought (CoT) outputs are more order-invariant. Make permutation averaging standard practice for reasoning models used in judging or grading pipelines, and run length-controlled ablations when comparing reasoning and non-reasoning baselines, so that length-driven position bias does not distort the comparison.
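One possible shape for permutation averaging is sketched below: the model is queried once per cyclic rotation of the options, and votes are tallied by option content rather than by displayed position. The `ask` callable and the function name are hypothetical stand-ins for whatever model call a pipeline actually uses; this is an illustration under those assumptions, not the procedure from the paper.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def permutation_averaged_votes(
    question: str,
    options: Sequence[str],
    ask: Callable[[str, List[str]], int],
) -> Dict[str, float]:
    """Return the share of rotations in which each option was chosen.

    `ask` is a hypothetical callable: given the question and the options in
    displayed order, it returns the index of the option the model picked.
    """
    n = len(options)
    votes = Counter()
    for shift in range(n):
        # Cyclic rotation: options[shift] is displayed first.
        shown = list(options[shift:]) + list(options[:shift])
        chosen_text = shown[ask(question, shown)]
        votes[chosen_text] += 1  # count by content, not by position
    return {opt: votes[opt] / n for opt in options}
```

A model that tracks content puts all of its votes on one option, while a model that always picks the same displayed slot spreads its votes evenly. A flat vote distribution is therefore a signal that the answer was driven by position rather than content.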

Key insights

Longer reasoning trajectories in LLMs increase position bias in multiple-choice QA, which makes evaluation results less reliable when answer order is not controlled.

Principles

Method

The study used matched model pairs, cyclic permutation of answer options, and metrics like Position Bias Score (PBS) and Commitment Change Point (CCP), alongside a truncation intervention to establish causality.
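As a rough illustration of how cyclic permutation can expose a position preference, the sketch below rotates the options, records which displayed slot the model picks each time, and measures how far those picks are from uniform. This score is an illustrative stand-in and is not the paper's PBS, whose definition is not given in this summary; `ask` and both function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence


def rotate(options: Sequence[str], shift: int) -> List[str]:
    """Cyclically rotate the options so that options[shift] is shown first."""
    shift %= len(options)
    return list(options[shift:]) + list(options[:shift])


def position_concentration(
    options: Sequence[str],
    ask: Callable[[List[str]], int],
) -> float:
    """Illustrative position-preference score in [0, 1] (NOT the paper's PBS).

    The model is queried once per cyclic rotation. A model that tracks
    content picks each displayed slot exactly once across the rotations
    (score 0); a model that always picks the same slot scores 1.
    Requires at least two options.
    """
    n = len(options)
    slots = Counter(ask(rotate(options, s)) for s in range(n))
    # Total variation distance between chosen-slot frequencies and uniform.
    tv = 0.5 * sum(abs(slots.get(i, 0) / n - 1 / n) for i in range(n))
    return tv / (1 - 1 / n)  # normalise by the maximum possible distance
```

Cyclic rotations are a natural fit here because they place every option in every slot exactly once with only as many queries as there are options, rather than one query per full permutation.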

In practice

Topics

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.