More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models

2026-05-11 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A study of reasoning-capable language models, covering DeepSeek-R1 (671B), R1-distilled 7-8B models, and base models prompted with Chain-of-Thought (CoT), finds that position bias in multiple-choice QA grows with the length of the reasoning trajectory. On the MMLU, ARC-Challenge, and GPQA benchmarks, 12 of 13 reasoning-mode configurations show a positive partial correlation between reasoning length and Position Bias Score (PBS), ranging from 0.11 to 0.41 ($p<0.05$). A truncation intervention provides causal evidence: continuations started from later points in a trajectory are increasingly likely to shift toward position-preferred options (e.g., from 16% to 32% for R1-Qwen-7B). Aggregate PBS collapses to 0.019 at 671B, but the length effect persists in the longest quartile (PBS = 0.071), suggesting that accuracy gates the expression of this bias rather than eliminating its underlying mechanism. Direct-answer position bias is identified as a distinct phenomenon that is uncorrelated with length; CoT reasoning replaces it with length-accumulated bias.
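The partial correlation mentioned above can be illustrated with a minimal sketch. The summary does not say which variable the study controls for, so the control `z` here is an assumption (it could be, for example, a per-question difficulty estimate), and the function names are hypothetical. The code uses the standard first-order partial correlation formula rather than anything specific to the paper.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z.

    x: reasoning lengths, y: per-item bias scores,
    z: control variable (ASSUMPTION: not specified in the summary).
    """
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

When the control is uncorrelated with both variables, the partial correlation reduces to the ordinary Pearson correlation, which is a quick sanity check on any implementation.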

Key takeaway

AI engineers and research scientists evaluating LLMs should not assume that longer Chain-of-Thought (CoT) outputs are more order-invariant. Make permutation averaging standard practice for reasoning models used in judging or grading pipelines, and run length-controlled ablations when comparing reasoning and non-reasoning baselines, so that length-driven position bias does not distort the comparison.

Key insights

Longer reasoning trajectories in LLMs increase position bias, a critical factor for evaluation reliability.

Principles

Position bias scales with reasoning trajectory length.
Accuracy gates bias expression, not its mechanism.
Direct-answer bias is distinct from length-accumulated bias.
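One way to check whether a bias survives only in long trajectories is to bin responses by reasoning length before computing any bias measure. The sketch below is an illustration under assumptions: the record layout `(reasoning_length, chosen_position)` and both function names are hypothetical, and the first-position rate is a simple stand-in, not the paper's PBS.

```python
def split_by_length_quartile(records):
    """Split (reasoning_length, chosen_position) records into four bins,
    ordered from shortest to longest reasoning length."""
    ordered = sorted(records, key=lambda rec: rec[0])
    n = len(ordered)
    bounds = [n * q // 4 for q in range(5)]  # integer cut points
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(4)]

def first_position_rate(group):
    """Share of answers in a bin that landed on position 0 (illustrative proxy)."""
    if not group:
        return float("nan")
    return sum(1 for _, position in group if position == 0) / len(group)

# Toy data: the longest responses all pick position 0.
records = [(10, 1), (20, 2), (30, 3), (40, 1), (50, 2), (60, 0), (70, 0), (80, 0)]
rates = [first_position_rate(g) for g in split_by_length_quartile(records)]
```

Comparing the per-bin rates against the aggregate makes it visible when a small overall score hides a larger effect in the longest bin.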

Method

The study used matched model pairs and cyclic permutation of answer options, measured bias with metrics such as Position Bias Score (PBS) and Commitment Change Point (CCP), and applied a truncation intervention to establish causality.
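Cyclic permutation of answer options can be sketched as follows. The summary does not give the PBS formula, so `position_skew` below is an illustrative proxy (maximum deviation of per-position choice rates from uniform) and should not be read as the paper's metric. All function names are hypothetical.

```python
from collections import Counter

def cyclic_permutations(options):
    """Return every rotation of the option list (k rotations for k options)."""
    return [options[shift:] + options[:shift] for shift in range(len(options))]

def position_choice_rates(chosen_positions, num_options):
    """Fraction of answers that landed on each position index."""
    counts = Counter(chosen_positions)
    total = len(chosen_positions)
    return [counts.get(i, 0) / total for i in range(num_options)]

def position_skew(chosen_positions, num_options):
    """Illustrative proxy, NOT the paper's PBS: largest deviation of any
    position's choice rate from the uniform rate 1 / num_options."""
    rates = position_choice_rates(chosen_positions, num_options)
    uniform = 1 / num_options
    return max(abs(rate - uniform) for rate in rates)
```

Because every option visits every position exactly once across the rotations, a model that always selects the same content produces uniform position rates, so any skew reflects a preference for position rather than content.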

In practice

Use permutation averaging for reasoning models.
Perform length-controlled ablations in evaluations.
Apply diagnostic tools (PBS, CCP) before deployment.
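Permutation averaging can be sketched as a majority vote over all cyclic rotations of the options, with votes counted by option content rather than by position. This is a minimal illustration, not the study's implementation: `ask_model` is a hypothetical callable that takes a question and an option list and returns the index it chose, and the options are assumed to be distinct strings.

```python
from collections import Counter

def permutation_averaged_answer(question, options, ask_model):
    """Query the model once per cyclic rotation and vote by option content.

    ask_model(question, shown_options) -> index into shown_options
    (hypothetical interface; wrap your own model call to match it).
    Returns the winning option text and the full vote tally.
    """
    votes = Counter()
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        picked_index = ask_model(question, rotated)
        votes[rotated[picked_index]] += 1  # map the position back to content
    winner = votes.most_common(1)[0][0]
    return winner, dict(votes)
```

The cost is one model call per rotation, so for reasoning models with long trajectories it may be worth applying this only to the judging or grading step where order sensitivity matters most.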

Topics

Position Bias
Reasoning Trajectory Length
Chain-of-Thought
DeepSeek-R1
Multiple-Choice QA

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.