Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models
Summary
A study investigating Large Reasoning Models (LRMs) found that inference-time reasoning effort does not modulate the alignment between LRM token usage and human cognitive costs. Researchers tested GPT-OSS-20B and GPT-OSS-120B across three effort levels (low, medium, high) and six reasoning tasks, including arithmetic, formal logic, and relational reasoning. The alignment between LRM chain-of-thought (CoT) trace length and human reaction times remained invariant, with Bayes Factors leaning towards the null hypothesis and mean alignment being numerically near-identical across conditions. A manipulation check revealed that the "reasoning_effort" parameter acts as an upper budget for token generation rather than a real-time allocation dial. Arithmetic complexity contrasts further showed that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving this match. These findings suggest that cognitive cost alignment is a training-time achievement, robust to inference-time perturbations.
Key takeaway
For AI Scientists and Machine Learning Engineers developing or deploying LRMs: the alignment of model reasoning costs with human cognition is primarily determined during training, not by inference-time effort settings. Efforts to achieve human-like cognitive cost scaling should focus on training objectives such as reinforcement learning with verifiable rewards (RLVR), because inference-time adjustments to reasoning effort parameters will likely only set a token budget rather than dynamically reconfigure the model's problem-solving policy.
Key insights
Human-LRM cognitive cost alignment is a stable, training-time achievement, robust to inference-time effort changes.
Principles
- Alignment is a structural property, not a calibration-sensitive artifact.
- Model scale improves fidelity of tracking human difficulty patterns.
Method
Evaluated GPT-OSS-20B and GPT-OSS-120B under three reasoning effort conditions across six tasks. Measured within-task and cross-task Pearson correlations between log-transformed token counts and human RTs, using Bayesian paired-samples t-tests for effort invariance.
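The within-task alignment measure described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names, the item-level numbers, and the choice to log-transform only the token counts (the summary does not say whether RTs were also transformed) are all assumptions made for the example. The Bayesian paired-samples t-test is omitted.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def cost_alignment(token_counts, human_rts):
    """Correlate log token counts with human reaction times (assumed setup)."""
    log_tokens = [math.log(t) for t in token_counts]
    return pearson(log_tokens, human_rts)


# Hypothetical per-item data for one task: human RTs in milliseconds and
# mean CoT token counts at each effort level. None of these numbers come
# from the study; they only show the shape of the computation.
human_rts_ms = [820.0, 1150.0, 1490.0, 2100.0, 2750.0]
tokens_by_effort = {
    "low": [40, 65, 90, 150, 210],
    "medium": [55, 80, 130, 200, 330],
    "high": [70, 120, 170, 310, 480],
}

alignment = {
    effort: cost_alignment(tokens, human_rts_ms)
    for effort, tokens in tokens_by_effort.items()
}

for effort, r in alignment.items():
    print(f"{effort}: r = {r:.3f}")
```

Effort invariance would then be assessed by comparing these per-task alignment values across the three effort conditions, which is the role the Bayesian paired-samples t-tests play in the study.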
In practice
- Focus on training objectives for human-aligned cost allocation.
- Inference-time effort parameters act as budgets, not dynamic controls.
Topics
- Large Reasoning Models
- Cognitive Cost Alignment
- Chain-of-Thought
- Effort Invariance
- Arithmetic Cognition
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.