Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Social Sciences & Behavioral Studies · Depth: Expert, extended

Summary

A study investigating Large Reasoning Models (LRMs) found that inference-time reasoning effort does not modulate the alignment between LRM token usage and human cognitive costs. Researchers tested GPT-OSS-20B and GPT-OSS-120B across three effort levels (low, medium, high) and six reasoning tasks, including arithmetic, formal logic, and relational reasoning. The alignment between LRM chain-of-thought (CoT) trace length and human reaction times remained invariant, with Bayes Factors leaning towards the null hypothesis and mean alignment being numerically near-identical across conditions. A manipulation check revealed that the "reasoning_effort" parameter acts as an upper budget for token generation rather than a real-time allocation dial. Arithmetic complexity contrasts further showed that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving this match. These findings suggest that cognitive cost alignment is a training-time achievement, robust to inference-time perturbations.
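
The ceiling-versus-dial distinction can be illustrated with a toy sketch. This is not the study's manipulation check. The two policies, the per-item token demands, the caps, and the scale factors are all hypothetical values chosen for illustration. Under a ceiling, an item whose demand sits below the cap gets the same number of tokens at every effort level; under a dial, every item's token count moves with the setting.

```python
# Toy illustration only: hypothetical policies and numbers, not the paper's data.

def ceiling_policy(demand: int, cap: int) -> int:
    """Effort sets an upper budget: spend what the item needs, up to the cap."""
    return min(demand, cap)


def dial_policy(demand: int, scale: float) -> int:
    """Effort rescales allocation: every item's token count shifts with effort."""
    return round(demand * scale)


# Hypothetical token demand for an easy, a medium, and a hard item.
demands = [50, 200, 900]

# Hypothetical caps and scale factors for the three effort levels.
caps = {"low": 300, "medium": 1000, "high": 4000}
scales = {"low": 0.5, "medium": 1.0, "high": 2.0}

ceiling_tokens = {
    effort: [ceiling_policy(d, cap) for d in demands]
    for effort, cap in caps.items()
}
dial_tokens = {
    effort: [dial_policy(d, scale) for d in demands]
    for effort, scale in scales.items()
}

for effort in caps:
    print(effort, "ceiling:", ceiling_tokens[effort], "dial:", dial_tokens[effort])
```

In the ceiling case only the hardest item changes between low and medium effort, because it is the only one that hits the cap. That pattern, rather than a uniform shift, is what a budget-style parameter would be expected to produce.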

Key takeaway

AI Scientists and Machine Learning Engineers developing or deploying LRMs should treat the alignment of model reasoning costs with human cognition as something determined primarily during training, not by inference-time effort settings. Efforts to achieve human-like cognitive cost scaling should focus on reinforcement learning with verifiable rewards (RLVR) training objectives, because adjusting the reasoning effort parameter at inference time will likely only set a token budget rather than dynamically reconfigure the model's problem-solving policy.

Key insights

Human-LRM cognitive cost alignment is a stable, training-time achievement, robust to inference-time effort changes.

Method

Evaluated GPT-OSS-20B and GPT-OSS-120B under three reasoning effort conditions (low, medium, high) across six tasks. Measured within-task and cross-task Pearson correlations between log-transformed token counts and human reaction times (RTs), and tested effort invariance with Bayesian paired-samples t-tests.
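
The alignment measure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code. The function names, the per-item token counts, and the human RTs are all hypothetical. The sketch log-transforms only the token counts, which is one reading of the method description, and it omits the Bayesian paired-samples t-test.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def alignment(token_counts, human_rts):
    """Correlate log token counts with human RTs (assumption: RTs left untransformed)."""
    log_tokens = [math.log(t) for t in token_counts]
    return pearson(log_tokens, human_rts)


# Hypothetical mean human RTs (seconds) for five items of one task.
human_rts = [1.2, 1.9, 2.7, 3.4, 4.8]

# Hypothetical CoT token counts for the same items at each effort level.
tokens_by_effort = {
    "low": [40, 75, 130, 210, 400],
    "medium": [60, 110, 200, 330, 610],
    "high": [80, 150, 260, 420, 800],
}

alignment_by_effort = {
    effort: alignment(tokens, human_rts)
    for effort, tokens in tokens_by_effort.items()
}

for effort, r in alignment_by_effort.items():
    print(f"{effort}: r = {r:.3f}")
```

In this made-up data the token counts grow with effort while the correlations stay close to one another, which is the shape of result the summary describes: more tokens overall without a change in how well token usage tracks human difficulty.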

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.