Disable “Thinking,” Still Get Thousands of Tokens: What Instruct LLMs Are Doing
Summary
Large Language Models (LLMs) are typically categorized as either "Instruct" models, optimized for direct user responses, or "Reasoning/Thinking" models, designed to solve complex problems by generating extended intermediate reasoning. Thinking models are expected to cost more at inference time. In late 2025, however, a trend emerged in which Instruct models, such as Qwen3 4B Instruct 2507, showed significant performance jumps on difficult reasoning benchmarks like AIME and GPQA Diamond. The improvement is attributed to Instruct models generating thousands of tokens of implicit "thinking" (self-questioning, partial attempts) even though they are not explicitly in a "Thinking" mode. This behavior inflates inference costs for supposedly cheap Instruct models and distorts benchmark comparisons, because the reported scores hide the computational budget actually used and make true non-thinking performance difficult to assess.
Key takeaway
If you are a VP of Engineering or Data evaluating LLMs for production, account for the hidden "thinking" tokens that "Instruct" models generate. Inference costs for these models can be 10x to 20x higher than expected, and benchmark scores may not reflect true efficiency. Cap generation length and monitor actual token usage to prevent unexpected cost overruns and to keep performance assessments accurate.
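To make the 10x to 20x figure concrete, here is a minimal cost sketch. Every number in it (request volume, token counts, price per million output tokens) is a hypothetical placeholder chosen for illustration, not a figure from the original article or from any provider's price list.

```python
def monthly_output_cost(requests_per_month: int,
                        tokens_per_response: float,
                        usd_per_million_tokens: float) -> float:
    """Estimate monthly spend on generated (output) tokens only."""
    total_tokens = requests_per_month * tokens_per_response
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical inputs: 1M requests/month at $0.50 per million output tokens.
# "expected" assumes short, direct answers; "observed" assumes the model
# silently reasons for thousands of tokens before answering.
expected = monthly_output_cost(1_000_000, 300, 0.50)
observed = monthly_output_cost(1_000_000, 4_500, 0.50)

print(f"expected: ${expected:,.2f}")
print(f"observed: ${observed:,.2f}")
print(f"overrun:  {observed / expected:.0f}x")
```

Under these assumed inputs the overrun lands inside the 10x to 20x range quoted above. The useful part is the shape of the calculation: the budget scales linearly with tokens per response, so measure that value rather than assuming it.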
Key insights
Many "Instruct" LLMs implicitly use significant reasoning budgets, skewing benchmark results and increasing inference costs.
Principles
- Benchmark accuracy alone is misleading without token generation data.
- Implicit reasoning budgets vary widely across "Instruct" models.
Method
AIME-Instruct reuses AIME problem sets with new prompts and rules designed to isolate non-thinking behavior, then compares models under varying "thinking budgets," including uncapped generation.
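The idea of comparing models under different budgets can be approximated offline from uncapped runs. The sketch below is not the AIME-Instruct implementation. It assumes you have logged, per problem, how many tokens the model generated and whether the final answer was correct, and it makes the simplifying assumption that a response cut off before it finishes counts as wrong. The `Run` record and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Run:
    tokens_generated: int  # tokens emitted when generation was uncapped
    correct: bool          # whether the final answer was right


def accuracy_at_budget(runs: Sequence[Run], budget: Optional[int]) -> float:
    """Accuracy if generation were capped at `budget` tokens.

    Assumption: a response that needs more tokens than the budget is
    truncated and scored as incorrect. `budget=None` means uncapped.
    """
    if not runs:
        return 0.0
    solved = sum(
        1 for r in runs
        if r.correct and (budget is None or r.tokens_generated <= budget)
    )
    return solved / len(runs)


# Hypothetical logged results for four problems.
runs = [Run(200, True), Run(3500, True), Run(9000, True), Run(400, False)]

for budget in (512, 4096, None):
    label = "uncapped" if budget is None else f"{budget} tokens"
    print(f"{label:>12}: {accuracy_at_budget(runs, budget):.2f}")
```

Reporting accuracy next to the budget that produced it shows how much of a model's score depends on long generations.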
In practice
- Cap maximum generation length to control inference spend.
- Monitor token generation for "Instruct" models in production.
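Both practices above can be combined in a small monitor. This is a sketch under stated assumptions: the class name, thresholds, and sample counts are hypothetical, and the completion token counts are assumed to come from whatever usage metadata your serving stack reports. The cap itself would be enforced through your inference API's maximum-generation setting, which is not shown here.

```python
class TokenUsageMonitor:
    """Tracks completion lengths against a generation cap and a mean-token alert."""

    def __init__(self, max_new_tokens: int, alert_mean_tokens: float):
        self.max_new_tokens = max_new_tokens        # cap applied at request time
        self.alert_mean_tokens = alert_mean_tokens  # mean length that triggers an alert
        self.counts: list[int] = []

    def record(self, completion_tokens: int) -> None:
        """Store the completion token count reported for one response."""
        self.counts.append(completion_tokens)

    def summary(self) -> dict:
        n = len(self.counts)
        if n == 0:
            return {"count": 0, "mean_tokens": 0.0,
                    "hit_cap_fraction": 0.0, "over_budget": False}
        mean_tokens = sum(self.counts) / n
        # Responses that reached the cap were probably truncated mid-reasoning.
        hit_cap = sum(1 for c in self.counts if c >= self.max_new_tokens)
        return {
            "count": n,
            "mean_tokens": mean_tokens,
            "hit_cap_fraction": hit_cap / n,
            "over_budget": mean_tokens > self.alert_mean_tokens,
        }


# Hypothetical usage: a 2048-token cap, alerting when the mean exceeds 500.
monitor = TokenUsageMonitor(max_new_tokens=2048, alert_mean_tokens=500)
for completion_tokens in (120, 180, 2048, 2048):
    monitor.record(completion_tokens)

report = monitor.summary()
print(report)
```

A high fraction of responses that hit the cap suggests the model is spending its budget on implicit reasoning and being cut off, which is worth investigating before raising the cap.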
Topics
- LLM Reasoning
- Instruct Models
- Inference Cost
- Benchmark Evaluation
- Token Budget
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.