Disable “Thinking,” Still Get Thousands of Tokens: What Instruct LLMs Are Doing

2026-03-02 · Source: The Kaitchup – AI on a Budget · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, quick

Summary

Large language models (LLMs) are usually split into two categories: "Instruct" models, optimized to answer users directly, and "Reasoning" or "Thinking" models, designed to solve complex problems through extensive intermediate generation. Thinking models are expected to cost more at inference time. In late 2025, however, Instruct models such as Qwen3 4B Instruct 2507 began showing large performance jumps on difficult reasoning benchmarks like AIME and GPQA Diamond. The improvement is attributed to Instruct models silently generating thousands of "thinking" tokens (self-questioning, partial attempts) even though they are not in an explicit "Thinking" mode. This behavior inflates inference costs for models that are supposed to be cheap, and it distorts benchmark comparisons by hiding the token budget actually spent, which makes true non-thinking performance hard to assess.

Key takeaway

If you are a VP of Engineering or Data evaluating LLMs for production, account for the hidden "thinking" tokens that "Instruct" models generate. Inference costs for these models can be 10x to 20x higher than expected, and benchmark scores may not reflect true efficiency. Cap generation length and monitor actual token usage to prevent unexpected cost overruns and keep performance assessments accurate.
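To make the cost multiplier concrete, here is a minimal arithmetic sketch. Every number in it (request volume, tokens per response, price) is a hypothetical placeholder rather than a figure from the article, and `monthly_output_cost` is an illustrative helper, not part of any library.

```python
def monthly_output_cost(requests: int,
                        output_tokens_per_request: float,
                        usd_per_million_output_tokens: float) -> float:
    """Return the monthly spend on output tokens, in USD."""
    total_tokens = requests * output_tokens_per_request
    return total_tokens / 1_000_000 * usd_per_million_output_tokens


# All values below are assumptions chosen for illustration only.
REQUESTS = 1_000_000   # requests per month
PRICE = 0.50           # USD per million output tokens

# Budget planned around short, direct answers.
planned = monthly_output_cost(REQUESTS, 300, PRICE)

# Spend if the model actually emits long reasoning before answering.
observed = monthly_output_cost(REQUESTS, 4_500, PRICE)

multiplier = observed / planned
print(f"planned=${planned:.2f} observed=${observed:.2f} ({multiplier:.0f}x)")
```

With these placeholder inputs the observed spend is 15 times the planned spend, which falls inside the 10x to 20x range quoted above. Substituting measured token counts from your own traffic gives the real figure.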

Key insights

Many "Instruct" LLMs implicitly use significant reasoning budgets, skewing benchmark results and increasing inference costs.

Principles

Benchmark accuracy alone is misleading without token generation data.
Implicit reasoning budgets vary widely across "Instruct" models.

Method

AIME-Instruct evaluates LLMs by reusing AIME problem sets with new prompts and rules designed to isolate non-thinking behavior. It compares models under a range of "thinking budgets" that includes uncapped generation.
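The idea of comparing a model under several budgets can be sketched as follows. This is not the AIME-Instruct protocol itself: it is a simplified post-hoc approximation that assumes any response longer than the budget would have been cut off and scored as wrong. The `Attempt` record and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Attempt:
    """One model response to one problem (hypothetical record shape)."""
    correct: bool
    tokens_generated: int


def accuracy_under_budget(attempts: Sequence[Attempt],
                          budget: Optional[int] = None) -> float:
    """Fraction of attempts that are correct AND fit within `budget` tokens.

    `budget=None` means uncapped generation. Treating over-budget answers
    as failures is an assumption of this sketch; a real capped run would
    truncate generation and could behave differently.
    """
    if not attempts:
        return 0.0
    solved = sum(
        1 for a in attempts
        if a.correct and (budget is None or a.tokens_generated <= budget)
    )
    return solved / len(attempts)


def mean_tokens(attempts: Sequence[Attempt]) -> float:
    """Average number of generated tokens per attempt."""
    if not attempts:
        return 0.0
    return sum(a.tokens_generated for a in attempts) / len(attempts)


# Made-up sample data, for illustration only.
attempts = [
    Attempt(correct=True, tokens_generated=120),
    Attempt(correct=True, tokens_generated=3_800),
    Attempt(correct=False, tokens_generated=9_000),
    Attempt(correct=True, tokens_generated=6_400),
]

for budget in (512, 4_096, None):
    label = "uncapped" if budget is None else str(budget)
    print(f"budget={label}: accuracy={accuracy_under_budget(attempts, budget):.2f}")
print(f"mean tokens per attempt: {mean_tokens(attempts):.0f}")
```

Reporting accuracy next to the budget and the mean token count keeps the two principles above visible: the same model can look very different depending on how many tokens it was allowed to spend.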

In practice

Cap maximum generation length to control inference spend.
Monitor token generation for "Instruct" models in production.
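The two practices above can be combined in a small usage tracker. The sketch below is library-free and hypothetical: `TokenUsageMonitor` and its field names are invented for illustration, and the cap itself would be enforced through whatever maximum-generation setting your serving stack exposes. The tracker only records what each response actually used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TokenUsageMonitor:
    """Tracks generated-token counts against a configured cap (hypothetical helper)."""
    max_new_tokens: int
    counts: List[int] = field(default_factory=list)

    def record(self, tokens_generated: int) -> None:
        """Store the number of tokens one response generated."""
        self.counts.append(tokens_generated)

    def mean_tokens(self) -> float:
        """Average generated tokens per response seen so far."""
        return sum(self.counts) / len(self.counts) if self.counts else 0.0

    def cap_hit_rate(self) -> float:
        """Share of responses that reached the cap and were likely truncated."""
        if not self.counts:
            return 0.0
        hits = sum(1 for n in self.counts if n >= self.max_new_tokens)
        return hits / len(self.counts)


# Made-up token counts, for illustration only.
monitor = TokenUsageMonitor(max_new_tokens=1_024)
for n in (90, 1_024, 310, 1_024):
    monitor.record(n)

print(f"mean tokens: {monitor.mean_tokens():.0f}")
print(f"cap hit rate: {monitor.cap_hit_rate():.0%}")
```

A high cap hit rate suggests the model is trying to reason at length and being cut off, which is a signal to check answer quality at that cap rather than simply raising it.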

Topics

LLM Reasoning
Instruct Models
Inference Cost
Benchmark Evaluation
Token Budget

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.