Reasoning Budgets vs. Structured CoT: Controlling LLM Thinking Tokens

Source: The Kaitchup – AI on a Budget · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, quick

Summary

An analysis evaluates Qwen3.6 27B, Qwen3.5 27B, and Gemma 4 31B, focusing on a practical serving problem: the reasoning traces that large language models (LLMs) emit before answering. These intermediate outputs can exceed 100k tokens for Qwen3.5/3.6, compared with typically under 20k for Gemma 4, and every one of those tokens adds latency, KV-cache memory, and cost. Many strong open models lack native controls for these "thinking tokens." The evaluation explores two decoding-time methods that impose constraints without retraining: enforcing a reasoning budget by injecting a `</thought>` tag, and constraining the trace with a Backus-Naur Form (BNF) grammar. The study tests Qwen3.6 27B on coding, hard math, and hard science multiple-choice benchmarks to measure token reduction, accuracy impact, and whether constrained thinking beats both the "thinking on" and "thinking off" settings.

Key takeaway

For MLOps engineers managing LLM inference costs, controlling reasoning traces is crucial: unchecked "thinking tokens" can drastically increase operating cost and latency. Consider decoding-time controls, such as a forced reasoning budget or a BNF grammar, in serving frameworks like vLLM or llama.cpp. Evaluate the trade-off between token efficiency and accuracy carefully, because both methods produce traces unlike those the model was trained on, which can hurt performance.
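
The budget idea can be sketched independently of any serving framework. The toy loop below is a minimal illustration, not the article's implementation: `generate_with_budget`, `toy_model`, and the `<eos>` marker are hypothetical names, and the model is stubbed out as a function that returns one token at a time. Once the count of thinking tokens reaches the budget, the loop appends the closing tag itself and lets the model continue with its answer.

```python
from typing import Callable, List

CLOSE_TAG = "</thought>"  # tag named in the summary; a real model may use a different one
EOS = "<eos>"             # hypothetical end-of-sequence marker for this toy setup


def generate_with_budget(next_token: Callable[[List[str]], str],
                         budget: int,
                         max_answer_tokens: int = 64) -> List[str]:
    """Decode with a cap on thinking tokens (illustrative sketch only)."""
    tokens: List[str] = []
    thinking_tokens = 0

    # Phase 1: reasoning trace, cut off once the budget is spent.
    while True:
        if thinking_tokens >= budget:
            tokens.append(CLOSE_TAG)  # forced: the model did not choose this token
            break
        tok = next_token(tokens)
        if tok == EOS:
            return tokens
        tokens.append(tok)
        if tok == CLOSE_TAG:          # the model closed its trace on its own
            break
        thinking_tokens += 1

    # Phase 2: the final answer, generated after the (possibly forced) close tag.
    for _ in range(max_answer_tokens):
        tok = next_token(tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens


def toy_model(tokens: List[str]) -> str:
    """Stand-in for a real model: 'thinks' for 10 tokens, then answers '42'."""
    if CLOSE_TAG not in tokens:
        return "hmm" if len(tokens) < 10 else CLOSE_TAG
    return "42" if tokens[-1] == CLOSE_TAG else EOS


print(generate_with_budget(toy_model, budget=3))
# ['hmm', 'hmm', 'hmm', '</thought>', '42']
```

With a budget larger than the trace the toy model wants (for example `budget=100`), the loop never intervenes and the trace closes naturally. In a real deployment the same logic would have to live inside the sampler of the serving framework, and the answer produced after a forced close is exactly the off-distribution case the takeaway warns about.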

Key insights

Controlling LLM reasoning traces via budget or grammar can reduce token cost but risks accuracy.

Principles

Method

Impose a reasoning budget by forcing a closing tag, or constrain the trace with a BNF grammar; both are applied at decoding time and require no retraining.
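
To make the grammar option concrete, here is a small sketch of what a constrained trace format could look like. The grammar is purely illustrative and is not the one used in the article; the opening `<thought>` tag, the bullet-step format, and the limits `MAX_STEPS` and 80 characters per step are all assumptions. Because a real grammar-constrained decoder depends on the serving framework, the code only checks whether a finished trace conforms, using a regular expression that accepts the same language.

```python
import re

# Illustrative BNF for a compact reasoning trace (hypothetical format):
TRACE_BNF = r"""
<trace> ::= "<thought>" "\n" <steps> "</thought>"
<steps> ::= <step> | <step> <steps>        (capped at MAX_STEPS below)
<step>  ::= "- " <text> "\n"
<text>  ::= 1 to 80 characters, none of them a newline
"""

MAX_STEPS = 5                   # assumed cap on the number of reasoning steps
STEP = r"- [^\n]{1,80}\n"       # one short bullet per line

# Regular expression equivalent to the grammar above, with the step cap applied.
TRACE_RE = re.compile(rf"<thought>\n(?:{STEP}){{1,{MAX_STEPS}}}</thought>")


def conforms(trace: str) -> bool:
    """Return True if the whole trace matches the illustrative grammar."""
    return TRACE_RE.fullmatch(trace) is not None


structured = "<thought>\n- parse the input\n- check edge cases\n</thought>"
free_form = "<thought>\nLet me think about this for a while...\n</thought>"

print(conforms(structured))  # True
print(conforms(free_form))   # False
```

A grammar-constrained decoder enforces the same rules during generation by masking every token that would leave the grammar, so the trace cannot grow past the cap. The trade-off is the one noted above: the model is pushed into a trace shape it was not trained to produce.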

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.