Adaptive Thinking: Large Language Models Know When to Think in Latent Space

2026-04-29 · Source: Apple Machine Learning Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Sonata (Self-Consistency-Guided Adapter for Thinking Allocation) is a lightweight method for improving the performance-efficiency tradeoff in large language models (LLMs) by adaptively allocating thinking budgets. It uses self-consistency, the agreement among multiple reasoning paths, as an indicator of how much extended reasoning a query needs. An adapter is trained offline on a calibration dataset to predict self-consistency from the last-layer hidden representations produced during the query prefilling stage. At inference time, this prediction guides on-the-fly budget allocation before the LLM begins its chain-of-thought (CoT) reasoning. The adapter is general, transfers across tasks, and adds negligible computational overhead. In experiments with models such as Qwen3-8B and GPT-OSS-120B on benchmarks such as GSM8K and MATH500, Sonata reduces thinking tokens by 20% to 80% at equivalent accuracy, or improves accuracy by up to 5% at the same token cost.

Key takeaway

For NLP engineers optimizing LLM inference costs, Sonata offers a practical way to cut thinking-token usage without sacrificing accuracy. By allocating thinking budgets according to predicted self-consistency, it reports token savings of 20-80% at equivalent accuracy, or accuracy gains of up to 5% at the same token cost. Consider it for CoT-enabled LLM deployments where inference cost matters.

Key insights

Self-consistency can predict thinking necessity, enabling adaptive allocation of LLM reasoning budgets.
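
To make the signal concrete, here is a minimal sketch of one common way to score self-consistency: sample several answers to the same query and measure the share that agree with the most frequent one. The function name and the majority-agreement definition are assumptions for illustration; the summary does not state the exact metric Sonata uses.

```python
from collections import Counter


def self_consistency(answers):
    """Return the fraction of sampled answers that match the most common answer.

    Assumption: agreement is measured by exact match on the final answer string.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) yields a single (answer, count) pair for the majority answer.
    _, top_count = counts.most_common(1)[0]
    return top_count / len(answers)


# Hypothetical final answers from four sampled reasoning paths for one query.
sampled = ["42", "42", "41", "42"]
score = self_consistency(sampled)
```

A score near 1.0 means the sampled paths mostly agree; a low score means they diverge, which the article treats as a sign that the query needs more thinking.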

Principles

Lower self-consistency indicates higher thinking necessity.
Adaptive budget allocation optimizes performance-efficiency.
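
The two principles can be combined into a simple allocation rule. The sketch below assumes a linear mapping from predicted self-consistency to a token budget; the function name, the linear form, and the 256 and 4096 token bounds are hypothetical choices for illustration, not values from the article.

```python
def thinking_budget(predicted_consistency, min_tokens=256, max_tokens=4096):
    """Map a predicted self-consistency score in [0, 1] to a thinking-token budget.

    Assumption: budget falls linearly from max_tokens (score 0) to min_tokens (score 1).
    """
    # Clamp so an out-of-range prediction cannot produce a budget outside the bounds.
    c = min(1.0, max(0.0, predicted_consistency))
    return round(max_tokens - c * (max_tokens - min_tokens))


easy_budget = thinking_budget(0.95)  # high predicted agreement, small budget
hard_budget = thinking_budget(0.20)  # low predicted agreement, large budget
```

Other mappings, such as a fixed set of budget tiers, would satisfy the same principles; the linear form is only the simplest to read.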

Method

Train an adapter offline to predict self-consistency from LLM hidden representations during prefilling, then use this prediction to guide on-the-fly thinking budget allocation.
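
The offline training step might look like the following sketch, which fits a small logistic-regression probe to predict self-consistency from hidden-state vectors. Everything here is an assumption for illustration: the article does not describe the adapter's architecture or loss, and the two-dimensional feature vectors and target scores are toy stand-ins for real last-layer prefill representations and measured self-consistency on a calibration set.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, hidden):
    """Predicted self-consistency in (0, 1) for one hidden-state vector."""
    return sigmoid(sum(w * h for w, h in zip(weights, hidden)) + bias)


def train_adapter(features, targets, epochs=500, lr=0.5):
    """Fit a logistic probe with per-example gradient steps on cross-entropy loss."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            # For a sigmoid output with cross-entropy loss, the gradient
            # with respect to the logit is (prediction - target).
            err = predict(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# Toy calibration set (hypothetical): hidden-state stand-ins and their
# measured self-consistency scores.
calib_hidden = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
calib_scores = [0.9, 0.8, 0.2, 0.3]

weights, bias = train_adapter(calib_hidden, calib_scores)
p_easy = predict(weights, bias, [1.0, 0.0])
p_hard = predict(weights, bias, [0.0, 1.0])
```

At inference time, the prediction would be computed once from the prefill representation and passed to whatever budget rule the deployment uses, before any chain-of-thought tokens are generated.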

In practice

Integrate Sonata to cut thinking tokens by 20-80% at equivalent accuracy.
Apply Sonata to improve LLM accuracy by up to 5% at the same token cost.

Topics

Large Language Models
Chain-of-Thought Reasoning
Self-Consistency
Adaptive Thinking Allocation
Sonata Adapter

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.