Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

2026-04-16 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new method treats test-time compute allocation for large language models (LLMs) under a finite inference budget as a constrained optimization problem: maximize expected accuracy subject to a cap on average compute. The proposed two-stage "Solve-then-Learn" pipeline first applies Lagrangian relaxation, which decomposes the global constraint into per-instance sub-problems, each with a closed-form oracle action that trades accuracy against cost. Because the induced cost is monotone in the dual variable, binary search over that variable hits the target budget exactly. The second stage trains a lightweight classifier to predict the oracle actions from inexpensive input features, which makes real-time deployment possible. In experiments on the MATH and GSM8K datasets with DeepSeek-V3, GPT-4o-mini, and Qwen2.5-7B, the method improves accuracy on MATH by up to 12.8% (relative) over uniform and heuristic baselines, and the learned classifier imitates the Lagrangian oracle with over 91% accuracy.
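The formulation described above can be written out as follows. This is a sketch in assumed notation, not the paper's own: x is an input, a is a compute action drawn from a finite set A, p(x, a) is the expected accuracy, c(x, a) is the cost, B is the average budget, and λ ≥ 0 is the dual variable.

```latex
\begin{align}
  \max_{\pi}\; & \mathbb{E}_{x}\!\left[\, p\big(x, \pi(x)\big) \right]
  \quad \text{s.t.} \quad \mathbb{E}_{x}\!\left[\, c\big(x, \pi(x)\big) \right] \le B \\
  \mathcal{L}(\pi, \lambda) \;=\; & \mathbb{E}_{x}\!\left[\, p\big(x, \pi(x)\big) - \lambda\, c\big(x, \pi(x)\big) \right] + \lambda B \\
  a^{*}_{\lambda}(x) \;=\; & \arg\max_{a \in A}\; p(x, a) - \lambda\, c(x, a) \\
  C(\lambda) \;=\; & \mathbb{E}_{x}\!\left[\, c\big(x, a^{*}_{\lambda}(x)\big) \right]
  \quad \text{is non-increasing in } \lambda
\end{align}
```

For a fixed λ the expectation in the Lagrangian separates across inputs, which is why each instance can be solved on its own. Raising λ makes compute more expensive for every instance at once, so the average cost C(λ) can only fall, and that is the property binary search relies on.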

Key takeaway

For MLOps engineers deploying LLMs with test-time compute scaling, this method offers a principled way to allocate resources. Consider a "Solve-then-Learn" pipeline that assigns compute per input from cheap input features; the reported gains reach up to 12.8% relative accuracy on MATH at the same average budget. The approach replaces uniform or heuristic allocation with a rule derived from the budget constraint itself.

Key insights

Test-time compute allocation can be cast as a constrained policy problem: maximize expected accuracy subject to an average compute budget.

Principles

Decompose global constraints into per-instance sub-problems.
Cost that is monotone in the dual variable enables exact budget targeting via binary search.
Imitation learning can amortize optimal allocation rules.
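The first two principles can be illustrated with a minimal Python sketch. It assumes a small finite set of compute levels and a table of estimated accuracies per instance and level; the function names, the toy numbers, and the tie-breaking rule are illustrative assumptions, not the paper's implementation. With discrete actions the achievable average cost is a step function of the price, so this sketch returns the smallest price whose spend does not exceed the budget.

```python
from typing import List, Sequence, Tuple


def oracle_actions(acc: Sequence[Sequence[float]],
                   cost: Sequence[float],
                   lam: float) -> List[int]:
    """Per instance, pick the action maximizing acc - lam * cost.

    Ties are broken toward the cheaper action (an assumption of this sketch).
    """
    actions = []
    for row in acc:
        best = max(range(len(cost)),
                   key=lambda k: (row[k] - lam * cost[k], -cost[k]))
        actions.append(best)
    return actions


def mean_cost(actions: Sequence[int], cost: Sequence[float]) -> float:
    """Average compute spent under the chosen actions."""
    return sum(cost[k] for k in actions) / len(actions)


def solve_for_budget(acc: Sequence[Sequence[float]],
                     cost: Sequence[float],
                     budget: float,
                     lam_hi: float = 1e3,
                     iters: int = 60) -> Tuple[float, List[int]]:
    """Binary-search the price lam; valid because mean cost is non-increasing in lam.

    Assumes lam_hi is large enough that the resulting allocation fits the budget.
    """
    lo, hi = 0.0, lam_hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_cost(oracle_actions(acc, cost, mid), cost) > budget:
            lo = mid  # over budget: compute must be priced higher
        else:
            hi = mid  # within budget: try a lower price
    return hi, oracle_actions(acc, cost, hi)


# Hypothetical toy data: rows are instances, columns are compute levels.
acc = [
    [0.90, 0.92, 0.93],  # easy: extra compute barely helps
    [0.20, 0.60, 0.90],  # hard: extra compute helps a lot
    [0.50, 0.55, 0.80],
    [0.95, 0.95, 0.95],  # saturated: extra compute is wasted
]
cost = [1.0, 4.0, 16.0]  # e.g. number of sampled reasoning chains

lam, actions = solve_for_budget(acc, cost, budget=6.0)
```

In this toy case the hard instance keeps the largest compute level while the others drop to the cheapest one, which is the kind of reallocation a single shared price induces.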

Method

A two-stage "Solve-then-Learn" pipeline uses Lagrangian relaxation for optimal per-instance compute pricing, followed by training a classifier to predict these optimal actions from cheap input features for real-time deployment.
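The second stage can be sketched in the same spirit. The source says only that a lightweight classifier predicts oracle actions from cheap input features; the nearest-centroid classifier below is a stand-in chosen for brevity, and the two features (for example, normalized prompt length and symbol density) and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def fit_centroids(features: Sequence[Sequence[float]],
                  labels: Sequence[int]) -> Dict[int, List[float]]:
    """Average the feature vectors of each oracle action (one centroid per action)."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for j, v in enumerate(x):
            sums[y][j] += v
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids: Dict[int, List[float]], x: Sequence[float]) -> int:
    """Return the action whose centroid is closest in squared Euclidean distance."""
    def sqdist(c: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sqdist(centroids[y]))


def imitation_accuracy(centroids: Dict[int, List[float]],
                       features: Sequence[Sequence[float]],
                       labels: Sequence[int]) -> float:
    """Fraction of instances where the classifier matches the oracle action."""
    hits = sum(predict(centroids, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


# Hypothetical training set: cheap features paired with oracle actions from stage one.
train_x = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
train_y = [0, 0, 2, 2]

model = fit_centroids(train_x, train_y)
new_action = predict(model, [0.7, 0.95])  # chosen without calling the LLM
```

At serving time only the feature extraction and the classifier run before the LLM call, so the allocation decision adds little overhead. Any small supervised model could take the place of the centroid rule here.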

In practice

Apply to LLM reasoning tasks like MATH and GSM8K.
Improve accuracy under fixed compute budgets.
Deploy lightweight classifiers for real-time allocation.

Topics

Adaptive Compute Allocation
Large Language Models
Constrained Policy Optimization
Lagrangian Relaxation
Test-Time Compute Scaling

Best for: Research Scientist, MLOps Engineer, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.