Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
Summary
A new method addresses how to allocate test-time compute for large language models (LLMs) under a finite inference budget. It formalizes the task as a constrained optimization problem: maximize expected accuracy subject to an average compute budget. The proposed two-stage "Solve-then-Learn" pipeline first uses Lagrangian relaxation to decompose the global constraint into per-instance sub-problems, each with a closed-form oracle action that optimally trades accuracy against cost. Because the induced cost is monotone in the dual variable, this stage can use binary search to hit the budget exactly. The second stage trains a lightweight classifier to predict the oracle actions from inexpensive input features, enabling real-time deployment. Experiments on MATH and GSM8K with DeepSeek-V3, GPT-4o-mini, and Qwen2.5-7B show up to a 12.8% relative accuracy improvement on MATH over uniform and heuristic baselines, while the learned classifier reaches over 91% imitation accuracy against the Lagrangian oracle.
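The formulation described above can be written out as follows. This is a sketch under assumed notation, not the paper's own: $N$ instances, a finite action set $\mathcal{A}$ (for example, candidate sample counts), $p_i(a)$ for the estimated probability that instance $i$ is answered correctly under action $a$, $c(a)$ for the cost of that action, $B$ for the average budget, and $\lambda$ for the dual variable.

```latex
\begin{align*}
& \max_{a_1,\dots,a_N \in \mathcal{A}} \; \frac{1}{N}\sum_{i=1}^{N} p_i(a_i)
  \quad \text{s.t.} \quad \frac{1}{N}\sum_{i=1}^{N} c(a_i) \le B, \\
& \mathcal{L}(a,\lambda) = \frac{1}{N}\sum_{i=1}^{N}\bigl[\,p_i(a_i) - \lambda\, c(a_i)\bigr] + \lambda B,
  \qquad \lambda \ge 0, \\
& a_i^{*}(\lambda) \in \arg\max_{a \in \mathcal{A}} \; \bigl[\,p_i(a) - \lambda\, c(a)\bigr], \\
& C(\lambda) = \frac{1}{N}\sum_{i=1}^{N} c\bigl(a_i^{*}(\lambda)\bigr)
  \quad \text{is non-increasing in } \lambda .
\end{align*}
```

For a fixed $\lambda$ the sum separates across instances, which is why each instance can be solved on its own. Raising $\lambda$ makes compute more expensive, so no instance ever switches to a costlier action, and a binary search over $\lambda$ can locate the smallest value whose induced cost $C(\lambda)$ fits within $B$.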
Key takeaway
For MLOps engineers deploying LLMs with test-time compute scaling, this method offers a principled way to optimize resource allocation. Consider implementing a "Solve-then-Learn" pipeline that assigns compute per input according to its estimated difficulty; the reported gains reach up to a 12.8% relative accuracy improvement on MATH at the same average budget. The approach is a principled alternative to uniform or heuristic allocation strategies.
Key insights
Optimizing LLM test-time compute involves balancing accuracy and cost via a constrained policy.
Principles
- Decompose global constraints into per-instance sub-problems.
- Cost that is monotone in the dual variable lets binary search hit the budget exactly.
- Imitation learning can amortize optimal allocation rules.
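The first two principles can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function names, the accuracy table `acc_table` (one row of estimated accuracies per instance, one column per action), and the cost vector `cost` are hypothetical and do not come from the original article.

```python
from typing import List, Sequence


def oracle_action(acc: Sequence[float], cost: Sequence[float], lam: float) -> int:
    """Return the action index maximizing acc[a] - lam * cost[a].

    Ties are broken toward the cheaper action.
    """
    return max(range(len(acc)), key=lambda a: (acc[a] - lam * cost[a], -cost[a]))


def average_cost(acc_table: Sequence[Sequence[float]],
                 cost: Sequence[float], lam: float) -> float:
    """Average cost induced by applying the per-instance oracle at price lam."""
    actions = [oracle_action(row, cost, lam) for row in acc_table]
    return sum(cost[a] for a in actions) / len(actions)


def solve_lambda(acc_table: Sequence[Sequence[float]], cost: Sequence[float],
                 budget: float, iters: int = 50) -> float:
    """Binary-search the smallest price lam whose induced cost fits the budget."""
    if budget < min(cost):
        raise ValueError("budget is below the cost of the cheapest action")
    if average_cost(acc_table, cost, 0.0) <= budget:
        return 0.0  # the constraint is slack: compute is effectively free
    lam_lo, lam_hi = 0.0, 1.0
    # Grow the upper bracket until the budget is met (cost is non-increasing in lam).
    while average_cost(acc_table, cost, lam_hi) > budget:
        lam_hi *= 2.0
    for _ in range(iters):
        mid = (lam_lo + lam_hi) / 2.0
        if average_cost(acc_table, cost, mid) > budget:
            lam_lo = mid   # still over budget: raise the price
        else:
            lam_hi = mid   # feasible: try a lower price
    return lam_hi          # lam_hi is feasible throughout the search


# Toy data (hypothetical): 3 actions costing 1, 4, 16 units; 4 instances.
COST: List[float] = [1.0, 4.0, 16.0]
ACC_TABLE: List[List[float]] = [
    [0.95, 0.96, 0.97],  # easy: extra compute barely helps
    [0.50, 0.70, 0.80],  # medium
    [0.10, 0.30, 0.60],  # hard: extra compute helps a lot
    [0.02, 0.03, 0.04],  # out of reach: extra compute barely helps
]

if __name__ == "__main__":
    lam = solve_lambda(ACC_TABLE, COST, budget=6.0)
    print(lam, [oracle_action(row, COST, lam) for row in ACC_TABLE])
```

Because the action set is discrete, the induced cost is a step function of the price, so in a sketch like this the achieved average cost lands at the nearest step at or below the budget rather than exactly on it.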
Method
A two-stage "Solve-then-Learn" pipeline: Lagrangian relaxation first puts a price on compute and yields the optimal action for each instance, then a classifier is trained to predict these optimal actions from cheap input features for real-time deployment.
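The "Learn" stage can be sketched as ordinary supervised imitation: oracle actions become labels, cheap input features become inputs. The summary does not say which classifier or features the authors use, so the nearest-centroid model and the two features below (question length and count of numeric tokens) are stand-ins chosen only to keep the example dependency-free.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def fit_centroids(features: Sequence[Sequence[float]],
                  labels: Sequence[int]) -> Dict[int, List[float]]:
    """Average the feature vectors of each oracle action (nearest-centroid fit)."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for j, value in enumerate(x):
            sums[y][j] += value
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids: Dict[int, List[float]], x: Sequence[float]) -> int:
    """Predict the action whose centroid is closest in squared Euclidean distance."""
    def sqdist(c: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sqdist(centroids[y]))


def imitation_accuracy(centroids: Dict[int, List[float]],
                       features: Sequence[Sequence[float]],
                       labels: Sequence[int]) -> float:
    """Fraction of instances where the classifier reproduces the oracle action."""
    hits = sum(predict(centroids, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


# Hypothetical training set: (question length, numeric-token count) -> oracle action.
FEATURES = [(10, 1), (12, 2), (40, 6), (45, 5), (90, 12), (95, 14)]
ORACLE_ACTIONS = [0, 0, 1, 1, 2, 2]

if __name__ == "__main__":
    model = fit_centroids(FEATURES, ORACLE_ACTIONS)
    print(predict(model, (50, 6)), imitation_accuracy(model, FEATURES, ORACLE_ACTIONS))
```

At deployment only `predict` runs per request, which is what keeps allocation cheap: the oracle, which needs per-action accuracy estimates, is used offline to produce labels.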
In practice
- Apply to LLM reasoning tasks like MATH and GSM8K.
- Improve accuracy under fixed compute budgets.
- Deploy lightweight classifiers for real-time allocation.
Topics
- Adaptive Compute Allocation
- Large Language Models
- Constrained Policy Optimization
- Lagrangian Relaxation
- Test-Time Compute Scaling
Best for: Research Scientist, MLOps Engineer, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.