Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces

2025-09-25 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

The paper introduces OPT★, a scalable family of optimization-style tasks designed to train and evaluate Large Language Models' (LLMs) step-by-step optimization-like reasoning. OPT★ tasks feature a complexity parameter α that expands the search space without requiring new human labels, alongside inexpensive feasibility checkers and outcome evaluators. The research explores two training regimes: solver-guided online policy optimization, which uses a solver as a value oracle for partial states, and search-based offline RL, for when such solvers are unavailable. Empirically, training on OPT★ improves reasoning, with specific search components like feasibility pruning and duplicate-action merging enhancing efficiency. Experiments used Qwen2.5-3B, Llama-3.2-3B, and Qwen2.5-7B models, generating 1,000 training instances per α level across four complexity levels.

Key takeaway

For Machine Learning Engineers developing LLMs for complex decision-making, you should consider integrating optimization-based training tasks like OPT★. This approach allows you to scale reasoning difficulty without manual labeling, significantly improving model performance on constrained, multi-step problems. Implement search enhancements such as feasibility pruning and duplicate-action merging, or leverage solver-guided online reinforcement learning to provide robust, verifiable feedback. This will enable your models to consistently find higher-value solutions in real-world applications like scheduling or resource allocation.

Key insights

OPT★ provides scalable, auto-verifiable optimization tasks to train LLMs for complex step-by-step reasoning.

Principles

Optimization tasks offer scalable, auto-verifiable supervision.
Increasing search space complexity requires more information per budget.
Feasibility checks and duplicate merging boost search efficiency.

Method

Offline RL uses search to find and fine-tune on best trajectories, incorporating feasibility pruning and duplicate-action merging. Online RL uses a solver as a value oracle for partial states, applying rank-based reward shaping.

In practice

Apply feasibility pruning in LLM search for constrained problems.
Use rank-based reward shaping with solvers for online policy optimization.
Train LLMs on optimization tasks for better generalization to spatial reasoning.

Topics

Large Language Models
Optimization Reasoning
Reinforcement Learning
Search Algorithms
Verifiable Training
Curriculum Learning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Related on AIssential

See Counsel's argued verdicts on the open AI decisions leaders are weighing →

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.