Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

The paper introduces OPT★, a scalable family of optimization-style tasks designed to train and evaluate Large Language Models' (LLMs) step-by-step optimization-like reasoning. OPT★ tasks feature a complexity parameter α that expands the search space without requiring new human labels, alongside inexpensive feasibility checkers and outcome evaluators. The research explores two training regimes: solver-guided online policy optimization, which uses a solver as a value oracle for partial states, and search-based offline RL, for when such solvers are unavailable. Empirically, training on OPT★ improves reasoning, with specific search components like feasibility pruning and duplicate-action merging enhancing efficiency. Experiments used Qwen2.5-3B, Llama-3.2-3B, and Qwen2.5-7B models, generating 1,000 training instances per α level across four complexity levels.

Key takeaway

For Machine Learning Engineers developing LLMs for complex decision-making, you should consider integrating optimization-based training tasks like OPT★. This approach allows you to scale reasoning difficulty without manual labeling, significantly improving model performance on constrained, multi-step problems. Implement search enhancements such as feasibility pruning and duplicate-action merging, or leverage solver-guided online reinforcement learning to provide robust, verifiable feedback. This will enable your models to consistently find higher-value solutions in real-world applications like scheduling or resource allocation.

Key insights

OPT★ provides scalable, auto-verifiable optimization tasks to train LLMs for complex step-by-step reasoning.

Principles

Method

Offline RL uses search to find and fine-tune on best trajectories, incorporating feasibility pruning and duplicate-action merging. Online RL uses a solver as a value oracle for partial states, applying rank-based reward shaping.

In practice

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.