Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier
Summary
PROPEL is a solver-amortized framework for generating training tasks for reinforcement learning (RL) agents. It targets settings such as software engineering (SWE), where solver rollouts are prohibitively expensive. Introduced on June 10, 2026, PROPEL trains a lightweight activation probe once, on a corpus of tasks labeled with solver outcomes. The probe predicts the target solver's pass rate from a frozen generator's hidden states, so generator optimization uses a single forward pass in place of costly live solver evaluations. The result is a higher share of generated tasks on the "learnable frontier": for coding, that share rose from 10.1% to 20.0% with a Qwen2.5-3B-Instruct solver and from 5.3% to 12.6% with a Qwen2.5-7B-Instruct solver. In SWE, PROPEL doubled the share of learnable tasks from 9.8% to 19.6% for a Qwen3.5-27B solver on unseen repositories, while using fewer than half the solver trials of direct solver-in-the-loop RL.
Key takeaway
For research scientists developing RL agents, PROPEL offers a way to scale task generation without prohibitive computational cost. Consider using activation probes to amortize expensive solver rollouts, especially in domains like software engineering where evaluation is slow. This lets you train task generators that produce challenging yet solvable problems, which speeds up agent capability development and eases the "frontier task supply" bottleneck.
Key insights
Solver-amortized PROPEL uses activation probes to efficiently generate learnable-frontier tasks for RL agents, bypassing expensive solver rollouts.
Principles
- Internal model states encode task utility.
- Amortizing solver cost enables generator RL.
- Target solve rates define learnable task utility.
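The last principle can be made concrete with a small sketch. It assumes a hypothetical utility that peaks when a task's pass rate equals a chosen target and falls off linearly to zero at the farthest extreme. The function names, the 0.5 default target, and the linear shape are illustrative assumptions, not taken from the PROPEL paper.

```python
def empirical_pass_rate(outcomes):
    """Fraction of solver attempts that passed (outcomes is a list of booleans)."""
    if not outcomes:
        raise ValueError("need at least one solver outcome")
    return sum(1 for passed in outcomes if passed) / len(outcomes)


def frontier_utility(pass_rate, target=0.5):
    """Hypothetical utility: 1.0 at the target pass rate, 0.0 at the farthest extreme.

    The linear shape and the default target are assumptions for illustration only.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    farthest = max(target, 1.0 - target)  # largest possible distance from the target
    return 1.0 - abs(pass_rate - target) / farthest


# A task solved in 2 of 4 attempts sits exactly on the assumed target.
rate = empirical_pass_rate([True, False, True, False])
utility = frontier_utility(rate)
```

Under this assumed shape, tasks the solver always solves or never solves score zero, which matches the intuition that such tasks carry little training signal.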
Method
PROPEL trains a probe on a one-time labeled task corpus and frozen generator activations. This probe then predicts task solve rates, replacing live solver rollouts with a single forward pass during generator RL optimization.
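The probe idea described above can be sketched in a few lines, under stated assumptions. Here the probe is a plain logistic-regression model fit by gradient descent, and the "hidden states" are synthetic random vectors standing in for frozen generator activations. The probe architecture, the loss, and every name below are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_pass_rate(weights, bias, hidden_state):
    """Probe forward pass: maps one activation vector to a pass rate in [0, 1]."""
    return sigmoid(sum(w * h for w, h in zip(weights, hidden_state)) + bias)


def train_probe(hidden_states, pass_rates, lr=1.0, steps=1000):
    """Fit the probe once by full-batch gradient descent on cross-entropy
    with soft labels (the observed pass rates)."""
    dim, n = len(hidden_states[0]), len(hidden_states)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(hidden_states, pass_rates):
            err = predict_pass_rate(weights, bias, h) - y
            for j in range(dim):
                grad_w[j] += err * h[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Synthetic stand-in for the one-time labeled corpus (assumption: real data would
# pair frozen-generator activations with pass rates measured by solver rollouts).
rng = random.Random(0)
TRUE_W = [1.5, -2.0, 0.5, 0.0]
states = [[rng.gauss(0, 1) for _ in TRUE_W] for _ in range(200)]
rates = [sigmoid(sum(w * h for w, h in zip(TRUE_W, s))) for s in states]

weights, bias = train_probe(states, rates)

# Mean absolute error of the probe on the corpus it was fit to.
mae = sum(
    abs(predict_pass_rate(weights, bias, s) - r) for s, r in zip(states, rates)
) / len(states)
```

Once fit, `predict_pass_rate` costs one dot product per task, which is the sense in which a probe can stand in for live solver rollouts while the generator is being optimized. A real probe would need held-out evaluation, which this sketch omits.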
In practice
- Use probes to accelerate task generation.
- Apply worst-case optimization for diversity.
- Consider cold transfer of probes across models.
Topics
- Reinforcement Learning
- Task Generation
- Activation Probes
- Software Engineering
- Language Models
- Model Optimization
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.