Cross-Benchmark Generalization for Long-Horizon Agentic Tasks
Summary
A study on cross-benchmark generalization for long-horizon agentic tasks shows that training a Qwen3.5-122B-A10B model in a specialized reinforcement learning (RL) environment improves its performance on diverse external benchmarks, not only on the training distribution. The pipeline, a supervised fine-tuning (SFT) stage followed by RL with GSPO, yielded pass@1 gains of +17.3pp on the in-distribution holdout, +9.6pp on Toolathlon, +5.3pp on τ²-Bench, and +3.5pp on BFCL-V4. The trained model performed comparably to GPT-5.5 (medium reasoning effort), landing within approximately 1pp on Toolathlon and τ²-Bench at pass@1, and surpassed it on BFCL-V4 at pass@4 (72.2% vs. 69.4%). Key design decisions included an SFT stage to mitigate reward sparsity and dense rewards from per-criterion graders, which raised the average per-task reward from 0.30 to 0.51. Training also induced beneficial behavioral changes, such as parallel tool invocation and more reliable task closure.
Key takeaway
For Machine Learning Engineers developing agentic models, evaluating generalization requires moving beyond in-distribution holdouts. You should prioritize testing on diverse, external benchmarks like Toolathlon or τ²-Bench to truly assess capability transfer, not just specialization. If your base model struggles with reward sparsity, consider an SFT stage before RL and implement dense rewards from per-criterion graders; this approach significantly boosts training signal and can yield models competitive with leading proprietary systems.
Key insights
Cross-benchmark evaluation is crucial for assessing genuine agentic capability, revealing transfer beyond training specialization.
Principles
- Transferability is key for agentic task evaluation.
- Overfitting to training environments is a common failure mode.
- Dense rewards improve RL signal significantly.
Method
An SFT stage precedes RL training with GSPO. Dense rewards derived from per-criterion graders widen the set of tasks that yield a usable learning signal and strengthen that signal.
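To make the dense-reward idea concrete, here is a minimal sketch contrasting an all-or-nothing reward with one built from per-criterion grader verdicts. The source does not say how the criteria are aggregated, so the unweighted mean and the function names `sparse_reward` and `dense_reward` are assumptions made for illustration.

```python
from typing import Sequence


def sparse_reward(criteria_passed: Sequence[bool]) -> float:
    """All-or-nothing reward: 1.0 only if every criterion passes."""
    return 1.0 if criteria_passed and all(criteria_passed) else 0.0


def dense_reward(criteria_passed: Sequence[bool]) -> float:
    """Fraction of criteria passed (unweighted mean, an assumed aggregation)."""
    if not criteria_passed:
        return 0.0
    return sum(criteria_passed) / len(criteria_passed)


# Hypothetical rollout that satisfies 3 of 4 rubric criteria.
rollout = [True, True, False, True]
print(sparse_reward(rollout))  # 0.0
print(dense_reward(rollout))   # 0.75
```

Under the sparse scheme this rollout gives the policy no signal at all, while the dense scheme still credits the three criteria it met. That is the sense in which dense rewards can make more tasks usable for training.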
In practice
- Use SFT before RL for sparse reward tasks.
- Award dense rewards for partial completion, using per-criterion graders.
- Evaluate on external, disjoint benchmarks.
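The summary reports results at pass@1 and pass@4. The source does not state how those figures were computed. The sketch below uses the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of sampled attempts per task and c the number that succeeded. The helper names and the sample counts are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one task: n attempts, c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per task."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


# Hypothetical task with 4 attempts, 1 of which succeeded.
print(pass_at_k(4, 1, 1))  # 0.25
print(pass_at_k(4, 1, 4))  # 1.0
```

Running the same estimator over the in-distribution holdout and each external benchmark keeps the comparison consistent, so that differences reflect transfer rather than a change in metric.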
Topics
- Agentic AI
- Cross-Benchmark Evaluation
- Reinforcement Learning
- Supervised Fine-Tuning
- Dense Rewards
- Tool Use Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.