Cross-Benchmark Generalization for Long-Horizon Agentic Tasks

2026-05-21 · Source: Surge AI Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, medium

Summary

A study on cross-benchmark generalization for long-horizon agentic tasks reports that training a Qwen3.5-122B-A10B model in a specialized reinforcement learning (RL) environment improves its performance on diverse external benchmarks, not only on the training distribution. The training pipeline, a supervised fine-tuning (SFT) stage followed by RL with GSPO (Group Sequence Policy Optimization), yielded gains of +17.3pp on the in-distribution holdout, +9.6pp on Toolathlon, +5.3pp on τ²-Bench, and +3.5pp on BFCL-V4 at pass@1. The trained model performed comparably to GPT-5.5 (medium reasoning effort), often within approximately 1pp on Toolathlon and τ²-Bench at pass@1, and surpassed it on BFCL-V4 at pass@4 (72.2% vs. 69.4%). Two key design decisions were an SFT stage to mitigate reward sparsity and dense rewards from per-criterion graders, which raised average per-task reward from 0.30 to 0.51. Training also induced beneficial behavioral changes such as parallel tool invocation and improved task closure.

Key takeaway

If you are a machine learning engineer developing agentic models, an in-distribution holdout is not enough to evaluate generalization. Test on diverse external benchmarks such as Toolathlon or τ²-Bench to measure capability transfer rather than specialization. If RL on your base model suffers from reward sparsity, consider an SFT stage before RL and dense rewards from per-criterion graders; in the study, this raised average per-task reward from 0.30 to 0.51 and produced a model competitive with GPT-5.5 (medium reasoning effort) on external benchmarks.

Key insights

Cross-benchmark evaluation is crucial for assessing genuine agentic capability because it shows whether gains transfer beyond the training environment or merely reflect specialization.

Principles

Transfer to external benchmarks is the key test in agentic task evaluation.
Overfitting to the training environment is a common failure mode.
Dense rewards strengthen the RL training signal (average per-task reward rose from 0.30 to 0.51).

Method

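An SFT stage precedes RL training with GSPO. The SFT stage mitigates reward sparsity by widening the set of tasks the model can solve at all, and dense rewards from per-criterion graders give credit for partially completed tasks, which strengthens the training signal.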
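
To make the difference between a sparse reward and a dense per-criterion reward concrete, here is a minimal Python sketch. The source does not describe the grader implementation, so everything here is an assumption for illustration: the `Criterion` type, the dictionary-shaped trajectory, the weighted-average aggregation, and the three example criteria are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical representation: a finished rollout summarized as a plain dict.
Trajectory = Dict[str, Any]


@dataclass
class Criterion:
    """One gradable requirement of a task (hypothetical structure)."""
    name: str
    check: Callable[[Trajectory], bool]
    weight: float = 1.0


def sparse_reward(traj: Trajectory, criteria: List[Criterion]) -> float:
    """All-or-nothing: 1.0 only if every criterion passes."""
    return 1.0 if all(c.check(traj) for c in criteria) else 0.0


def dense_reward(traj: Trajectory, criteria: List[Criterion]) -> float:
    """Partial credit: weighted fraction of criteria that pass.

    Weighted averaging is an assumed aggregation rule, not one
    confirmed by the source.
    """
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in criteria if c.check(traj))
    return earned / total


# Illustrative criteria for an imaginary tool-use task.
criteria = [
    Criterion("called_search_tool", lambda t: "search" in t["tools_called"]),
    Criterion("wrote_output_file", lambda t: bool(t["output_written"])),
    Criterion("final_answer_given", lambda t: t["final_answer"] is not None),
]

# A rollout that did most of the work but never produced a final answer.
traj = {"tools_called": ["search"], "output_written": True, "final_answer": None}

sparse = sparse_reward(traj, criteria)  # 0.0: no learning signal
dense = dense_reward(traj, criteria)    # 2/3: partial credit
```

Under the sparse scheme this rollout is indistinguishable from one that did nothing, whereas the dense scheme ranks it above an empty rollout, which is the kind of extra signal the summary attributes to per-criterion grading.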

In practice

Use SFT before RL for sparse-reward tasks.
Implement dense rewards from partial completion.
Evaluate on external, disjoint benchmarks.
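
The benchmark figures in the summary are reported at pass@1 and pass@4. The source does not say how these were computed; the sketch below assumes the commonly used unbiased pass@k estimator from the code-generation evaluation literature, where each task is attempted `n` times and `c` attempts succeed. The function names and the example counts are hypothetical.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k sampled attempts succeeds),
    given c successes observed in n attempts on a single task."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average per-task pass@k over a benchmark.

    `results` holds one (n, c) pair per task.
    """
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up counts for three tasks, four attempts each (not from the study).
results = [(4, 1), (4, 0), (4, 4)]

p1 = benchmark_pass_at_k(results, k=1)  # (0.25 + 0.0 + 1.0) / 3
p4 = benchmark_pass_at_k(results, k=4)  # (1.0 + 0.0 + 1.0) / 3
```

Computing the same metric at the same `k` on every benchmark, with the same number of attempts per task, keeps base-model and trained-model scores comparable when measuring gains in percentage points.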

Topics

Agentic AI
Cross-Benchmark Evaluation
Reinforcement Learning
Supervised Fine-Tuning
Dense Rewards
Tool Use Models

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.