What Makes Interaction Trajectories Effective for Training Terminal Agents?

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

Research on training terminal agents reveals a "pedagogical paradox": agents with higher standalone performance, such as Claude Opus 4.6 on Terminal-Bench 2.0, are not necessarily better teachers. Students fine-tuned on trajectories from lower-scoring agents such as DeepSeek-V3.2 generalize significantly better. The study attributes this to Environment-Grounded Supervision (EGS): trajectories that explicitly expose inspect-act-verify behaviors teach the student robust problem-solving routines rather than final outcomes alone. The finding came out of the Terminal-Lego pipeline, which transforms real-world issues into agentic tasks. The study also reports strong data efficiency: Qwen3-32B scored 24.3% on Terminal-Bench 2.0 using only 15.3k Terminal-Lego trajectories, matching the previous state of the art with over 30x less data. This shifts the focus toward "Harness Engineering" for reproducible agentic intelligence.

Key takeaway

For AI engineers developing terminal agents, choosing a teacher agent for post-training solely by its standalone performance is misguided. Prioritize training harnesses that explicitly expose environment-grounded "inspect-act-verify" behaviors rather than just matching outcomes. This "Harness Engineering" approach, built on Environment-Grounded Supervision, can yield significantly more generalizable agents and, going by the reported results, may reduce training data needs by over 30x.
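To make "inspect-act-verify" concrete, here is a minimal Python sketch of how a harness might check whether a trajectory exposes that pattern. Everything in it is an illustrative assumption and not the study's method: the `Step` record, the `INSPECT_CMDS` and `VERIFY_CMDS` vocabularies, and the first-token heuristic are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical command vocabularies (assumptions for illustration only).
INSPECT_CMDS = {"ls", "cat", "grep", "find", "head", "pwd"}
VERIFY_CMDS = {"pytest", "test", "diff"}


@dataclass
class Step:
    command: str      # shell command issued by the agent
    observation: str  # terminal output returned by the environment


def phase(command: str) -> str:
    """Label a command as inspect, verify, or act by its first token."""
    parts = command.split()
    head = parts[0] if parts else ""
    if head in INSPECT_CMDS:
        return "inspect"
    if head in VERIFY_CMDS:
        return "verify"
    return "act"


def exposes_inspect_act_verify(steps: List[Step]) -> bool:
    """True if an inspect step precedes an act step that precedes a verify step."""
    wanted = ["inspect", "act", "verify"]
    idx = 0
    for step in steps:
        if phase(step.command) == wanted[idx]:
            idx += 1
            if idx == len(wanted):
                return True
    return False


if __name__ == "__main__":
    grounded = [
        Step("ls src", "main.py"),
        Step("sed -i s/a/b/ main.py", ""),
        Step("pytest -q", "1 passed"),
    ]
    print(exposes_inspect_act_verify(grounded))
```

A real harness would likely need a richer classifier than first-token matching; the point is only that the pattern can be checked on the trajectory itself and not only on the final outcome.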

Key insights

Teaching efficacy for terminal agents stems from environment-grounded interaction trajectories, not just standalone agent performance.

Method

The Terminal-Lego pipeline transforms multi-domain real-world issues into environment-verified agentic tasks, generating trajectories that expose inspect-act-verify behaviors for student fine-tuning.
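As a rough illustration of what "environment-verified" could mean in code, the sketch below keeps only tasks whose verification command succeeds when actually run. The `AgenticTask` record, its fields, and the exit-code convention are hypothetical assumptions for this example; the source does not describe Terminal-Lego's actual interfaces.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class AgenticTask:
    issue: str             # description of the real-world issue (hypothetical field)
    verify_cmd: List[str]  # command whose exit code 0 is assumed to mean "solved"


def environment_verified(task: AgenticTask, timeout_s: float = 30.0) -> bool:
    """Run the verification command and report whether it exits with code 0."""
    try:
        result = subprocess.run(
            task.verify_cmd, capture_output=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable check counts as unverified.
        return False
    return result.returncode == 0


def keep_verified(tasks: List[AgenticTask]) -> List[AgenticTask]:
    """Filter out tasks whose verification command does not succeed."""
    return [task for task in tasks if environment_verified(task)]


if __name__ == "__main__":
    # Stand-in checks that simply exit with a fixed status code.
    passing = AgenticTask("passes", [sys.executable, "-c", "raise SystemExit(0)"])
    failing = AgenticTask("fails", [sys.executable, "-c", "raise SystemExit(1)"])
    print([t.issue for t in keep_verified([passing, failing])])
```

Running the check in the environment, as opposed to comparing against a stored answer, is the design choice this sketch is meant to highlight.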

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.