What Makes Interaction Trajectories Effective for Training Terminal Agents?
Summary
Research into training terminal agents reveals a "pedagogical paradox": agents with higher standalone performance, such as Claude Opus 4.6 on Terminal-Bench 2.0, are not necessarily better teachers. Students fine-tuned on trajectories from lower-scoring agents such as DeepSeek-V3.2 generalize significantly better. The study attributes this to Environment-Grounded Supervision (EGS): trajectories that explicitly expose inspect-act-verify behaviors teach students robust problem-solving routines rather than final answers alone. The finding was made possible by the Terminal-Lego pipeline, which transforms real-world issues into agentic tasks. The study also reports strong data efficiency: Qwen3-32B scored 24.3% on Terminal-Bench 2.0 after training on only 15.3k Terminal-Lego trajectories, matching the previous state of the art while using over 30 times less data. The result shifts the focus toward "Harness Engineering" as the route to reproducible agentic intelligence.
Key takeaway
For AI engineers developing terminal agents, choosing a teacher agent for post-training purely by its benchmark score is misguided. Prioritize training harnesses that explicitly expose environment-grounded "inspect-act-verify" behaviors instead of trajectories that only match the final outcome. This "Harness Engineering" approach, built on Environment-Grounded Supervision, can yield more generalizable agents from far less data, potentially cutting training data needs by over 30x.
Key insights
Teaching efficacy for terminal agents stems from environment-grounded interaction trajectories, not just standalone agent performance.
Principles
- Standalone agent performance does not dictate teaching efficacy.
- Environment-Grounded Supervision (EGS) fosters robust problem-solving.
- "Harness Engineering" is key for generalizable agentic intelligence.
Method
The Terminal-Lego pipeline transforms multi-domain real-world issues into environment-verified agentic tasks, generating trajectories that expose inspect-act-verify behaviors for student fine-tuning.
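To make the inspect-act-verify idea concrete, here is a minimal sketch of how a trajectory might be represented and checked for that pattern. The `Step` structure, the phase labels, and `has_inspect_act_verify` are hypothetical names chosen for illustration; the summary does not describe the actual Terminal-Lego data format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical phase labels; the real pipeline's schema is not given in the summary.
PHASES = ("inspect", "act", "verify")


@dataclass
class Step:
    phase: str        # one of PHASES (assumed labeling)
    command: str      # shell command the agent issued
    observation: str  # what the environment returned


def has_inspect_act_verify(steps: List[Step]) -> bool:
    """Return True if the steps contain inspect, then act, then verify, in order.

    Other steps may appear in between; only the relative order matters.
    """
    remaining = iter(PHASES)
    target = next(remaining)
    for step in steps:
        if step.phase == target:
            target = next(remaining, None)
            if target is None:
                return True
    return False


# A trajectory that looks at the environment, changes it, then checks the result.
grounded = [
    Step("inspect", "ls tests/", "test_app.py"),
    Step("act", "sed -i 's/foo/bar/' app.py", ""),
    Step("verify", "pytest -q", "1 passed"),
]

# A trajectory that only writes an answer without looking or checking.
outcome_only = [Step("act", "echo fixed > app.py", "")]
```

Under these assumptions, `has_inspect_act_verify(grounded)` holds while `has_inspect_act_verify(outcome_only)` does not, which is the distinction the summary draws between grounded supervision and outcome matching.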
In practice
- Prioritize trajectory quality over teacher agent score.
- Design harnesses to reveal inspect-act-verify steps.
- Explore EGS for data-efficient agent post-training.
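One way to act on the points above is a harness that records every command together with its phase and the environment's response, so the resulting trajectory shows how the agent looked, acted, and checked. The sketch below is an assumption-laden illustration: `RecordingHarness`, `Record`, and the phase names are hypothetical and are not taken from the original study.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A runner takes a command string and returns (exit_code, combined_output).
Runner = Callable[[str], Tuple[int, str]]


@dataclass
class Record:
    phase: str
    command: str
    exit_code: int
    output: str


class RecordingHarness:
    """Hypothetical harness that logs each command with its phase and output."""

    PHASES = ("inspect", "act", "verify")

    def __init__(self, runner: Optional[Runner] = None) -> None:
        self.records: List[Record] = []
        # Default to a real shell; a fake runner can be injected for testing.
        self._runner: Runner = runner or self._run_shell

    @staticmethod
    def _run_shell(command: str) -> Tuple[int, str]:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        return proc.returncode, proc.stdout + proc.stderr

    def step(self, phase: str, command: str) -> Record:
        """Run one command and store what the environment actually returned."""
        if phase not in self.PHASES:
            raise ValueError(f"unknown phase: {phase!r}")
        exit_code, output = self._runner(command)
        record = Record(phase, command, exit_code, output)
        self.records.append(record)
        return record

    def phase_sequence(self) -> List[str]:
        """Return the ordered list of phases seen so far."""
        return [r.phase for r in self.records]
```

Injecting the runner keeps the sketch testable without touching a real shell; in a real setup the default runner would execute inside a sandboxed task environment.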
Topics
- Terminal Agents
- Agent Training
- Environment-Grounded Supervision
- Harness Engineering
- Data Efficiency
- Code Agents
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.