TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents
Summary
Released on June 26, 2026, TUA-Bench is a general-purpose benchmark for terminal-use agents (TUAs), developed by Meta AI, Duke University, and Stanford University. It addresses a gap in existing evaluations, which primarily target graphical user interfaces or programming-specific shell tasks. TUA-Bench comprises 120 manually designed, real-world tasks across five families, covering routine digital activities such as document editing and email management as well as specialized scientific and engineering workflows co-designed with PhD-level experts. Tasks run in isolated Linux containers with deterministic setup scripts and are evaluated via an execution-based scoring protocol. In initial evaluations, the strongest frontier agent, Claude Code with Claude Opus 4.8 at max reasoning effort, scored only 65.8% overall, which shows that general-purpose terminal agents still face substantial challenges.
Key takeaway
AI Scientists and ML Engineers developing general-purpose LLM agents should integrate TUA-Bench into their evaluation pipelines. The benchmark provides a realistic, text-native environment for assessing agent capabilities beyond GUIs or coding. Focus on improving long-horizon planning, tool use, and error recovery, since the strongest frontier agent evaluated reaches only 65.8% overall. Optimize for cost-performance by balancing reasoning effort and selecting agent scaffolds carefully, as both significantly influence overall results.
Key insights
Terminal-use agents require specialized benchmarks like TUA-Bench to evaluate general computer-use capabilities beyond GUIs or coding.
Principles
- CLI environments align with LLM strengths.
- Diverse, realistic tasks reveal agent gaps.
- Agent scaffold impacts model performance.
Method
Everyday tasks are manually curated from GUI benchmarks, while professional tasks are co-designed with PhD-level experts. All tasks run in isolated Linux containers and are evaluated via execution-based scoring.
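To make the idea of execution-based scoring concrete, here is a minimal sketch, not TUA-Bench's actual harness. It assumes a temporary directory stands in for the isolated container, and the names `Task`, `run_task`, and the example task are hypothetical. The point it illustrates is that the checker inspects the final state left behind by the agent rather than the agent's transcript.

```python
import subprocess
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical task record; not the benchmark's real schema."""
    name: str
    setup: Callable[[Path], None]  # deterministic: same initial state on every run
    check: Callable[[Path], bool]  # inspects the final state, not the transcript


def run_task(task: Task, agent_commands: List[str], time_limit_s: float = 2400.0) -> bool:
    """Run shell commands in a fresh working directory, then score the result.

    Assumption: a temporary directory stands in for an isolated container.
    """
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.setup(workdir)
        deadline = time.monotonic() + time_limit_s
        for cmd in agent_commands:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False  # task-level time budget exhausted
            try:
                subprocess.run(
                    ["sh", "-c", cmd],
                    cwd=workdir,
                    timeout=remaining,
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                )
            except subprocess.TimeoutExpired:
                return False
        return task.check(workdir)


# Illustrative task: upper-case the contents of notes.txt.
def _setup(workdir: Path) -> None:
    (workdir / "notes.txt").write_text("hello world\n")


def _check(workdir: Path) -> bool:
    return (workdir / "notes.txt").read_text().strip() == "HELLO WORLD"


uppercase_task = Task(name="uppercase-notes", setup=_setup, check=_check)
solution = ["tr 'a-z' 'A-Z' < notes.txt > out.txt && mv out.txt notes.txt"]
```

Calling `run_task(uppercase_task, solution)` passes only if the file ends up in the expected state, regardless of which commands were used to get there. A real harness would replace the temporary directory with a container and the command list with an interactive agent loop.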
In practice
- Use a 2400 s execution time limit per task.
- Use medium-to-high reasoning effort to balance cost and performance.
- Evaluate each model across different agent scaffolds.
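One way to compare scaffolds and reasoning-effort settings on cost and performance together is to keep only the configurations that no other configuration beats on both axes (a Pareto front). The sketch below is an assumption about how such a comparison could be set up, not the paper's analysis. `RunResult`, `pareto_front`, the scaffold names, and every number are placeholders, not reported results.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one benchmark run."""
    scaffold: str
    effort: str
    success_rate: float  # fraction of tasks passed, between 0 and 1
    cost_usd: float      # total cost of the run


def pareto_front(results: List[RunResult]) -> List[RunResult]:
    """Keep runs that no other run matches or beats on both success and cost."""
    front = []
    for r in results:
        dominated = any(
            o.success_rate >= r.success_rate
            and o.cost_usd <= r.cost_usd
            and (o.success_rate > r.success_rate or o.cost_usd < r.cost_usd)
            for o in results
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.cost_usd)


# Placeholder values for illustration only; these are NOT benchmark results.
runs = [
    RunResult("scaffold_a", "low", 0.40, 10.0),
    RunResult("scaffold_a", "medium", 0.55, 18.0),
    RunResult("scaffold_a", "max", 0.56, 60.0),
    RunResult("scaffold_b", "medium", 0.50, 25.0),
    RunResult("scaffold_b", "high", 0.60, 30.0),
]

for r in pareto_front(runs):
    print(f"{r.scaffold:<12}{r.effort:<8}{r.success_rate:.2f}  ${r.cost_usd:.2f}")
```

With these placeholder inputs, the maximum-effort run drops out because a cheaper configuration achieves a higher success rate, which is the kind of trade-off the bullets above describe.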
Topics
- Terminal-Use Agents
- LLM Benchmarking
- Command-Line Interfaces
- Agent Evaluation
- Multi-step Workflows
- Cost-Performance Trade-offs
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.