TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

2023-10-11 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, extended

Summary

Released on June 26, 2026, TUA-Bench is a general-purpose benchmark for terminal-use agents (TUAs), developed by Meta AI, Duke University, and Stanford University. It addresses a gap in existing evaluations, which primarily target graphical user interfaces or programming-specific shell tasks. TUA-Bench comprises 120 manually designed, real-world tasks across five families, covering routine digital activities such as document editing and email management as well as specialized scientific and engineering workflows co-designed with PhD-level experts. Tasks run in isolated Linux containers with deterministic setup scripts and are scored with an execution-based protocol. In initial evaluations, the strongest frontier agent, Claude Code with Claude Opus 4.8 at max reasoning effort, reached only 65.8% overall, leaving substantial room for improvement in general-purpose terminal agents.

Key takeaway

AI Scientists and ML Engineers developing general-purpose LLM agents should integrate TUA-Bench into their evaluation pipelines. The benchmark provides a realistic, text-native environment for assessing agent capabilities beyond GUIs or coding. Focus on improving long-horizon planning, tool use, and error recovery, since the strongest frontier agent reaches only 65.8% overall. Optimize for cost-performance by balancing reasoning effort and choosing the agent scaffold carefully, as both significantly influence overall results.

Key insights

Terminal-use agents require specialized benchmarks like TUA-Bench to evaluate general computer-use capabilities beyond GUIs or coding.

Principles

CLI environments align with LLM strengths.
Diverse, realistic tasks reveal agent gaps.
Agent scaffold impacts model performance.

Method

Everyday tasks in TUA-Bench are manually curated from GUI benchmarks, while professional tasks are co-designed with PhD-level experts. All tasks run in isolated Linux containers and are evaluated via execution-based scoring.
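The setup, run, score loop described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: it uses a temporary directory as a stand-in for an isolated container, and the names `run_task` and `report_has_total`, along with the toy task itself, are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def run_task(setup_cmd: str, agent_cmd: str, check) -> bool:
    """Run setup and agent commands in a fresh directory, then score the end state."""
    # Assumption: a temp directory stands in for the isolated Linux container.
    with tempfile.TemporaryDirectory() as workdir:
        # Deterministic setup: the same script always yields the same starting state.
        subprocess.run(setup_cmd, shell=True, cwd=workdir, check=True)
        # Stand-in for the agent; a real harness would let the agent issue commands here.
        subprocess.run(agent_cmd, shell=True, cwd=workdir, check=False)
        # Execution-based scoring: inspect what the workspace contains afterwards.
        return bool(check(Path(workdir)))


def report_has_total(workdir: Path) -> bool:
    """Hypothetical checker: the task passes if report.txt holds the expected sum."""
    report = workdir / "report.txt"
    return report.exists() and report.read_text().strip() == "total=6"


if __name__ == "__main__":
    passed = run_task(
        setup_cmd="printf '1\\n2\\n3\\n' > numbers.txt",
        agent_cmd="awk '{s+=$1} END {print \"total=\" s}' numbers.txt > report.txt",
        check=report_has_total,
    )
    print(passed)
```

The checker looks only at the final state of the workspace, so any sequence of commands that produces the right file passes, which is the property that makes execution-based scoring independent of how the agent got there.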

In practice

Use a 2400 s execution time limit per task.
Medium-to-high reasoning effort balances cost and performance.
Evaluate models across different agent scaffolds.
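One way to apply these points in an evaluation script is sketched below. It is an assumption-laden illustration rather than the benchmark's tooling: `run_with_limit` and `success_rates` are hypothetical helpers, and treating a run that exceeds the limit as a failure is a choice made here, not something the source specifies.

```python
import subprocess
import sys
from collections import defaultdict

TIME_LIMIT_S = 2400  # per-task execution limit quoted above


def run_with_limit(cmd: list[str], limit_s: float = TIME_LIMIT_S) -> bool:
    """Return True only if the command exits with status 0 within the limit."""
    try:
        completed = subprocess.run(cmd, timeout=limit_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return False  # assumption: exceeding the limit counts as a failure
    return completed.returncode == 0


def success_rates(results):
    """Aggregate (scaffold, model, passed) records into a success rate per pair."""
    counts = defaultdict(lambda: [0, 0])  # (scaffold, model) -> [passed, total]
    for scaffold, model, passed in results:
        counts[(scaffold, model)][0] += int(passed)
        counts[(scaffold, model)][1] += 1
    return {pair: p / t for pair, (p, t) in counts.items()}


if __name__ == "__main__":
    # Placeholder records; real ones would come from running each scaffold on each task.
    records = [
        ("scaffold_a", "model_x", run_with_limit([sys.executable, "-c", "pass"])),
        ("scaffold_b", "model_x", run_with_limit([sys.executable, "-c", "raise SystemExit(1)"])),
    ]
    for pair, rate in sorted(success_rates(records).items()):
        print(pair, rate)
```

Keying results on the (scaffold, model) pair rather than the model alone keeps the scaffold's contribution visible when comparing runs.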

Topics

Terminal-Use Agents
LLM Benchmarking
Command-Line Interfaces
Agent Evaluation
Multi-step Workflows
Cost-Performance Trade-offs

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.