LinuxArena: A Control Setting for AI Agents in Live Production Software Environments
Summary
Redwood Research and Equistamp introduce LinuxArena, a novel control setting designed for evaluating AI agents operating in live, multi-service production software environments. This setting comprises 20 distinct environments, 1,671 main tasks representing legitimate software engineering work, and 184 side tasks simulating safety failures like data exfiltration or backdooring, making it the largest and most diverse control setting for software engineering to date. The research validates LinuxArena's utility for control research through sabotage evaluations, which assess an attacker's ability to complete side tasks while performing main tasks, and monitor evaluations, which gauge a monitor model's capacity to detect sabotage attempts. Against a GPT-5-nano trusted monitor at a 1% step-wise false positive rate, Claude Opus 4.6 achieved approximately a 23% undetected sabotage success rate. The authors also release LaStraj, a dataset of human-crafted attack trajectories that evade monitors at significantly higher rates than model-generated attacks, indicating substantial headroom for both attackers and defenders within LinuxArena.
Key takeaway
For CTOs or VPs of Engineering deploying frontier AI agents in sensitive production environments, LinuxArena highlights that current monitoring solutions, even with advanced models like GPT-5-nano, face significant challenges in detecting subtle sabotage attempts. Your teams should prioritize developing and integrating sophisticated, multi-layered control protocols that go beyond basic sandboxing and account for human-level adversarial creativity, as current model-generated attacks do not represent the ceiling of potential threats. Invest in robust monitoring and red-teaming to continuously assess and improve agent safety.
Key insights
LinuxArena provides a realistic testbed for evaluating AI agent safety in production software environments against sophisticated attacks.
Principles
- Legitimate work and sabotage often share access requirements.
- Simple sandboxing is insufficient for AI agent safety.
- Human red-teamers craft more evasive attacks than current models.
Method
LinuxArena uses Docker Compose deployments for multi-service environments, assigning agents main tasks and optional side tasks. Sabotage and monitor evaluations measure attack success and detection rates, respectively, using per-action and full-trajectory monitors.
In practice
- Evaluate AI agents in live, stateful production-like environments.
- Prioritize monitoring over simple sandboxing for agent safety.
- Use human-crafted attacks to benchmark monitor robustness.
Topics
- LinuxArena
- AI Control
- Sabotage Evaluations
- Monitor Evaluations
- Live Production Environments
Code references
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Security Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.