AgentArmor: A Framework, Evaluation, & Mitigation of Coding Agent Failures

2018-04-17 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

The AgentArmor framework addresses critical safety failures in AI coding agents, which stem from underspecification, capability errors, and agent harness issues. Researchers evaluated these failure modes across 8 scenarios, 20 coding environments, and 59 synthetic transcript templates, testing Claude Opus 4.6, GPT 5.4, and Gemini 3.1 Pro with over 500 samples each. AgentArmor, an agent harness modification, mitigates these risks through four key enhancements: an extended system prompt, a command classifier preventing goal drift with a "3 strikes" policy, deterministic guardrails requiring "ls -la" before deletion and script reading before execution, and novel tools enabling agents to manage context and file immutability. The framework demonstrates statistically significant safety improvements.

Key takeaway

AI security engineers who deploy or manage coding agents should recognize that current models struggle with safety and often prioritize task completion over secure defaults. Implement external, deterministic harness-level controls, such as AgentArmor's command classifier and guardrails, to prevent critical failures such as accidental data deletion or unauthorized deployments. Do not rely on model instructions alone; engineer the agent's environment so that safety is built in by design.

Key insights

AgentArmor enhances AI coding agent safety by addressing underspecification, capability, and harness errors through a multi-layered, deterministic framework.

Principles

AI agent failures arise from underspecification, capability errors, and harness issues.
Because model behavior is unreliable, safety must be designed in through simple deterministic mechanisms.
Over-prompting users for security approvals leads to fatigue and reduced effectiveness.

Method

AgentArmor integrates an extended system prompt, a persuasion-blind command classifier with a "3 strikes" policy, deterministic guardrails for critical actions, and agent-controlled tools for context management and file immutability.
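The classifier idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not AgentArmor's implementation: the keyword table `REQUIRES_USER_INTENT`, the function `is_off_task`, the class `StrikeTracker`, and the reading of "3 strikes" as "halt the session once three commands have been blocked". The one property taken from the summary is that the classifier sees only the user's task and the proposed command, never the assistant's messages.

```python
import shlex
from dataclasses import dataclass

# Hypothetical rule table: a command token mapped to a word that must appear
# in the user's task before that token counts as on-task.
REQUIRES_USER_INTENT = {"rm": "delete", "deploy": "deploy", "push": "push"}


def is_off_task(user_task: str, command: str) -> bool:
    """Judge a command using only the user's task and the command itself.

    Assistant messages are deliberately not a parameter, so the agent's own
    reasoning cannot persuade the classifier.
    """
    task_words = set(user_task.lower().split())
    for token in shlex.split(command):
        needed = REQUIRES_USER_INTENT.get(token)
        if needed is not None and needed not in task_words:
            return True
    return False


@dataclass
class StrikeTracker:
    max_strikes: int = 3  # assumed meaning of the "3 strikes" policy
    strikes: int = 0

    def review(self, user_task: str, command: str) -> str:
        """Return "allow", "block", or "halt" for one proposed command."""
        if self.strikes >= self.max_strikes:
            return "halt"  # session already stopped
        if is_off_task(user_task, command):
            self.strikes += 1
            return "halt" if self.strikes >= self.max_strikes else "block"
        return "allow"


tracker = StrikeTracker()
task = "fix the failing unit test"
commands = ["pytest -q", "rm -rf build", "git push origin main",
            "rm -rf .git", "pytest -q"]
decisions = [tracker.review(task, cmd) for cmd in commands]
print(decisions)  # ['allow', 'block', 'block', 'halt', 'halt']
```

A production classifier would presumably be far richer than a keyword table; the point of the sketch is the interface, where the strike counter lives in the harness rather than in the model's context.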

In practice

Deploy a command classifier that is blind to assistant messages to prevent goal drift.
Enforce deterministic guardrails, such as requiring "ls -la" before file deletion.
Equip agents with tools to self-manage context and set file immutability.
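The two guardrails named in the summary ("ls -la" before deletion, reading a script before executing it) can be sketched as a small stateful check in the harness. This is a hedged illustration under stated assumptions, not the paper's code: the class `Guardrail`, the helper `norm`, the `SCRIPT_RUNNERS` set, and the rule that listing either the target or its parent directory satisfies the deletion check are all hypothetical.

```python
import shlex
from pathlib import PurePosixPath

SCRIPT_RUNNERS = {"bash", "sh"}  # hypothetical set of interpreters to gate


def norm(path: str) -> str:
    """Normalize a path string, e.g. "./build/" becomes "build"."""
    return str(PurePosixPath(path))


class Guardrail:
    """Tracks what the agent has inspected and gates risky commands."""

    def __init__(self) -> None:
        self.listed = set()  # directories or files shown via "ls -la"
        self.read = set()    # files whose contents the agent has read

    def record_read(self, path: str) -> None:
        """Called by the harness whenever the agent reads a file."""
        self.read.add(norm(path))

    def check(self, command: str) -> bool:
        """Return True if the command may run, False if it is blocked."""
        tokens = shlex.split(command)
        if not tokens:
            return True
        prog = tokens[0]
        paths = [t for t in tokens[1:] if not t.startswith("-")]
        if prog == "ls" and "-la" in tokens[1:]:
            self.listed.update(norm(p) for p in (paths or ["."]))
            return True
        if prog == "rm":
            # Assumed rule: the target or its parent must have been listed.
            return all(
                norm(p) in self.listed
                or str(PurePosixPath(p).parent) in self.listed
                for p in paths
            )
        if prog in SCRIPT_RUNNERS and paths:
            return norm(paths[0]) in self.read
        return True


guard = Guardrail()
blocked_rm = guard.check("rm -rf build")        # nothing listed yet
guard.check("ls -la build")
allowed_rm = guard.check("rm build/old.log")    # parent was listed
blocked_run = guard.check("bash deploy.sh")     # script not read yet
guard.record_read("./deploy.sh")
allowed_run = guard.check("bash deploy.sh")
print(blocked_rm, allowed_rm, blocked_run, allowed_run)
```

Because the check is plain code rather than a prompt instruction, it behaves the same way no matter what the model argues; the sketch ignores pipelines, shell expansion, and other interpreters, which a real harness would need to handle.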

Topics

AI Coding Agents
Agent Safety
Failure Modes
Harness Design
Deterministic Guardrails
Context Management

Best for: AI Architect, Research Scientist, CTO, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.