Auditing Agent Harness Safety

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

HarnessAudit is a new framework designed to evaluate the safety of Large Language Model (LLM) agents within their execution harnesses, focusing on full execution trajectories rather than just final outputs. The framework addresses the critical gap where current safety benchmarks often miss violations occurring mid-trajectory, such as unauthorized resource access or information leakage. HarnessAudit-Bench, a companion benchmark, comprises 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models like ChatGPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro, and three multi-agent frameworks, the study found that task completion often misaligns with safe execution, violations increase with trajectory length, and safety risks vary by domain and agent role. Most violations concentrate in resource access and inter-agent information transfer, with multi-agent collaboration expanding the safety risk surface.

Key takeaway

For AI Architects and Research Scientists designing or deploying LLM agent systems, you must shift your safety evaluation from output-level checks to comprehensive trajectory auditing. Focus on implementing robust controls for resource access and inter-agent information transfer, as these are identified as primary failure points. Your harness design critically determines the upper bound of safe deployment, so invest in frameworks that offer strong orchestration and boundary control, especially for multi-agent collaborations, to mitigate risks like sensitive data leakage and unauthorized actions.

Key insights

Agent safety evaluation must audit full execution trajectories, not just final outputs, to detect critical mid-trajectory violations.

Principles

Method

HarnessAudit formalizes agent harnesses as policy-constrained systems, auditing trajectories via hidden, agent-independent evidence channels for tool calls, resource accesses, and inter-component messages across three layers: boundary compliance, execution fidelity, and system stability.

In practice

Topics

Code references

Best for: AI Architect, Research Scientist, CTO, AI Scientist, AI Security Engineer, MLOps Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.