Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

A new analysis of human-in-the-loop approval gates for LLM agents challenges two common assumptions: that "risky" has a single agreed meaning, and that human reviewer capacity is effectively unlimited. On a hand-labeled set of 125 adversarially weighted agent actions, human reviewers reached only moderate agreement on which actions were risky (Fleiss' kappa = 0.52), so there is no single ground truth to calibrate against. The study frames agent oversight as a selective classification problem with asymmetric costs and identifies measurable operating limits beyond which a guard cannot safely auto-decide on complex inputs. When reviewers are modeled as fatiguing under increasing escalation load, realized safety follows an inverted-U curve: past a certain point, escalating more actions lowers safety instead of raising it. The safety-optimal guard therefore escalates below full reviewer capacity, a strategy that also holds up against flooding attacks designed to exploit fatigued reviewers. The work introduces an open-source agent-oversight system to operationalize and measure these dynamics, turning guard evaluation from guesswork into a measurable curve.
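To make the "selective classification with asymmetric costs" framing concrete, here is a minimal sketch of an expected-cost decision rule for a guard that can allow, block, or escalate. It is an illustration under stated assumptions, not the study's method: the `Costs` fields, the `decide` function, and the numeric costs are all hypothetical, and the human reviewer is treated as a fixed-cost, always-correct fallback (an assumption the fatigue result above argues against).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Hypothetical per-decision costs; the values are illustrative only."""
    false_allow: float  # cost of auto-allowing an action that was risky
    false_block: float  # cost of auto-blocking an action that was benign
    escalate: float     # cost of sending one action to a human reviewer


def decide(p_risky: float, costs: Costs) -> str:
    """Return the option with the lowest expected cost.

    p_risky is the guard's estimated probability that the action is risky.
    Assumption: an escalated action is always resolved correctly, so its
    only cost is the review itself.
    """
    expected = {
        "allow": p_risky * costs.false_allow,
        "block": (1.0 - p_risky) * costs.false_block,
        "escalate": costs.escalate,
    }
    return min(expected, key=expected.get)


# Asymmetric costs: a missed risky action is far worse than a wrongly blocked one.
costs = Costs(false_allow=100.0, false_block=5.0, escalate=1.0)
for p in (0.005, 0.05, 0.5, 0.9):
    print(p, decide(p, costs))
```

With these illustrative costs the guard escalates whenever the risk estimate lies between 0.01 and 0.8, and auto-decides only outside that band. Raising the cost of a review narrows the band, which is one simple way an oversight budget can enter the decision rule.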

Key takeaway

MLOps engineers designing human-in-the-loop safety gates for LLM agents must account for reviewer fatigue and for the subjectivity of risk judgments. Make escalation policies load-aware, which can mean escalating below full reviewer capacity so the system stays off the falling side of the inverted-U safety curve, where more oversight reduces overall safety. Add mechanisms that resist flooding attacks aimed at fatigued reviewers, and measure guard effectiveness against realistic human limits.

Key insights

Human oversight capacity is finite and fatigues, making agent safety a resource-allocation problem.

Principles

"Risky" action judgment is subjective (Fleiss' kappa = 0.52).
Excessive human oversight can decrease system safety.
Agent oversight is a problem of allocating scarce human attention.
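For readers who want to check an agreement figure like the one above on their own labels, the standard Fleiss' kappa formula can be sketched as follows. The function name `fleiss_kappa` and the toy rating table are hypothetical; the table is not the study's data and is not meant to reproduce the 0.52 value.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy table: 4 actions, 3 reviewers, columns are [risky, not risky].
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(toy), 3))
```

A value near 1 means reviewers almost always agree, and a value near 0 means agreement is no better than chance. A moderate value signals that a guard trained on any single reviewer's labels inherits that reviewer's subjective threshold.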

Method

The proposed open-source agent-oversight system operationalizes and measures human fatigue and escalation dynamics, framing guard evaluation as a quantifiable curve rather than a guess.
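The idea of evaluating a guard as a curve can be illustrated with a toy sweep over escalation rates. Everything in this sketch is an assumption made for illustration: the functions `guard_recall` and `reviewer_catch_rate`, their functional forms, and all constants are hypothetical and are not taken from the study or its open-source system. The sketch also assumes that any risky action the guard does not escalate slips through.

```python
def guard_recall(escalation_rate: float, sharpness: float = 5.0) -> float:
    """Share of risky actions that end up in the escalated set.

    Hypothetical concave form: escalating more catches more risky
    actions, with diminishing returns.
    """
    return 1.0 - (1.0 - escalation_rate) ** sharpness


def reviewer_catch_rate(load: float, capacity: float,
                        rested_rate: float = 0.95) -> float:
    """Probability that a reviewer stops an escalated risky action.

    Hypothetical fatigue form: accuracy falls smoothly as load grows
    and is halved when load equals capacity.
    """
    return rested_rate / (1.0 + (load / capacity) ** 2)


def realized_safety(escalation_rate: float, n_actions: int,
                    capacity: float) -> float:
    """Share of risky actions stopped: escalated AND caught by the reviewer."""
    load = escalation_rate * n_actions
    return guard_recall(escalation_rate) * reviewer_catch_rate(load, capacity)


def sweep(n_actions: int = 200, capacity: float = 100.0, steps: int = 20):
    """Evaluate realized safety on an evenly spaced grid of escalation rates."""
    return [
        (i / steps, realized_safety(i / steps, n_actions, capacity))
        for i in range(steps + 1)
    ]


curve = sweep()
best_rate, best_safety = max(curve, key=lambda point: point[1])
print(f"best escalation rate: {best_rate:.2f}, reviews: {best_rate * 200:.0f}")
```

Under these toy assumptions the curve rises, peaks at an interior escalation rate, and then falls, and the peak load sits below the nominal capacity of 100 reviews. The shape depends entirely on the assumed fatigue function, which is why measuring it on real reviewers matters more than any particular formula.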

In practice

Calibrate agent guards to human fatigue limits.
Implement load-aware escalation policies.
Guard against reviewer flooding attacks.
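One way the three practices above could fit together is a guard that caps how many actions it escalates per time window and fails closed once that budget is spent. This is a minimal sketch, not the study's system: the class `LoadAwareEscalator`, its thresholds, and the block-on-overflow choice are all hypothetical design assumptions.

```python
from collections import deque


class LoadAwareEscalator:
    """Routes actions to allow / block / escalate under a review budget.

    Hypothetical policy: ambiguous actions go to a human only while the
    number of escalations in the sliding window is under budget. Overflow
    is blocked (fail closed), so a flood of ambiguous actions cannot push
    extra work onto an already loaded reviewer.
    """

    def __init__(self, max_reviews: int, window_seconds: float,
                 allow_below: float = 0.1, block_above: float = 0.9):
        self.max_reviews = max_reviews
        self.window_seconds = window_seconds
        self.allow_below = allow_below
        self.block_above = block_above
        self._sent = deque()  # timestamps of escalations still in the window

    def route(self, risk_score: float, now: float) -> str:
        # Forget escalations that have aged out of the sliding window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()

        if risk_score < self.allow_below:
            return "allow"
        if risk_score >= self.block_above:
            return "block"
        if len(self._sent) < self.max_reviews:
            self._sent.append(now)
            return "escalate"
        return "block"  # budget spent: fail closed instead of overloading


guard = LoadAwareEscalator(max_reviews=2, window_seconds=60.0)
print([guard.route(0.5, now=t) for t in (0.0, 1.0, 2.0, 61.0)])
```

Setting `max_reviews` below what reviewers could nominally handle reflects the finding that the safety-optimal guard escalates below full capacity. Blocking on overflow trades availability for safety; a deployment that cannot tolerate blocked benign actions would need a different overflow rule, such as queuing with a deadline.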

Topics

LLM Agent Safety
Human-in-the-Loop
Reviewer Fatigue
Agent Oversight Systems
Risk Assessment
Flooding Attacks

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.