Training Deliberative Monitors for Black-Box Scheming Detection

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

Action-only deliberative monitors are smaller open-weight models trained to detect scheming and sabotage from agentic trajectories, without access to the monitored agent's reasoning or internal model states. The method, inspired by deliberative alignment, uses a scheming specification to elicit structured rationales from a frontier teacher, filters them with a separate judge, and distills the highest-quality rationales into open-weight monitors through supervised fine-tuning and reinforcement learning. Across six out-of-distribution agentic misalignment benchmarks, the monitor trained from Qwen3.5-27B outperforms low-cost frontier models such as Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5, and also outperforms Gemini 2.5 Pro, at lower marginal inference cost. Stronger frontier models (e.g., Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, Claude Opus 4.6) still achieve higher performance, but at roughly 16 to 34 times higher marginal inference cost. Several trained monitors sit on the empirical cost-performance Pareto frontier, making them practical options that combine low cost with a low false-positive rate (FPR).
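To make the Pareto-frontier claim concrete, here is a minimal sketch of how such a frontier can be computed from (cost, performance) pairs. The `Monitor` type, the `pareto_frontier` function, and the numbers in the usage note are illustrative assumptions, not values or code from the paper.

```python
from typing import List, NamedTuple


class Monitor(NamedTuple):
    name: str
    cost: float   # marginal inference cost, arbitrary units (lower is better)
    score: float  # detection performance (higher is better)


def pareto_frontier(monitors: List[Monitor]) -> List[Monitor]:
    """Keep monitors that no cheaper-or-equal-cost monitor matches or beats."""
    frontier: List[Monitor] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, best score first.
    for m in sorted(monitors, key=lambda m: (m.cost, -m.score)):
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier
```

With made-up inputs such as `Monitor("a", 1.0, 0.70)`, `Monitor("b", 2.0, 0.65)`, `Monitor("c", 3.0, 0.80)`, and `Monitor("d", 20.0, 0.90)`, monitor `b` drops out because `a` is both cheaper and better, while `d` stays on the frontier despite its cost because nothing cheaper reaches its score.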

Key takeaway

For AI security engineers evaluating the safety of agentic systems or deploying cost-effective monitoring, action-only deliberative monitors are a viable, low-cost alternative to expensive frontier models for black-box scheming detection. Consider training an open-weight model such as Qwen3.5-27B with this method to get strong detection performance at lower marginal inference cost. The approach was evaluated on out-of-distribution misalignment benchmarks, targets low false-positive rates, and needs no access to the monitored model's internals.

Key insights

Agent scheming can be detected without internal access: distilling filtered teacher rationales into open-weight models yields monitors that are both cost-effective and high-performing.

Principles

Scheming detection is a central AI control problem.
Action-only monitors offer cost-effective black-box detection.
A pipeline inspired by deliberative alignment can distill teacher rationales into smaller monitors.

Method

Elicit structured rationales from a frontier teacher using a scheming specification, filter with a judge, then distill into open-weight monitors via supervised fine-tuning and reinforcement learning.
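The elicit-and-filter stage described above can be sketched as follows. This is a hedged illustration only: the `Teacher` and `Judge` callables, the `Rationale` record, the 0-to-1 quality scale, and the `min_quality` cutoff are all assumptions introduced here, and the supervised fine-tuning and reinforcement learning stages that follow are not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rationale:
    trajectory: str  # agent actions only; no chain of thought or internal states
    text: str        # teacher's structured rationale
    verdict: bool    # teacher's conclusion: True means "scheming"
    quality: float   # judge's score (hypothetical 0..1 scale)


# Hypothetical interfaces; in practice these would wrap model calls.
Teacher = Callable[[str, str], Tuple[str, bool]]  # (spec, trajectory) -> (rationale, verdict)
Judge = Callable[[str, str, str], float]          # (spec, trajectory, rationale) -> quality


def build_distillation_set(
    trajectories: List[str],
    teacher: Teacher,
    judge: Judge,
    spec: str,
    min_quality: float = 0.8,  # assumed cutoff, not taken from the paper
) -> List[Rationale]:
    """Elicit a rationale per trajectory and keep those the judge rates highly."""
    kept: List[Rationale] = []
    for traj in trajectories:
        text, verdict = teacher(spec, traj)
        quality = judge(spec, traj, text)
        if quality >= min_quality:
            kept.append(Rationale(traj, text, verdict, quality))
    # Highest-quality rationales first, ready to serialize as fine-tuning targets.
    kept.sort(key=lambda r: r.quality, reverse=True)
    return kept
```

Keeping the judge separate from the teacher, as the summary describes, means a rationale is scored by a different model than the one that wrote it; the sketch mirrors that by taking the two as independent callables.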

In practice

Train Qwen3.5-27B for black-box scheming detection.
Use distilled rationales from frontier models.
Deploy low-FPR monitors on a budget.
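One common way to operate a monitor at a low false-positive rate is to calibrate its alert threshold on trajectories known to be benign. The sketch below assumes the monitor emits a numeric suspicion score per trajectory; the function names and that scoring interface are assumptions for illustration, not the paper's procedure.

```python
import math
from typing import List


def threshold_at_fpr(benign_scores: List[float], target_fpr: float) -> float:
    """Return a threshold so that at most target_fpr of benign scores exceed it."""
    if not benign_scores:
        raise ValueError("need at least one benign score to calibrate")
    ordered = sorted(benign_scores)
    n = len(ordered)
    # Number of benign trajectories we can afford to flag, capped at n - 1.
    allowed = min(math.floor(target_fpr * n), n - 1)
    return ordered[n - allowed - 1]


def is_flagged(score: float, threshold: float) -> bool:
    """Alert only when the score is strictly above the calibrated threshold."""
    return score > threshold
```

Rounding the allowed count down keeps the calibration conservative: the realized false-positive rate on the calibration set never exceeds the target, though it can fall below it when scores tie or the set is small.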

Topics

Black-Box Scheming Detection
Deliberative Monitors
Agentic Misalignment
Supervised Fine-Tuning
Reinforcement Learning
Qwen3.5-27B

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.