Training Deliberative Monitors for Black-Box Scheming Detection
Summary
The paper proposes action-only deliberative monitors: smaller open-weight models trained to detect scheming and sabotage from agentic trajectories without access to the monitored agent's reasoning or internal model states. The method, inspired by deliberative alignment, uses a scheming specification to elicit structured rationales from a frontier teacher, filters them with a separate judge, and distills the highest-quality rationales into open-weight monitors using supervised fine-tuning and reinforcement learning. Evaluated across six out-of-distribution agentic misalignment benchmarks, Qwen3.5-27B trained with this method outperforms low-cost frontier models such as Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5, and also outperforms Gemini 2.5 Pro, at lower marginal inference cost. Stronger frontier models (e.g., Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, Claude Opus 4.6) achieve higher performance, but at roughly 16 to 34 times the marginal inference cost. Several trained monitors sit on the empirical cost-performance Pareto frontier, offering practical, low-cost, low-FPR alternatives.
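To make the Pareto-frontier claim concrete, here is a minimal sketch of how a cost-performance frontier can be computed. A monitor is on the frontier if no other monitor is at least as cheap and at least as accurate while being strictly better on one of the two. The function name, the monitor names, and every number below are illustrative placeholders, not figures from the paper.

```python
from typing import List, Tuple

# (name, marginal inference cost, detection performance); higher performance is better
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[str]:
    """Return the names of monitors that no other monitor dominates."""
    frontier = []
    for name, cost, perf in points:
        dominated = any(
            # the other point is no worse on both axes...
            other_cost <= cost and other_perf >= perf
            # ...and strictly better on at least one of them
            and (other_cost < cost or other_perf > perf)
            for _, other_cost, other_perf in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder values for illustration only (not results from the paper).
example_points = [
    ("monitor_a", 1.0, 0.70),
    ("monitor_b", 2.0, 0.65),  # costlier and weaker than monitor_a
    ("monitor_c", 5.0, 0.90),
    ("monitor_d", 5.0, 0.80),  # same cost as monitor_c, weaker
]
frontier_names = pareto_frontier(example_points)
```

In this toy data only monitor_a and monitor_c remain on the frontier; a cheap trained monitor can stay on it even when a far more expensive model scores higher.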
Key takeaway
For AI security engineers evaluating agentic system safety or deploying cost-effective monitoring, action-only deliberative monitors offer a viable, low-cost alternative to expensive frontier models for black-box scheming detection. Consider training an open-weight model such as Qwen3.5-27B with this method to get strong performance at lower marginal inference cost. The approach yields practical, low-FPR monitors that held up on out-of-distribution agentic misalignment benchmarks and need no access to the monitored model's internals.
Key insights
Agent scheming can be detected without internal access: distilling filtered teacher rationales into open-weight models yields cost-effective, high-performing monitors.
Principles
- Scheming detection is a central AI control problem.
- Action-only monitors offer cost-effective black-box detection.
- Deliberative alignment can distill teacher rationales.
Method
Elicit structured rationales from a frontier teacher using a scheming specification, filter with a judge, then distill into open-weight monitors via supervised fine-tuning and reinforcement learning.
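The elicit-and-filter stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `teacher` and `judge` are hypothetical callables standing in for the frontier teacher and the separate judge, the judge is assumed to return a quality score in [0, 1], and the `min_score` cutoff and the prompt/completion record layout are assumptions as well. The fine-tuning and reinforcement learning steps are out of scope here.

```python
from typing import Callable, Dict, List


def build_sft_examples(
    trajectories: List[str],
    spec: str,
    teacher: Callable[[str, str], str],   # hypothetical: (spec, trajectory) -> rationale
    judge: Callable[[str, str], float],   # hypothetical: (trajectory, rationale) -> score
    min_score: float = 0.8,               # assumed cutoff, not taken from the paper
) -> List[Dict[str, str]]:
    """Elicit one rationale per trajectory and keep those the judge rates highly."""
    examples = []
    for trajectory in trajectories:
        # Step 1: the teacher reads the scheming specification and the
        # action-only trajectory, then writes a structured rationale.
        rationale = teacher(spec, trajectory)
        # Step 2: a separate judge scores the rationale; low scores are dropped.
        if judge(trajectory, rationale) < min_score:
            continue
        # Step 3: surviving rationales become supervised fine-tuning targets.
        examples.append(
            {"prompt": f"{spec}\n\n{trajectory}", "completion": rationale}
        )
    return examples
```

Keeping the judge separate from the teacher means a rationale is not graded by the model that wrote it, which is the reason the method filters with a separate judge.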
In practice
- Train Qwen3.5-27B for black-box scheming detection.
- Use distilled rationales from frontier models.
- Deploy low-FPR monitors on a budget.
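One common way to deploy a monitor at a low false-positive rate is to calibrate its alert threshold on trajectories known to be benign. The sketch below assumes the monitor emits a numeric suspicion score per trajectory, which the summary does not specify; the function names are hypothetical.

```python
from typing import Sequence


def threshold_at_fpr(benign_scores: Sequence[float], target_fpr: float) -> float:
    """Pick a threshold so that at most target_fpr of benign scores exceed it."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must be in [0, 1]")
    ordered = sorted(benign_scores, reverse=True)
    # Number of benign trajectories we may flag, rounded down to stay conservative.
    allowed = int(target_fpr * len(ordered))
    if allowed >= len(ordered):
        return float("-inf")  # every trajectory may be flagged
    # Only scores strictly above this value are flagged, so at most
    # `allowed` benign trajectories raise an alert.
    return ordered[allowed]


def is_flagged(score: float, threshold: float) -> bool:
    """A trajectory is flagged when its suspicion score exceeds the threshold."""
    return score > threshold
```

Calibrating on held-out benign data keeps the alert budget predictable, though the realized FPR can drift when deployment trajectories differ from the calibration set.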
Topics
- Black-Box Scheming Detection
- Deliberative Monitors
- Agentic Misalignment
- Supervised Fine-Tuning
- Reinforcement Learning
- Qwen3.5-27B
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.