Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Cybersecurity & Data Privacy · Depth: Advanced, quick

Summary

The MOSAIC framework is a post-training method designed to align agentic language models for safe multi-step tool use, addressing limitations of existing alignment techniques in sequential decision-making and adversarial tool feedback. Agentic models, unlike chat models, can execute long-horizon actions, making safety critical due to potential irreversible harm from missteps like file access or credential entry. MOSAIC structures inference as a "plan, check, then act or refuse" loop, integrating explicit safety reasoning and refusal as primary actions. It utilizes preference-based reinforcement learning with pairwise trajectory comparisons for training, bypassing the need for trajectory-level labels and capturing subtle safety distinctions. Evaluated zero-shot across Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4 model families, MOSAIC reduced harmful behavior by up to 50%, increased harmful-task refusal by over 20% on injection attacks, decreased privacy leakage, and maintained or improved benign task performance across various out-of-distribution benchmarks.

Key takeaway

For engineering teams developing agentic language models that interact with tools, MOSAIC offers a robust post-training framework to enhance safety. Your models can achieve up to 50% reduction in harmful behavior and over 20% increased refusal on injection attacks by implementing explicit safety reasoning and refusal mechanisms. Consider adopting a "plan, check, then act or refuse" loop to mitigate risks associated with multi-step tool use and improve overall agent reliability.

Key insights

MOSAIC aligns agentic models for safe multi-step tool use by making safety decisions explicit and learnable.

Principles

Method

MOSAIC structures inference as a "plan, check, then act or refuse" loop, trained via preference-based reinforcement learning using pairwise trajectory comparisons to capture safety distinctions.

In practice

Topics

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, AI Security Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.