A decoupled diffusion planner that adapts to changing cost limits by using cost-conditioned generation for safety and reward gradients for performance

2026-05-04 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Safe Decoupled Guidance Diffusion (SDGD) is a diffusion-based planner for offline safe reinforcement learning that lets a policy adapt to safety budgets that change at deployment time. Unlike existing methods that treat reward and constraint satisfaction as competing objectives, SDGD reinterprets adaptive safe trajectory generation as sampling from a constrained distribution. It uses classifier-free guidance conditioned on the cost limit to bias sampling toward compliant trajectories, and reward-gradient guidance to improve performance among them. To keep reward guidance from inadvertently increasing cumulative cost, SDGD introduces Feasible Trajectory Relabeling (FTR), which reshapes reward targets. A first-order sampling-time analysis shows that FTR suppresses reward-induced cost drift. On the DSRL benchmark, SDGD satisfies the safety constraint on 36 of 38 tasks (94.7%), outperforming baselines, and achieves the highest reward among safe methods on 21 tasks.

Key takeaway

For research scientists developing safe reinforcement learning agents, SDGD offers a way to handle safety budgets that change at deployment time. Consider integrating cost-conditioned classifier-free guidance and Feasible Trajectory Relabeling into your diffusion-based planners. In the reported DSRL evaluation, this combination met the cost constraint on 36 of 38 tasks (94.7%) and achieved the highest reward among safe methods on 21 tasks, addressing a key challenge in adaptive policy deployment.

Key insights

SDGD uses cost-conditioned diffusion and reward gradients for adaptive, safe, and high-performing trajectory generation.

Principles

Cost limits define trajectory regions.
Reward shapes preferences within regions.
FTR suppresses reward-induced cost drift.

Method

SDGD conditions classifier-free guidance on the cost limit for safety, then applies reward-gradient guidance for performance, using Feasible Trajectory Relabeling (FTR) to keep the reward guidance from increasing cumulative cost.
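
The two guidance signals described above can be combined in a single noise estimate at each denoising step. The following is a minimal sketch using the standard classifier-free guidance and gradient-guidance formulas, not the paper's implementation. The function name `guided_noise`, the parameters `cfg_weight`, `reward_scale`, and `sigma_t`, and the toy arrays are all assumptions made for illustration; in a real planner the two noise predictions would come from a trained denoiser (with and without the cost-limit condition) and the gradient from a learned reward model.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, reward_grad, cfg_weight, reward_scale, sigma_t):
    """Combine cost-conditioned classifier-free guidance with a reward-gradient term.

    eps_cond    : noise prediction given the cost limit (hypothetical denoiser output)
    eps_uncond  : noise prediction with the cost condition dropped
    reward_grad : gradient of a reward estimate w.r.t. the noisy trajectory
    sigma_t     : noise level at the current step, sqrt(1 - alpha_bar_t)
    """
    # Classifier-free guidance: start from the unconditional prediction and
    # move toward (or past) the cost-conditioned one. cfg_weight = 1 recovers
    # the purely conditional prediction.
    eps_cfg = eps_uncond + cfg_weight * (eps_cond - eps_uncond)
    # Gradient guidance in noise-prediction form: subtracting sigma_t * grad
    # shifts the denoised sample in the direction of increasing reward.
    return eps_cfg - reward_scale * sigma_t * reward_grad

# Toy stand-ins for model outputs (assumed values, two-dimensional "trajectory").
eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])
reward_grad = np.array([0.5, -0.5])

eps = guided_noise(eps_cond, eps_uncond, reward_grad,
                   cfg_weight=2.0, reward_scale=1.0, sigma_t=0.2)
eps_safe_only = guided_noise(eps_cond, eps_uncond, reward_grad,
                             cfg_weight=1.0, reward_scale=0.0, sigma_t=0.2)
```

Setting `reward_scale` to zero leaves only the cost-conditioned term, which makes the decoupling explicit: the cost limit selects the region to sample from, and the reward gradient only adjusts samples on top of that.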

In practice

Apply cost-conditioned generation.
Use reward gradients for refinement.
Implement FTR to manage cost drift.
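
The summary says only that FTR reshapes reward targets; it does not give the rule. As a purely hypothetical illustration of what target relabeling could look like, the sketch below caps the return target of any trajectory whose cumulative cost exceeds the limit at the lowest return seen among feasible trajectories. The function `relabel_returns`, its `penalty` parameter, and the sample data are assumptions, and the actual FTR procedure may differ.

```python
import numpy as np

def relabel_returns(returns, costs, cost_limit, penalty=None):
    """Hypothetical relabeling rule (not the paper's FTR definition).

    Trajectories within the cost limit keep their return target.
    Trajectories over the limit have their target capped at `penalty`,
    which defaults to the lowest return among feasible trajectories.
    """
    returns = np.asarray(returns, dtype=float)
    costs = np.asarray(costs, dtype=float)
    feasible = costs <= cost_limit
    if penalty is None:
        # Fall back to the global minimum if no trajectory is feasible.
        penalty = returns[feasible].min() if feasible.any() else returns.min()
    # With the default penalty, no violating trajectory is assigned a higher
    # target than any compliant one, so a reward model fit on these targets
    # has less incentive to point toward over-budget regions.
    return np.where(feasible, returns, np.minimum(returns, penalty))

# Assumed toy dataset: four trajectories, cost limit of 20.
returns = [10.0, 50.0, 30.0, 5.0]
costs = [5.0, 40.0, 15.0, 60.0]
relabeled = relabel_returns(returns, costs, cost_limit=20.0)
```

In this toy data the second trajectory has the highest raw return but exceeds the budget, so its target drops to the lowest feasible return, while the two compliant trajectories are unchanged.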

Topics

Safe Reinforcement Learning
Diffusion Models
Trajectory Generation
Cost-Conditioned Guidance
Reward Gradients

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.