Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

2026-05-20 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, medium

Summary

Stochastic MeanFlow Policies (SMFP) are a one-step generative policy class for online off-policy reinforcement learning. They target a gap between two existing options: Gaussian policies sample quickly but struggle to represent multimodal action distributions, while existing generative policies are expressive but often require iterative sampling or lack tractable entropy estimates. SMFP maps Gaussian noise to actions through a MeanFlow transformation, which yields a tractable entropy surrogate. That surrogate lets the policy be trained with off-policy mirror descent under a single unified objective that promotes both exploration and stable policy improvement. Across seven MuJoCo benchmarks, SMFP outperformed both Gaussian and other generative baselines while keeping single-step inference.

Key takeaway

If you are a machine learning engineer working on online off-policy reinforcement learning, consider Stochastic MeanFlow Policies (SMFP). This one-step generative policy class handles multimodal action distributions and balances exploration with stable policy improvement. It keeps single-step inference and is reported to outperform Gaussian and other generative baselines on seven MuJoCo continuous control benchmarks, so a more expressive policy need not cost inference speed.

Key insights

Stochastic MeanFlow Policies enable expressive, stable, and efficient generative control in off-policy reinforcement learning.

Principles

Gaussian policies offer speed but lack multimodal expressiveness.
Generative policies provide expressiveness but often require iterative sampling or lack tractable entropy estimates.
Entropy regularization stabilizes policy improvement and exploration.
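
To make the entropy-regularization principle concrete, here is a minimal sketch of one entropy-regularized mirror descent step on a toy discrete action set. It is not the SMFP training objective, which this summary does not spell out. The function names, the step size eta, the entropy weight tau, and the Q-values are all illustrative assumptions. The step maximizes <pi, Q> + tau * H(pi) - (1/eta) * KL(pi || pi_old), which has a closed-form solution.

```python
import numpy as np

def mirror_descent_step(pi_old, q_values, eta=1.0, tau=0.1):
    """One entropy-regularized mirror descent step (toy, discrete actions).

    Closed-form maximizer of
        <pi, Q> + tau * H(pi) - (1 / eta) * KL(pi || pi_old).
    eta (step size) and tau (entropy weight) are hypothetical settings.
    pi_old must be strictly positive so that its log is finite.
    """
    inv_eta = 1.0 / eta
    logits = (q_values + inv_eta * np.log(pi_old)) / (tau + inv_eta)
    logits = logits - logits.max()  # shift for numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum()

def entropy(pi):
    """Shannon entropy of a strictly positive discrete distribution."""
    return float(-(pi * np.log(pi)).sum())

# Toy example: four actions, two of them nearly equally good.
pi = np.full(4, 0.25)
q = np.array([1.0, 0.9, 0.0, -1.0])
for _ in range(5):
    pi = mirror_descent_step(pi, q)

print("policy :", np.round(pi, 3))
print("entropy:", round(entropy(pi), 3))
```

The KL term keeps each update close to the previous policy (stability), while the entropy term keeps probability mass on the near-optimal second action instead of collapsing onto a single mode (exploration). Setting tau to 0 recovers the plain exponentiated update pi_new ∝ pi_old * exp(eta * Q).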

Method

Stochastic MeanFlow Policies (SMFP) map Gaussian noise to actions via a MeanFlow transformation. They are trained within off-policy mirror descent using a unified objective, leveraging a tractable entropy surrogate.
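
The noise-to-action mapping can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the one-step sampling rule from the original MeanFlow formulation, z_r = z_t - (t - r) * u(z_t, r, t) evaluated at r = 0 and t = 1, conditioned on the state. The tiny untrained network, its sizes, the helper names, and the tanh squashing to a bounded action range are all hypothetical choices made for this example.

```python
import numpy as np

def init_mean_velocity_net(state_dim, action_dim, hidden=32, seed=0):
    """Random weights for a small MLP u(state, z, r, t). Hypothetical architecture."""
    rng = np.random.default_rng(seed)
    in_dim = state_dim + action_dim + 2  # state, noisy action z, and times (r, t)
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, action_dim)),
        "b2": np.zeros(action_dim),
    }

def mean_velocity(params, state, z, r, t):
    """Average velocity u(z_t, r, t) predicted by the MLP for a batch."""
    batch = z.shape[0]
    times = np.tile([r, t], (batch, 1))  # shape (batch, 2)
    x = np.concatenate([state, z, times], axis=1)
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def sample_action(params, state, rng):
    """Draw one action per state with a single network evaluation."""
    action_dim = params["b2"].shape[0]
    eps = rng.standard_normal((state.shape[0], action_dim))  # z_1 ~ N(0, I)
    # One-step MeanFlow displacement from t = 1 (noise) to r = 0 (action).
    raw = eps - mean_velocity(params, state, eps, r=0.0, t=1.0)
    return np.tanh(raw)  # assumption: squash into [-1, 1] for bounded actions

# Usage: five states of dimension 4, actions of dimension 2.
params = init_mean_velocity_net(state_dim=4, action_dim=2)
rng = np.random.default_rng(1)
states = np.zeros((5, 4))
actions = sample_action(params, states, rng)
print(actions.shape)
```

Because the action is a deterministic function of the state and a fresh noise draw, the policy stays stochastic while inference costs one forward pass, in contrast to iterative samplers that call the network many times per action.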

In practice

Apply SMFP to continuous control tasks.
Enable multimodal action generation in RL.
Balance exploration and stability in policy updates.

Topics

Reinforcement Learning
Generative Policies
Entropic Mirror Descent
Stochastic MeanFlow Policies
MuJoCo Benchmarks
Multimodal Control

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.