Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post

2026-02-22 · Source: The Cognitive Revolution · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Advanced, extended

Summary

Olive Song, a Senior Researcher at MiniMax, details the training methodologies for their M-series open-weight models, including the M2 and upcoming M2.2 and M2.5. The discussion, a crossover from her AI Engineer Conference talk and an interview with Turing Post's Inference podcast, highlights the use of reinforcement learning, tight product feedback loops, and systematic environment perturbations. MiniMax's strategy involves developing both foundation models and user-facing applications in-house, fostering direct feedback from expert developers. Key aspects covered include "interleaved thinking" for long-horizon agentic tasks, combating reward hacking, the decision to train RL models at FP32 precision, and debugging real-world LLM failures. The M2 model, with 10 billion active parameters, is optimized for coding and workplace agentic tasks, demonstrating robust generalization and multi-agent scalability.

Key takeaway

AI scientists and research scientists building robust, agentic models should consider MiniMax's integrated approach: keep feedback loops with in-house developers tight, and systematically perturb training environments to improve generalization. It is also worth investigating FP32 precision in reinforcement learning, since it can reduce subtle numerical inaccuracies during training and thereby improve model stability and performance in complex, real-world applications.

Key insights

MiniMax trains open-weight models using RL, expert feedback, and environment perturbations for robust, agentic performance.

Principles

Integrate product feedback directly into model development.
Perturb training environments to encourage robust generalization.
Align models with human expectations to prevent unsafe behaviors.
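
As a rough illustration of the second principle, the sketch below generates varied copies of a training environment's configuration. The summary does not say which properties MiniMax perturbs, so every name here (EnvConfig, PROMPT_VARIANTS, OUTPUT_FORMATS, perturb) and the choice of perturbed fields are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EnvConfig:
    """Hypothetical description of one agent training environment."""
    system_prompt: str
    tool_names: tuple
    output_format: str


# Assumed perturbation axes; the source does not list the real ones.
PROMPT_VARIANTS = (
    "You are a coding assistant.",
    "Act as a careful software engineer.",
    "Help the user finish the task using the tools provided.",
)
OUTPUT_FORMATS = ("json", "xml", "plain")


def perturb(cfg: EnvConfig, rng: random.Random) -> EnvConfig:
    """Return a varied copy of cfg; the task and the set of tools stay the same."""
    tools = list(cfg.tool_names)
    rng.shuffle(tools)  # change only the order in which tools are presented
    return replace(
        cfg,
        system_prompt=rng.choice(PROMPT_VARIANTS),
        tool_names=tuple(tools),
        output_format=rng.choice(OUTPUT_FORMATS),
    )


base = EnvConfig("You are a coding assistant.", ("read_file", "run_tests", "search"), "json")
variants = [perturb(base, random.Random(seed)) for seed in range(4)]
for variant in variants:
    print(variant)
```

Seeding each variant makes a failing configuration reproducible, which helps when a model turns out to depend on one particular prompt or tool ordering.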

Method

MiniMax employs interleaved thinking, where models iteratively think, act, and receive environmental feedback, enabling adaptation to dynamic, noisy real-world conditions for long-horizon tasks.
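
The think, act, observe cycle described above can be sketched with a toy example. This is not MiniMax's implementation: GuessEnv, think, and run_episode are hypothetical names, a number-guessing game stands in for a real environment, and a hand-written rule stands in for the model's reasoning. The point is only that the plan is revised after every piece of feedback rather than fixed up front.

```python
from dataclasses import dataclass


@dataclass
class GuessEnv:
    """Toy environment (hypothetical): the agent must find a secret integer."""
    secret: int

    def step(self, guess: int) -> str:
        """Apply an action and return textual feedback."""
        if guess < self.secret:
            return "higher"
        if guess > self.secret:
            return "lower"
        return "correct"


def think(low, high, last_guess, observation):
    """Revise the search bounds from the latest feedback (stand-in for model reasoning)."""
    if observation == "higher":
        low = last_guess + 1
    elif observation == "lower":
        high = last_guess - 1
    return low, high


def run_episode(env, low=1, high=100, max_steps=20):
    """Interleave thinking, acting, and observing until the task is solved."""
    trace = []
    guess, observation = None, None
    for _ in range(max_steps):
        low, high = think(low, high, guess, observation)  # think
        guess = (low + high) // 2                         # act
        observation = env.step(guess)                     # observe
        trace.append((guess, observation))
        if observation == "correct":
            break
    return trace


for guess, feedback in run_episode(GuessEnv(secret=37)):
    print(guess, feedback)
```

In a real agent the think step would be a model call conditioned on the full history of actions and observations, and the feedback could be noisy or incomplete.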

In practice

Use FP32 precision for RL training to narrow the gap between the theoretical algorithm and its implementation.
Develop internal AI agents to track and summarize new research and industry developments.
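
To make the precision point concrete, the sketch below shows how rounding logits to a lower-precision format changes the probability a policy assigns to a token. It is an assumption-laden illustration, not MiniMax's setup: the logits are made up, bfloat16 is emulated by truncating the FP32 bit pattern (real hardware typically rounds to nearest), and the helper names are hypothetical. In policy-gradient methods, a probability ratio that should equal 1 but does not is one way a gap between theory and implementation can appear.

```python
import math
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def log_prob(logits, index):
    """Log-softmax probability of logits[index], computed stably in FP64."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    return logits[index] - log_z


def probability_ratio(logits, index, cast):
    """Ratio of the token probability after rounding to the FP64 reference."""
    reference = log_prob(logits, index)
    rounded = log_prob([cast(v) for v in logits], index)
    return math.exp(rounded - reference)


# Made-up logits for a four-token vocabulary.
LOGITS = [12.3456789, 11.9876543, 3.14159265, -2.71828183]

ratio_fp32 = probability_ratio(LOGITS, 0, to_fp32)
ratio_bf16 = probability_ratio(LOGITS, 0, to_bf16)
print(f"FP32 ratio deviation from 1: {abs(ratio_fp32 - 1):.2e}")
print(f"BF16 ratio deviation from 1: {abs(ratio_bf16 - 1):.2e}")
```

The deviation at the emulated bfloat16 precision is several orders of magnitude larger than at FP32. Across many tokens and updates, errors of that size can plausibly add up, which is consistent with the recommendation above.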

Topics

Reinforcement Learning
MiniMax M-series
AI Agents
Open-weight Models
Model Evaluation

Best for: AI Scientist, Research Scientist, AI Engineer, AI Researcher, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Cognitive Revolution.