How Kimi, Cursor, and Chroma Train Agentic Models with RL
Summary
Three recent technical reports from Moonshot AI (Kimi K2.5), Cursor (Composer 2), and Chroma (Context-1) describe distinct approaches to training agentic models with reinforcement learning (RL). Kimi K2.5 introduces "Agent Swarm": a multimodal mixture-of-experts (MoE) model with 1T total and 32B active parameters learns through RL to decompose tasks into parallel sub-agents, which cuts inference latency by up to 4.5x and improves accuracy on benchmarks such as BrowseComp (78.4% vs. 60.6% for a single agent). Cursor's Composer 2, an agentic software engineering model, uses self-summarization to handle long coding sessions and real-time RL on production traffic, which yields multiple improved checkpoints per day. Chroma's Context-1, a 20B-parameter search model, focuses on "self-editing context": it learns to prune irrelevant documents to free up context space for further search, reaching retrieval performance competitive with frontier LLMs at lower cost. All three share a common RL methodology: start from a strong base model, train in production-like environments, use outcome-based rewards, and run asynchronous rollouts at large scale.
Key takeaway
For AI architects and research scientists designing agentic systems, these reports show that specialized RL training, combined with robust context management and production-aligned evaluation, is crucial for high performance and efficiency. Consider techniques such as parallel agent orchestration, self-summarization, or self-editing context to overcome the limits of sequential processing and fixed context windows. Focus on iterating on reward design and on building internal benchmarks that reflect real-world usage, both to avoid reward hacking and to ensure practical utility.
Key insights
Across all three reports, gains in performance and efficiency come from RL training specialized to the agentic task, combined with explicit context management.
Principles
- Train where you deploy.
- Context management is a first-class problem.
- Reward design is iterative.
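To make the context-management principle concrete, here is a minimal sketch of self-editing context under stated assumptions. Context-1 is described as learning through RL which documents to prune; the sketch below replaces that learned judgment with a fixed relevance threshold. The `Doc` class, the `relevance` scores, the whitespace token count, and `prune_context` are all hypothetical names introduced for illustration, not Chroma's implementation.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    text: str
    relevance: float  # hypothetical score; a trained model would learn this judgment


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting as a crude stand-in for a real tokenizer.
    return len(text.split())


def prune_context(docs: list[Doc], budget: int, min_relevance: float = 0.5) -> list[Doc]:
    """Drop low-relevance documents, then keep the most relevant ones that fit the budget."""
    candidates = [d for d in docs if d.relevance >= min_relevance]
    kept: list[Doc] = []
    used = 0
    for doc in sorted(candidates, key=lambda d: d.relevance, reverse=True):
        cost = count_tokens(doc.text)
        if used + cost <= budget:
            kept.append(doc)
            used += cost
    return kept


context = [
    Doc("a", "alpha beta gamma delta", 0.9),
    Doc("b", "noise noise noise noise noise noise", 0.1),
    Doc("c", "epsilon zeta", 0.7),
]
pruned = prune_context(context, budget=6)
print([d.doc_id for d in pruned])  # ['a', 'c']
```

Document "b" is dropped for low relevance, and the space it would have occupied stays available for results from the next search step.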
Method
The shared RL recipe has four parts: start from a strong base model, run rollouts in production-like environments, score them with outcome-based rewards, and scale rollouts asynchronously.
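The rollout and reward steps of this recipe can be sketched as follows. This is a toy illustration, not any of the three teams' training code: `run_rollout` simulates an episode with a seeded coin flip instead of calling a policy and tools, and the task format, `outcome_reward`, and `collect` are hypothetical. It shows only concurrent rollout collection with a reward computed from the final answer. Decoupling rollout workers from the trainer, which full asynchronous RL requires, is out of scope here.

```python
import asyncio
import random


async def run_rollout(task: dict, seed: int) -> dict:
    """Simulate one agent episode; a real rollout would call the policy and its tools."""
    rng = random.Random(seed)
    await asyncio.sleep(0)  # yield control, standing in for model and tool latency
    answer = task["answer"] if rng.random() < 0.5 else "wrong"
    return {"task_id": task["id"], "final_answer": answer}


def outcome_reward(trajectory: dict, task: dict) -> float:
    """Outcome-based reward: score only the final result, not intermediate steps."""
    return 1.0 if trajectory["final_answer"] == task["answer"] else 0.0


async def collect(tasks: list[dict], rollouts_per_task: int) -> list[tuple[dict, float]]:
    """Run all rollouts concurrently and pair each trajectory with its reward."""
    pairs = [(task, seed) for task in tasks for seed in range(rollouts_per_task)]
    trajectories = await asyncio.gather(*(run_rollout(t, s) for t, s in pairs))
    return [(traj, outcome_reward(traj, task)) for (task, _), traj in zip(pairs, trajectories)]


tasks = [{"id": "t1", "answer": "42"}, {"id": "t2", "answer": "paris"}]
batch = asyncio.run(collect(tasks, rollouts_per_task=4))
print(len(batch), "trajectories collected")
```

The resulting `(trajectory, reward)` pairs are what a policy-gradient update would consume. The update itself is omitted.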
In practice
- Implement self-summarization for long agentic tasks.
- Utilize parallel sub-agents to reduce latency.
- Develop internal benchmarks from real user data.
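As an illustration of the first item, here is a minimal sketch of self-summarization for a long agent session. It assumes the history is a list of text turns and that a whitespace split approximates token count. The `summarize` function is a placeholder: in a system like the one described, the model itself would write the summary, and that behavior would be shaped by training. All names here are hypothetical.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting as a crude stand-in for a real tokenizer.
    return len(text.split())


def summarize(turns: list[str]) -> str:
    # Placeholder: a real agent would ask the model to write this summary.
    return f"[summary of {len(turns)} earlier turns]"


def compact_history(history: list[str], max_tokens: int, keep_recent: int = 2) -> list[str]:
    """Replace older turns with a summary once the history exceeds the token budget."""
    total = sum(count_tokens(turn) for turn in history)
    if total <= max_tokens or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent


history = [f"step {i} edited file" for i in range(5)]
compacted = compact_history(history, max_tokens=10)
print(compacted)
# ['[summary of 3 earlier turns]', 'step 3 edited file', 'step 4 edited file']
```

Keeping the most recent turns verbatim preserves the details the agent is most likely to need next, while the summary stands in for everything earlier.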
Topics
- Agentic Model Training
- Reinforcement Learning
- Agent Swarm Architecture
- Self-summarization
- Self-editing Context
Best for: AI Architect, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by philschmid.de - RSS feed.