Improving Composer through real-time RL
Summary
Cursor has implemented a "real-time RL" approach to continuously improve its coding model, Composer, using live inference tokens from user interactions as training signals. This method addresses the train-test mismatch inherent in simulated coding environments, particularly the difficulty of accurately modeling user behavior. By collecting billions of tokens from user interactions, distilling them into reward signals, and rapidly deploying updated checkpoints, Cursor can ship an improved version of Composer as often as every five hours. This rapid iteration keeps the training data on-policy and has led to measurable improvements: a 2.28% increase in agent edits persisting in the codebase, a 3.13% reduction in dissatisfied follow-ups, and a 10.3% decrease in latency. The system also mitigates reward hacking by treating each instance as a "bug report" against the training system and fixing the underlying flaw.
Key takeaway
For AI Scientists and Research Scientists developing coding assistants, adopting a real-time RL framework can significantly accelerate model improvement and reduce simulation-induced train-test mismatch. Your team should prioritize robust client-side instrumentation and rapid deployment pipelines to leverage live user data effectively, enabling continuous, on-policy model refinement and faster adaptation to real-world user needs.
Key insights
Real-time Reinforcement Learning (RL) using live user inference tokens effectively improves coding models and mitigates train-test mismatch.
Principles
- Real user feedback removes the need to simulate user behavior, a major source of train-test mismatch.
- Rapid iteration keeps training data on-policy.
- Reward hacking reveals system vulnerabilities.
Method
Collect user interaction tokens, distill into reward signals, calculate model weight adjustments, run against eval suites (e.g., CursorBench) for regressions, and deploy updated checkpoints within hours.
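The method above can be sketched as a single training cycle with a regression gate in front of deployment. This is a minimal illustration under stated assumptions, not Cursor's implementation: the `Interaction` schema, the toy `reward` function, and the `update` and `evaluate` callables are all hypothetical names introduced here, and the eval scores are assumed to be "higher is better".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Interaction:
    """One logged user interaction (hypothetical schema, for illustration)."""
    edit_persisted: bool          # did the agent's edit stay in the codebase?
    dissatisfied_followup: bool   # did the user send an unhappy follow-up?


def reward(x: Interaction) -> float:
    """Toy reward: +1 for a persisted edit, -1 for a dissatisfied follow-up."""
    return float(x.edit_persisted) - float(x.dissatisfied_followup)


def passes_eval_gate(
    candidate: Dict[str, float],
    baseline: Dict[str, float],
    tolerance: float = 0.0,
) -> bool:
    """True if no eval suite regresses by more than `tolerance`."""
    return all(candidate[name] >= baseline[name] - tolerance for name in baseline)


def training_cycle(
    interactions: List[Interaction],
    update: Callable[[List[Interaction], List[float]], str],
    evaluate: Callable[[str], Dict[str, float]],
    baseline_scores: Dict[str, float],
) -> Tuple[Optional[str], Dict[str, float]]:
    """Run one collect -> reward -> update -> eval -> deploy-or-reject cycle.

    `update` and `evaluate` are placeholders for the real training step and
    eval harness. Returns (checkpoint, scores) when the candidate passes the
    gate, or (None, baseline_scores) when it is rejected.
    """
    rewards = [reward(x) for x in interactions]
    candidate = update(interactions, rewards)
    scores = evaluate(candidate)
    if passes_eval_gate(scores, baseline_scores):
        return candidate, scores
    return None, baseline_scores
```

Keeping the gate as a separate pure function makes it easy to tighten or loosen the regression tolerance without touching the training step.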
In practice
- Instrument client-side for user interaction signals.
- Implement fast deployment paths for model updates.
- Monitor for reward hacking to refine reward functions.
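One simple way to approach the monitoring item above is to compare the training reward against an independent eval score across successive checkpoints and flag any checkpoint where the reward climbs while the eval score does not. The heuristic below is an assumption made for illustration, not a technique described in the original article; `flag_possible_reward_hacking` and its `min_reward_gain` threshold are hypothetical.

```python
from typing import List, Sequence


def flag_possible_reward_hacking(
    reward_history: Sequence[float],
    eval_history: Sequence[float],
    min_reward_gain: float = 0.05,
) -> List[int]:
    """Return indices of checkpoints whose reward rose but eval score did not.

    Both sequences hold one value per checkpoint, in deployment order.
    A flagged index is a prompt for human review, not proof of hacking.
    """
    if len(reward_history) != len(eval_history):
        raise ValueError("histories must have one entry per checkpoint")

    flagged = []
    for i in range(1, len(reward_history)):
        reward_gain = reward_history[i] - reward_history[i - 1]
        eval_gain = eval_history[i] - eval_history[i - 1]
        # Reward improving without a matching eval improvement suggests the
        # model may be exploiting the reward signal rather than getting better.
        if reward_gain >= min_reward_gain and eval_gain <= 0:
            flagged.append(i)
    return flagged
```

For example, calling it with rewards `[0.1, 0.2, 0.4]` and eval scores `[0.5, 0.55, 0.5]` flags only the last checkpoint, where the reward jumped while the eval score fell.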
Topics
- Real-time RL
- Coding Models
- Composer
- Train-Test Mismatch
- Reward Hacking
Best for: AI Scientist, Research Scientist, Machine Learning Engineer, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Cursor Blog.