Improving Composer through real-time RL

2026-03-26 · Source: Cursor Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

Cursor uses a "real-time RL" approach to continuously improve its coding model, Composer, treating live inference tokens from user interactions as training signals. The method addresses the train-test mismatch inherent in simulated coding environments, where user behavior is especially hard to model accurately. By collecting billions of tokens from user interactions, distilling them into reward signals, and rapidly deploying updated checkpoints, Cursor can ship an improved version of Composer as often as every five hours. This rapid iteration keeps the training data on-policy and has produced measurable improvements: a 2.28% increase in agent edits persisting in the codebase, a 3.13% reduction in dissatisfied follow-ups, and a 10.3% decrease in latency. Cursor also mitigates reward hacking by treating each instance as a "bug report" used to refine the training system.

Key takeaway

For AI Scientists and Research Scientists developing coding assistants, adopting a real-time RL framework can significantly accelerate model improvement and reduce simulation-induced train-test mismatch. Your team should prioritize robust client-side instrumentation and rapid deployment pipelines to leverage live user data effectively, enabling continuous, on-policy model refinement and faster adaptation to real-world user needs.

Key insights

Real-time Reinforcement Learning (RL) using live user inference tokens effectively improves coding models and mitigates train-test mismatch.

Principles

Training on real user feedback avoids the modeling error of simulated users.
Rapid iteration keeps training data on-policy.
Reward hacking reveals system vulnerabilities.
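
To make the on-policy principle concrete, here is a minimal sketch of filtering collected samples down to those generated by the checkpoint currently being served. The `Sample` type, the `checkpoint_version` field, and the `max_lag` parameter are hypothetical names chosen for illustration; the source does not describe how Cursor tags or filters its data.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Sample:
    checkpoint_version: int  # version of the model that produced this response
    tokens: int              # number of inference tokens in the interaction


def on_policy_samples(
    samples: Iterable[Sample], serving_version: int, max_lag: int = 0
) -> List[Sample]:
    """Keep samples produced by the serving checkpoint, or by one at most
    `max_lag` versions behind it. Older samples are treated as off-policy."""
    return [
        s for s in samples
        if 0 <= serving_version - s.checkpoint_version <= max_lag
    ]
```

The shorter the gap between deployments, the less data a filter like this has to discard, which is one way to read the link between rapid iteration and on-policy training.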

Method

Collect user interaction tokens, distill them into reward signals, compute model weight updates, evaluate the updated checkpoint against eval suites (e.g., CursorBench) to catch regressions, and deploy it within hours.
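
The steps above can be sketched as a single eval-gated cycle. This is an illustrative outline only, not Cursor's implementation: `Checkpoint`, `training_cycle`, and the four callables passed in (`collect`, `reward_fn`, `update`, `evaluate`) are hypothetical stand-ins, and the gating rule (reject the candidate if any eval metric drops) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Checkpoint:
    version: int
    weights: List[float]  # toy stand-in for model parameters


def training_cycle(
    current: Checkpoint,
    collect: Callable[[Checkpoint], List[dict]],
    reward_fn: Callable[[dict], float],
    update: Callable[[Checkpoint, List[dict], List[float]], Checkpoint],
    evaluate: Callable[[Checkpoint], Dict[str, float]],
    max_regression: float = 0.0,
) -> Checkpoint:
    """Run one collect -> reward -> update -> eval-gate cycle.

    Returns the candidate checkpoint if no eval metric falls more than
    `max_regression` below the current checkpoint's score; otherwise
    returns the current checkpoint unchanged.
    """
    interactions = collect(current)  # data generated by the serving model
    rewards = [reward_fn(i) for i in interactions]
    candidate = update(current, interactions, rewards)

    baseline = evaluate(current)
    scores = evaluate(candidate)
    regressed = [
        name for name, base in baseline.items()
        if scores.get(name, float("-inf")) < base - max_regression
    ]
    return current if regressed else candidate
```

Calling `training_cycle` repeatedly, each time with the checkpoint it returned, gives the continuous loop described in the summary; the eval gate is what keeps a bad update from reaching users.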

In practice

Instrument client-side for user interaction signals.
Implement fast deployment paths for model updates.
Monitor for reward hacking to refine reward functions.
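
As a rough illustration of the first and third points, the sketch below turns a few client-side signals into a scalar reward and flags a checkpoint whose training reward rose while an independent held-out metric did not. The signal names mirror the metrics reported in the summary, but treating them as reward inputs, the weights, the threshold, and every identifier here are assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class InteractionSignals:
    edit_persisted: bool         # did the agent's edit stay in the codebase?
    dissatisfied_followup: bool  # did the user follow up to complain or redo?
    latency_s: float             # response latency in seconds


def reward_from_signals(s: InteractionSignals, latency_weight: float = 0.01) -> float:
    """Hypothetical scalar reward; the weights are illustrative only."""
    reward = 1.0 if s.edit_persisted else 0.0
    if s.dissatisfied_followup:
        reward -= 1.0
    return reward - latency_weight * s.latency_s


def looks_like_reward_hacking(
    reward_before: List[float],
    reward_after: List[float],
    heldout_before: float,
    heldout_after: float,
    min_reward_gain: float = 0.05,
) -> bool:
    """Flag an update whose mean training reward rose by at least
    `min_reward_gain` while a held-out metric failed to improve."""
    gain = mean(reward_after) - mean(reward_before)
    return gain >= min_reward_gain and heldout_after <= heldout_before
```

A flag from a check like this is a prompt for investigation rather than proof of hacking; in the spirit of the source, each confirmed case would be handled as a bug report against the reward definition.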

Topics

Real-time RL
Coding Models
Composer
Train-Test Mismatch
Reward Hacking

Best for: AI Scientist, Research Scientist, Machine Learning Engineer, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Cursor Blog.