Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

2025-06-22 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

Edit-R2 is a reinforcement learning framework for multi-turn in-context image editing that targets two failure modes of long editing sessions: long-context dilution and state contamination. Developed by researchers at the Hong Kong University of Science and Technology and Kuaishou Technology, it reconstructs the operative session intent with an in-context chain-of-thought (IC-CoT) and applies a unified multi-turn RL objective to both discrete text reasoning and continuous latent-space image generation. A trajectory filtering mechanism stabilizes training. For evaluation, the authors introduce MICE-Bench, a large-scale benchmark of 720 three-turn editing instances scored with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA). In the reported experiments, Edit-R2 improves over BAGEL by 18% IF and 18% GA at Turn 2 and by 8% IF and 15% GA at Turn 3, outperforming other open-source models and remaining competitive with closed-source ones.

Key takeaway

AI scientists and machine learning engineers building interactive image editing systems should not expect single-turn models to hold up in realistic multi-turn user sessions. Consider adding explicit session intent reconstruction, such as Edit-R2's IC-CoT, together with unified multi-turn reinforcement learning. This combination mitigates long-context dilution and error propagation, which is what keeps iterative edits consistent and faithful to instructions, as reflected in Edit-R2's gains over BAGEL on MICE-Bench at Turns 2 and 3.

Key insights

Multi-turn image editing benefits from explicit session intent reconstruction and unified RL, which together limit long-context dilution and error propagation.

Principles

Explicitly reconstruct session intent from interleaved image-text history.
Jointly optimize discrete reasoning and continuous visual generation.
Filter corrupted rollouts to stabilize multi-turn reinforcement learning.
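
The third principle can be illustrated with a minimal sketch. The summary does not say how Edit-R2 detects a corrupted rollout, so everything here is an assumption made for illustration: the `Rollout` structure, the per-turn scalar rewards in [0, 1], the `min_reward` threshold, and the rule that a rollout is kept only if enough of its leading turns are clean.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One multi-turn editing trajectory (hypothetical structure)."""
    rollout_id: str
    turn_rewards: List[float]  # one scalar reward per turn, assumed in [0, 1]


def valid_prefix_length(rollout: Rollout, min_reward: float = 0.3) -> int:
    """Count leading turns whose reward meets the (assumed) threshold.

    Counting stops at the first failing turn, because later turns are
    conditioned on that turn's output and may be contaminated by it.
    """
    count = 0
    for reward in rollout.turn_rewards:
        if reward < min_reward:
            break
        count += 1
    return count


def filter_rollouts(
    rollouts: List[Rollout],
    min_reward: float = 0.3,
    min_valid_turns: int = 1,
) -> List[Rollout]:
    """Keep rollouts that have at least `min_valid_turns` clean leading turns."""
    return [
        r for r in rollouts
        if valid_prefix_length(r, min_reward) >= min_valid_turns
    ]


if __name__ == "__main__":
    batch = [
        Rollout("a", [0.8, 0.7, 0.9]),
        Rollout("b", [0.1, 0.9, 0.9]),  # fails at Turn 1
        Rollout("c", [0.6, 0.2, 0.8]),  # fails at Turn 2
    ]
    kept = filter_rollouts(batch, min_valid_turns=2)
    print([r.rollout_id for r in kept])
```

Stopping at the first failing turn reflects the error-propagation concern: a high reward at Turn 3 is hard to trust if the Turn 2 image it builds on was already wrong.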

Method

Edit-R2 reconstructs operative intent via IC-CoT, then jointly optimizes IC-CoT generation and flow-matching image generation using unified multi-turn RL with prefix-valid advantage refinement.
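
The summary names prefix-valid advantage refinement without defining it. One plausible reading, sketched below, is that a turn contributes a learning signal only when every turn before it in the same rollout was valid. The group-relative normalization (mean and standard deviation over a group of rollouts, in the style of GRPO), the `group_valid` flags, and the function name are all assumptions for illustration, not the paper's formulation.

```python
from statistics import mean, pstdev
from typing import List


def prefix_valid_advantages(
    group_rewards: List[List[float]],
    group_valid: List[List[bool]],
    eps: float = 1e-8,
) -> List[List[float]]:
    """Per-turn group-relative advantages, zeroed outside the valid prefix.

    group_rewards[i][t]: reward of rollout i at turn t.
    group_valid[i][t]:   whether turn t of rollout i is valid on its own.
    """
    n_turns = len(group_rewards[0])

    # A turn is "prefix-valid" only if it and all earlier turns are valid.
    prefix_mask = []
    for flags in group_valid:
        still_ok = True
        row = []
        for flag in flags:
            still_ok = still_ok and flag
            row.append(still_ok)
        prefix_mask.append(row)

    advantages = [[0.0] * n_turns for _ in group_rewards]
    for t in range(n_turns):
        members = [i for i, row in enumerate(prefix_mask) if row[t]]
        if len(members) < 2:
            continue  # no meaningful baseline from fewer than two rollouts
        rewards = [group_rewards[i][t] for i in members]
        baseline, spread = mean(rewards), pstdev(rewards)
        for i in members:
            advantages[i][t] = (group_rewards[i][t] - baseline) / (spread + eps)
    return advantages


if __name__ == "__main__":
    rewards = [[1.0, 0.5, 0.9], [0.0, 0.5, 0.1], [0.5, 1.0, 0.5]]
    valid = [[True, True, True], [True, False, True], [True, True, True]]
    for row in prefix_valid_advantages(rewards, valid):
        print([round(a, 3) for a in row])
```

In this sketch the second rollout is invalid at Turn 2, so its Turn 2 and Turn 3 advantages stay at zero and it is also excluded from the baseline for those turns. In a joint setup, the same per-turn advantage could weight both the IC-CoT tokens and the image-generation loss for that turn, though how Edit-R2 shares the signal is not stated in the summary.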

In practice

Implement IC-CoT to distill historical constraints.
Employ trajectory filtering to keep corrupted rollouts from adding noise to training.
Use global awareness (GA), instruction following (IF), and content consistency (CC) to evaluate multi-turn editing.
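
For the evaluation step, results are easiest to read when the three metrics are reported per turn, as the Turn 2 and Turn 3 comparisons in the summary are. The sketch below only aggregates scores that already exist. The record layout (`turn`, `IF`, `CC`, `GA` keys) and the numeric score scale are assumptions, and it does not attempt to reproduce how MICE-Bench computes each metric.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Mapping

METRICS = ("IF", "CC", "GA")


def per_turn_means(records: Iterable[Mapping]) -> Dict[int, Dict[str, float]]:
    """Average each metric over all records that share a turn index.

    Each record is assumed to look like:
        {"turn": 2, "IF": 0.7, "CC": 0.9, "GA": 0.6}
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for record in records:
        for metric in METRICS:
            buckets[record["turn"]][metric].append(record[metric])
    return {
        turn: {metric: mean(scores[metric]) for metric in METRICS}
        for turn, scores in sorted(buckets.items())
    }


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    sample = [
        {"turn": 1, "IF": 1.0, "CC": 1.0, "GA": 1.0},
        {"turn": 1, "IF": 0.0, "CC": 1.0, "GA": 0.5},
        {"turn": 2, "IF": 0.5, "CC": 0.5, "GA": 0.0},
    ]
    for turn, scores in per_turn_means(sample).items():
        print(turn, scores)
```

Keeping turns separate makes degradation across a session visible, which a single session-wide average would hide.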

Topics

Multi-turn Image Editing
Reinforcement Learning
In-Context Chain-of-Thought
Multimodal Foundation Models
MICE-Bench
Global Awareness Metric

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.