Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning
Summary
Stable-Layers is a novel reinforcement learning framework designed to fine-tune image layer decomposition models without requiring paired supervision. It leverages feedback solely from a vision-language model (VLM). Starting with the Qwen-Image-Layered model, Stable-Layers employs Flow-GRPO with LoRA adaptation. The process involves sampling multiple candidate decompositions for each image, which are then scored by a VLM, and subsequently, the policy is optimized using group-relative advantages. A significant challenge addressed is the VLM's tendency to compress judgment scores into a narrow band, limiting variance for GRPO. This is overcome by a two-stage evaluation pipeline: initial structured per-sample scoring based on five edit-centric criteria, followed by a grid-based calibration step where the VLM re-scores all candidates side-by-side. The framework demonstrably yields decompositions with enhanced layer separation, reduced blank or artifact-heavy layers, and lower per-layer reconstruction error on the Crello dataset compared to the base model.
Key takeaway
For Machine Learning Engineers developing image layer decomposition models, Stable-Layers offers a critical shift: you can now fine-tune models like Qwen-Image-Layered using only VLM feedback, eliminating the costly and time-consuming need for paired supervision. This approach, particularly its two-stage VLM scoring, provides a robust method to achieve stronger layer separation and fewer artifacts. You should explore VLM-scored reinforcement learning to accelerate model development and improve decomposition quality without extensive data labeling efforts.
Key insights
Stable-Layers fine-tunes image layer decomposition models using VLM-scored reinforcement learning, eliminating paired supervision and improving decomposition quality.
Principles
- VLM feedback can replace paired supervision.
- Group-relative advantages improve policy optimization.
- Two-stage VLM scoring enhances reward signal.
Method
Stable-Layers applies Flow-GRPO with LoRA to Qwen-Image-Layered, samples decompositions, scores them with a two-stage VLM evaluation, then optimizes policy from group-relative advantages.
In practice
- Fine-tune decomposition models without paired data.
- Apply two-stage VLM scoring for robust rewards.
- Achieve stronger layer separation in images.
Topics
- Image Layer Decomposition
- Reinforcement Learning
- Vision-Language Models
- Model Fine-tuning
- Qwen-Image-Layered
- Flow-GRPO
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.