Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

Stable-Layers is a reinforcement learning framework that fine-tunes image layer decomposition models without paired supervision, using feedback from a vision-language model (VLM) as the only training signal. Starting from the Qwen-Image-Layered model, it applies Flow-GRPO with LoRA adaptation: for each image it samples multiple candidate decompositions, scores them with the VLM, and optimizes the policy using group-relative advantages. The main obstacle is that the VLM tends to compress its judgment scores into a narrow band, which leaves GRPO little reward variance to learn from. Stable-Layers addresses this with a two-stage evaluation pipeline: structured per-sample scoring on five edit-centric criteria, followed by a grid-based calibration step in which the VLM re-scores all candidates side by side. On the Crello dataset, the fine-tuned model produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error than the base model.

Key takeaway

For machine learning engineers working on image layer decomposition, Stable-Layers shows that a model such as Qwen-Image-Layered can be fine-tuned using only VLM feedback, with no paired supervision. The two-stage VLM scoring is what makes the reward usable, and the reported result is stronger layer separation with fewer artifacts. VLM-scored reinforcement learning is worth exploring when paired layer annotations are expensive or unavailable.

Key insights

Stable-Layers fine-tunes image layer decomposition models using VLM-scored reinforcement learning, eliminating paired supervision and improving decomposition quality.

Principles

VLM feedback can replace paired supervision.
Group-relative advantages improve policy optimization.
Two-stage VLM scoring counteracts score compression and strengthens the reward signal.
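
A minimal sketch of the two-stage scoring idea, under stated assumptions. In the source, both stages are VLM calls, and the second stage looks at a grid of all candidates side by side. Here the first stage is reduced to averaging five criterion scores (the averaging is an assumption), and the second stage is replaced by a deterministic rank-based stand-in so the sketch runs without a model. All function names are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence


def stage_one_scores(per_criterion: Sequence[Sequence[float]]) -> list[float]:
    """Collapse each candidate's criterion scores into one number.

    Taking the mean is an assumption; the source only says that five
    edit-centric criteria are scored per sample.
    """
    return [mean(scores) for scores in per_criterion]


def rank_spread(initial: list[float]) -> list[float]:
    """Placeholder for the side-by-side VLM calibration call.

    It spreads candidates evenly over [0, 1] by rank, which mimics the
    effect of comparing candidates against each other instead of
    scoring each one in isolation.
    """
    n = len(initial)
    if n == 1:
        return [0.5]
    order = sorted(range(n), key=lambda i: initial[i])
    calibrated = [0.0] * n
    for rank, idx in enumerate(order):
        calibrated[idx] = rank / (n - 1)
    return calibrated


def two_stage_rewards(
    per_criterion: Sequence[Sequence[float]],
    calibrate: Callable[[list[float]], list[float]] = rank_spread,
) -> list[float]:
    """Stage 1: per-sample scores. Stage 2: group-level calibration."""
    return calibrate(stage_one_scores(per_criterion))


# Hypothetical scores for three candidates; all land in a narrow band near 7.
narrow_band = [
    [7, 7, 8, 7, 7],
    [7, 8, 8, 7, 7],
    [7, 7, 7, 7, 7],
]
rewards = two_stage_rewards(narrow_band)
```

The point of the second stage is visible in the example: first-stage scores that differ by a few tenths are turned into rewards that use the full range, which gives a group-relative method more variance to work with.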

Method

Stable-Layers applies Flow-GRPO with LoRA to Qwen-Image-Layered: it samples several candidate decompositions per image, scores them with a two-stage VLM evaluation, then optimizes the policy using group-relative advantages.
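
The group-relative advantage step can be illustrated with a short sketch. It uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation); the function name, the `eps` value, and the example scores are assumptions for illustration and are not taken from the Stable-Layers implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of candidates for the same image.

    Each candidate is compared with its siblings: candidates scoring above
    the group mean get a positive advantage, those below get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical VLM scores for four candidate decompositions of one image.
spread = group_relative_advantages([0.2, 0.5, 0.6, 0.9])

# When the scorer assigns every candidate the same value, the group carries
# no preference signal and every advantage is zero.
tied = group_relative_advantages([0.7, 0.7, 0.7, 0.7])
```

The tied case shows why compressed VLM scores are a problem for this kind of training: when candidates in a group receive the same or nearly the same score, the policy gets little or no usable signal from that group.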

In practice

Fine-tune decomposition models without paired data.
Apply two-stage VLM scoring for robust rewards.
Achieve stronger layer separation in images.

Topics

Image Layer Decomposition
Reinforcement Learning
Vision-Language Models
Model Fine-tuning
Qwen-Image-Layered
Flow-GRPO

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.