StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling

Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

StyleVAR is a framework for reference-based image style transfer built on Visual Autoregressive Modeling (VAR), designed to balance content preservation against style intensity. Released on April 21, 2026, it formulates style transfer as conditional discrete sequence modeling in a multi-scale latent space: a VQ-VAE tokenizes images into discrete codes, and a transformer autoregressively models the target tokens conditioned on style and content tokens. Conditioning uses a blended cross-attention mechanism whose blending coefficient depends on the scale. Training has two stages: supervised fine-tuning on a large triplet dataset (OmniStyle-150K and ImagePulse), then reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, using Per-Action Normalization Weighting (PANW). StyleVAR consistently outperforms an AdaIN baseline on three benchmarks (in-, near-, and out-of-distribution) on metrics including Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and GRPO yields further gains. The model does best on landscapes and architectural scenes; it shows a generalization gap on internet images and struggles with human faces, both attributed to training data limitations.
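To make the blended cross-attention idea concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the linear schedule in `blend_coefficient`, its `start` and `end` values, and the choice to reuse tokens directly as keys and values (no learned projections, single head) are all hypothetical. The summary only says that the blending coefficient depends on the scale, not how.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def blend_coefficient(scale_idx, num_scales, start=0.2, end=0.8):
    """ASSUMPTION: a linear schedule from `start` (coarsest scale) to `end`
    (finest scale). The actual schedule and its direction are not given in
    the summary."""
    if num_scales == 1:
        return start
    t = scale_idx / (num_scales - 1)
    return start + t * (end - start)


def blended_cross_attention(queries, style_tokens, content_tokens, alpha):
    """Mix style-conditioned and content-conditioned attention outputs.
    alpha = 1 uses only the style branch, alpha = 0 only the content branch."""
    style_out = cross_attention(queries, style_tokens, style_tokens)
    content_out = cross_attention(queries, content_tokens, content_tokens)
    return [[alpha * s + (1.0 - alpha) * c for s, c in zip(s_row, c_row)]
            for s_row, c_row in zip(style_out, content_out)]


if __name__ == "__main__":
    # Toy 2-dimensional tokens, purely illustrative.
    queries = [[1.0, 0.0], [0.0, 1.0]]
    style = [[1.0, 0.0], [0.0, 1.0]]
    content = [[2.0, 2.0]]
    for k in range(3):
        alpha = blend_coefficient(k, 3)
        print(k, alpha, blended_cross_attention(queries, style, content, alpha))
```

Exposing `alpha` as an explicit argument keeps the blending rule separate from the attention computation, so a different schedule (or a learned coefficient) can be swapped in without touching the rest of the sketch.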

Key takeaway

For research scientists developing image generation models, StyleVAR shows one way to balance content preservation and style intensity in style transfer. Consider combining multi-scale autoregressive modeling with blended cross-attention when you need scale-level control over that balance. The two-stage training, in particular GRPO-based reinforcement fine-tuning with a perceptual reward and Per-Action Normalization Weighting (PANW), is reported to improve output quality and to address credit assignment across the multi-scale hierarchy. Check the diversity of your training data to limit generalization gaps, especially for difficult subjects such as human faces.

Key insights

StyleVAR uses multi-scale autoregressive modeling with blended cross-attention and GRPO to achieve high-quality, controllable image style transfer.

Principles

Method

StyleVAR trains in two stages: supervised fine-tuning on paired triplets, followed by GRPO-based reinforcement fine-tuning using a DreamSim perceptual reward and Per-Action Normalization Weighting (PANW) to rebalance credit across VAR's multi-scale hierarchy.
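The credit-rebalancing idea can be sketched as follows. This is a hedged illustration, not the paper's formulation: the summary does not give the PANW formula, so `per_scale_weights` (inverse token count, renormalized) is an assumption, as are the function names and the toy reward values. `PATCH_NUMS` follows the token-map side lengths of the original VAR 256×256 configuration, which StyleVAR may or may not use. The clipped probability ratio and KL penalty of full GRPO are omitted for brevity.

```python
from statistics import mean, pstdev

# ASSUMPTION: token-map side lengths per scale, as in the original VAR
# 256x256 setup; scale k holds PATCH_NUMS[k] ** 2 tokens.
PATCH_NUMS = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_scale_weights(patch_nums, exponent=1.0):
    """HYPOTHETICAL PANW-style weights: scale k gets 1 / (tokens_k ** exponent),
    renormalized to sum to 1, so fine scales with many tokens do not
    dominate the update."""
    raw = [1.0 / float(p * p) ** exponent for p in patch_nums]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_policy_objective(logprobs_per_scale, advantage, weights):
    """Simplified surrogate for one sample: sum over scales of
    weight_k * advantage * (sum of token log-probs at scale k)."""
    return sum(w * advantage * sum(lps)
               for w, lps in zip(weights, logprobs_per_scale))


if __name__ == "__main__":
    # Toy rewards for a group of 4 samples from one (content, style) pair;
    # in the paper the reward is DreamSim-based, here the numbers are made up.
    rewards = [0.62, 0.71, 0.55, 0.80]
    advantages = group_relative_advantages(rewards)
    weights = per_scale_weights(PATCH_NUMS)
    # Placeholder log-probs: every token gets log-prob -1.0.
    logprobs = [[-1.0] * (p * p) for p in PATCH_NUMS]
    print(advantages)
    print(weighted_policy_objective(logprobs, advantages[0], weights))
```

With `exponent=1.0`, each scale's summed log-probability is effectively averaged over its tokens, so the single token at the coarsest scale and the 256 tokens at the finest scale contribute on the same footing. Without such weighting, the finest scale alone would account for 256 of the 680 tokens in this configuration.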

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.