CrossFlow: One-Step Generation Across Latent and Pixel Spaces

2026-06-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Advanced, quick

Summary

CrossFlow introduces a cross-space flow formulation for image generation that addresses the efficiency-quality mismatch in latent diffusion models. Traditional latent diffusion optimizes the generator in latent space but relies on a separately trained decoder to produce the final pixel-space output, and that decoder may struggle with generated latents. CrossFlow instead maps noisy latent inputs directly to pixel-space images using a velocity-free one-step objective: the latent trajectory defines the training path, but the prediction target is an image. A single model can therefore act as both a one-step latent-to-pixel generator and a decoder replacement, so no separate decoder is needed at inference. CrossFlow-XL achieves an FID of 1.62 on class-conditional ImageNet-1k at 256x256 with one function evaluation. Its fidelity relies on a latent encoder and on pixel-space perceptual and adversarial losses.

Key takeaway

For machine learning engineers optimizing image generation pipelines, CrossFlow unifies latent and pixel-space generation in a single model. If your current latent diffusion models suffer from decoder-induced quality mismatches or require multiple inference steps, cross-space flow objectives are worth investigating. The approach lets one model generate high-fidelity images directly from noisy latents with one function evaluation, which could simplify your inference process and improve output quality.

Key insights

CrossFlow directly maps noisy latents to pixel-space images, combining latent efficiency with direct pixel supervision in one model.

Principles

Cross-space flow objectives enhance efficiency.
Direct pixel-space supervision improves fidelity.
Integrating generator and decoder streamlines inference.

Method

CrossFlow uses a velocity-free one-step objective, defining the training path via latent trajectory while supervising prediction directly in pixel space.
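To make the shape of this objective concrete, here is a minimal NumPy sketch, not the paper's implementation. It assumes a linear interpolation path between the clean latent and Gaussian noise (with t=1 meaning pure noise), uses a plain MSE in place of the perceptual and adversarial losses mentioned in the summary, and uses a toy linear map as the generator. The names noisy_latent, cross_space_loss, and toy_generator are hypothetical.

```python
import numpy as np

def noisy_latent(z, eps, t):
    """Point on an ASSUMED linear latent path: t=0 is the clean latent, t=1 is pure noise."""
    t = t.reshape(-1, *([1] * (z.ndim - 1)))  # broadcast per-sample t over latent dims
    return (1.0 - t) * z + t * eps

def cross_space_loss(generator, z, x, rng):
    """Pixel-space loss for a model that is fed noisy latents.

    MSE is a stand-in (assumption) for the perceptual and adversarial terms.
    """
    eps = rng.standard_normal(z.shape)             # Gaussian noise in latent space
    t = rng.uniform(0.0, 1.0, size=z.shape[0])     # one time value per sample
    z_t = noisy_latent(z, eps, t)                  # training path lives in latent space
    x_hat = generator(z_t, t)                      # prediction is an image, not a velocity
    return float(np.mean((x_hat - x) ** 2))        # supervision happens in pixel space

# Toy stand-ins for encoder latents, target images, and the network (all hypothetical).
rng = np.random.default_rng(0)
batch, latent_dim, pixel_dim = 4, 8, 32
W = rng.standard_normal((latent_dim, pixel_dim)) * 0.1

def toy_generator(z_t, t):
    # Ignores t for simplicity; a real model would condition on it.
    return z_t @ W

z = rng.standard_normal((batch, latent_dim))   # would come from a latent encoder
x = rng.standard_normal((batch, pixel_dim))    # flattened target images
loss = cross_space_loss(toy_generator, z, x, rng)
```

The point of the sketch is the split of roles: the noise is added to the latent, while the loss compares the model output with the image itself.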

In practice

Replace separate decoders in latent diffusion.
Generate high-fidelity images with one evaluation.
Optimize for latent efficiency and pixel quality.
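The inference pattern these points describe can be sketched as follows. This is an illustration under stated assumptions, not the released model: CountingModel is a hypothetical stand-in for a trained latent-to-pixel network, class conditioning is omitted, and t=1 is assumed to denote pure noise.

```python
import numpy as np

class CountingModel:
    """Hypothetical stand-in for a trained latent-to-pixel network; counts forward calls."""

    def __init__(self, latent_dim, pixel_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((latent_dim, pixel_dim)) * 0.1
        self.calls = 0

    def __call__(self, latent, t):
        self.calls += 1
        # Toy mapping into a [-1, 1] pixel range; t is unused in this stand-in.
        return np.tanh(latent @ self.W)

def generate_one_step(model, batch, latent_dim, rng):
    """Draw latent noise and map it to pixels with a single model call, no separate decoder."""
    noise = rng.standard_normal((batch, latent_dim))
    t = np.ones(batch)  # ASSUMED convention: t=1 is pure noise
    return model(noise, t)

model = CountingModel(latent_dim=8, pixel_dim=32)
images = generate_one_step(model, batch=4, latent_dim=8, rng=np.random.default_rng(1))
```

After the call, model.calls is 1: the whole pipeline from noise to pixels is one function evaluation, with no decoder pass afterwards.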

Topics

CrossFlow
Image Generation
Latent Diffusion
Flow Matching
Pixel Space
FID Score

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.