CrossFlow: One-Step Generation Across Latent and Pixel Spaces
Summary
CrossFlow introduces a cross-space flow formulation for image generation that addresses the efficiency-quality mismatch in latent diffusion models. Traditional latent diffusion optimizes the generator in latent space but relies on a separately trained decoder to produce the final pixel-space output, and that decoder may struggle with generated latents. CrossFlow instead maps noisy latent inputs directly to pixel-space images using a velocity-free one-step objective: the latent trajectory defines the training path, but the prediction target is an image. A single model can therefore act as both a one-step latent-to-pixel generator and a decoder replacement, so no separate decoder is needed at inference. CrossFlow-XL reaches an FID of 1.62 on class-conditional ImageNet-1k at 256×256 with a single function evaluation. That fidelity depends on a latent encoder and on pixel-space perceptual and adversarial losses.
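For context on the FID figure quoted above: Fréchet Inception Distance compares Gaussian fits to Inception-network features of real and generated images, and lower is better. The standard definition is reproduced below. The symbols are notation chosen here rather than taken from the article: \mu_r and \Sigma_r are the feature mean and covariance for real images, \mu_g and \Sigma_g those for generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```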
Key takeaway
For machine learning engineers optimizing image generation pipelines, CrossFlow unifies latent-space and pixel-space generation in one model. If your current latent diffusion models suffer from decoder-induced quality mismatches or require multiple inference steps, cross-space flow objectives are worth investigating. The approach lets a single model generate high-fidelity images directly from noisy latents in one function evaluation, which can simplify inference and may improve output quality.
Key insights
CrossFlow directly maps noisy latents to pixel-space images, combining latent efficiency with direct pixel supervision in one model.
Principles
- A cross-space flow objective keeps the efficiency of working in latent space.
- Direct pixel-space supervision improves image fidelity.
- Merging the generator and decoder into one model removes a stage from inference.
Method
CrossFlow uses a velocity-free one-step objective, defining the training path via latent trajectory while supervising prediction directly in pixel space.
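The objective described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the linear interpolation path, the stand-in `encode` and `model` functions, the single scalar `weight`, and the plain squared-error loss are all hypothetical choices made for the example. The summary notes that the real model also relies on pixel-space perceptual and adversarial losses, which are omitted here.

```python
import random

def encode(image):
    # Hypothetical stand-in for a latent encoder: average adjacent pixel pairs.
    return [(image[i] + image[i + 1]) / 2 for i in range(0, len(image), 2)]

def noisy_latent(latent, noise, t):
    # Assumed linear path in latent space: t=0 is the clean latent, t=1 is pure noise.
    return [(1 - t) * z + t * e for z, e in zip(latent, noise)]

def model(z_t, t, weight):
    # Hypothetical stand-in network: maps a noisy latent straight to pixels
    # by repeating each latent value twice and scaling by one weight.
    # (This toy ignores t; a real network would condition on it.)
    return [weight * z for z in z_t for _ in range(2)]

def cross_space_loss(image, weight, rng):
    latent = encode(image)
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    t = rng.random()
    z_t = noisy_latent(latent, noise, t)   # the path is defined in latent space
    prediction = model(z_t, t, weight)     # the output lives in pixel space
    # The regression target is the image itself, not a velocity.
    return sum((p - x) ** 2 for p, x in zip(prediction, image)) / len(image)

rng = random.Random(0)
image = [0.2, 0.4, 0.6, 0.8]
loss = cross_space_loss(image, weight=1.0, rng=rng)
```

The point of the sketch is the shape mismatch between input and target: `z_t` has the latent's length, while `prediction` and `image` share the pixel length, so the loss is computed entirely in pixel space.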
In practice
- Replace separate decoders in latent diffusion.
- Generate high-fidelity images with one evaluation.
- Optimize for latent efficiency and pixel quality.
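To make the inference difference concrete, the following sketch counts network calls for a one-step latent-to-pixel model and for a conventional multi-step sampler followed by a decoder. Everything here is hypothetical: `Counted`, `to_pixels`, `denoise_step`, `decode`, and the choice of 50 baseline steps are illustrative stand-ins, not CrossFlow code or figures from the article.

```python
import random

class Counted:
    """Wraps a function and counts how many times it is called."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)

def to_pixels(z, t):
    # Toy one-step latent-to-pixel map: each latent value becomes two pixels.
    return [v for v in z for _ in range(2)]

def denoise_step(z, t):
    # Toy latent denoiser for the baseline: shrinks the latent each step.
    return [0.5 * v for v in z]

def decode(z):
    # Toy separate decoder for the baseline.
    return [v for v in z for _ in range(2)]

def one_step_generate(model, latent_size, rng):
    # Sample Gaussian noise in latent space and call the model once;
    # the output is already a pixel-space image, so no decoder follows.
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]
    return model(z, 1.0)

def multi_step_baseline(denoiser, decoder, latent_size, steps, rng):
    # Conventional pipeline for contrast: several denoiser calls, then a decoder.
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]
    for i in range(steps):
        z = denoiser(z, 1.0 - i / steps)
    return decoder(z)

rng = random.Random(0)
cross = Counted(to_pixels)
image = one_step_generate(cross, 4, rng)

denoiser, decoder = Counted(denoise_step), Counted(decode)
baseline = multi_step_baseline(denoiser, decoder, 4, 50, rng)
print(cross.calls, denoiser.calls + decoder.calls)
```

With these toy settings the one-step path makes a single call, while the baseline makes one call per sampling step plus one for the decoder.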
Topics
- CrossFlow
- Image Generation
- Latent Diffusion
- Flow-matching
- Pixel Space
- FID Score
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.