Improved Baselines with Representation Autoencoders

· Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

RAEv2 improves on Representation Autoencoders (RAE) with three changes that speed up convergence and raise generation quality. First, a generalized formulation defines the representation as the sum of the last k encoder layers, which greatly improves reconstruction without encoder finetuning or specialized data. Second, RAEv2 shows that RAE and Representation Alignment (REPA) are complementary: the same pretrained representation can serve both as the encoder and as the alignment target for intermediate diffusion layers. Third, REPA is reformulated as x-prediction in RAE latent space, which enables "free" internal guidance and removes the need for a separate AutoGuidance model or an additional classifier-free guidance forward pass. Together, these changes yield over 10x faster convergence than the original RAE, with a gFID of 1.06 and an FDrk of 2.17 at 80 epochs on ImageNet-256. RAEv2 also attains an EP_FID@2 of 35 epochs, and the approach is validated on text-to-image generation and navigation world models.

Key takeaway

For machine learning engineers training diffusion models, RAEv2 offers faster convergence and better generation quality than the original RAE. Consider adopting its generalized representation encoder, which sums features from the last k encoder layers, to reach Pareto-optimal reconstruction-generation trade-offs. Its self-guidance mechanism reuses the REPA head, which removes the need for a separate guidance model or an additional forward pass and so reduces computational overhead.
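
To make the self-guidance idea concrete, here is a minimal Python sketch. It assumes an AutoGuidance-style extrapolation in which the model's final x-prediction is pushed away from a weaker prediction, here taken to be the one produced by the REPA head. The function name, the list-of-floats representation, and the extrapolation formula are illustrative assumptions; the summary does not state RAEv2's exact guidance rule.

```python
def extrapolate_guidance(x_main, x_head, scale):
    """Guidance-style extrapolation between two x-predictions.

    x_main: x-prediction from the model's final output (list of floats).
    x_head: x-prediction from the intermediate REPA head; ASSUMED here to
            play the role of the weaker "guiding" prediction.
    scale:  guidance weight; 1.0 returns x_main unchanged, values above 1.0
            move the result further away from x_head.
    """
    if len(x_main) != len(x_head):
        raise ValueError("predictions must have the same length")
    # guided = x_head + scale * (x_main - x_head), applied elementwise
    return [h + scale * (m - h) for m, h in zip(x_main, x_head)]


# Toy usage with hand-picked numbers (not real model outputs).
guided = extrapolate_guidance([1.0, 2.0], [0.0, 1.0], scale=2.0)
```

Because both predictions would come from the same forward pass in this sketch, no second network or extra pass is involved, which is the sense in which the summary calls the guidance "free".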

Key insights

RAEv2 accelerates and simplifies Representation Autoencoder training by summing multi-layer encoder features, combining RAE with REPA, and reusing the REPA head for guidance.

Method

Define the encoder output as the sum of its last k layers (MLS). Combine RAE and REPA by using the same pretrained representation as both the encoder and the alignment target. Reformulate the DiT output as x-prediction so that the REPA head can provide internal guidance.
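
The first step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `multi_layer_sum` is a hypothetical helper name, each layer's features are represented as a plain list of floats, and any normalization applied before or after the sum is left out because the summary does not describe it.

```python
def multi_layer_sum(hidden_states, k):
    """Sum the features of the last k encoder layers elementwise.

    hidden_states: list of per-layer feature vectors, ordered from the first
                   layer to the last; all vectors must have equal length.
    k:             number of final layers to include (1 <= k <= num layers).
    """
    if not 1 <= k <= len(hidden_states):
        raise ValueError("k must be between 1 and the number of layers")
    last_k = hidden_states[-k:]
    # zip(*last_k) groups the i-th feature of every selected layer together
    return [sum(values) for values in zip(*last_k)]


# Toy usage: three "layers" with two features each (made-up numbers).
layers = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
representation = multi_layer_sum(layers, k=2)
```

With k = 1 the helper returns the last layer alone, so a single-layer encoder output is the special case of this more general form.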

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.