Distilling Drifting Transformers with Representation Autoencoders
Summary
Drift-RAE is a new method for distilling pretrained flow models within Representation Autoencoder (RAE) latent spaces, with improved training stability and sample quality. RAEs use DINO features to obtain semantically richer latent spaces, but their anisotropy and high curvature typically hinder trajectory-based distillation. This work quantitatively studies curvature and isotropy across autoencoders and finds that Drifting Models often fail on highly scattered latent spaces such as those of VAEs. Drift-RAE applies the drifting paradigm directly to RAEs, with modifications that theoretically align its drifting fields with other frameworks and improve training stability. Experimentally, Drift-RAE achieved an FID of 1.77 on ImageNet 256 using only 10,000 distillation steps, outperforming existing RAE distillation methods and matching the original Drifting Model without needing an auxiliary MAE feature extractor.
Key takeaway
For Machine Learning Engineers optimizing diffusion or flow model inference, Drift-RAE offers a path to more efficient and stable distillation. By combining RAE latent spaces with the drifting paradigm, it reaches an FID of 1.77 on ImageNet 256 after only 10k distillation steps, without requiring an auxiliary MAE feature extractor. Consider evaluating Drift-RAE if you want to streamline model compression workflows and reduce distillation cost.
Key insights
Drift-RAE enables stable, high-performance distillation of flow models in semantically rich Representation Autoencoder latent spaces.
Principles
- RAE latent spaces are compatible with Drifting Model distillation.
- Drifting Models struggle with extremely scattered latent spaces, such as those of VAEs.
- Aligning drifting fields improves training stability.
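The first two principles depend on how isotropic a latent space is. The summary does not say which isotropy measure the paper uses, so the following is only a minimal sketch of one common proxy: the entropy-based effective rank of the latent covariance, normalized by the latent dimension. The function name `isotropy_score` and the synthetic latents are assumptions made for illustration.

```python
import numpy as np

def isotropy_score(latents):
    """Effective rank of the latent covariance divided by the dimension.

    `latents` has shape (num_samples, dim). The score lies in (0, 1];
    values near 1 mean variance is spread evenly over all directions.
    This is an illustrative proxy, not the metric from the paper.
    """
    centered = latents - latents.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(latents) - 1)
    # Eigenvalues of a symmetric PSD matrix; clip tiny negative round-off.
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eig / eig.sum()
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy) / latents.shape[1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    isotropic = rng.normal(size=(5000, 8))
    # Stretch one axis to mimic an anisotropic latent space.
    anisotropic = isotropic * np.array([10.0, 1, 1, 1, 1, 1, 1, 1])
    print("isotropic:  ", isotropy_score(isotropic))
    print("anisotropic:", isotropy_score(anisotropic))
```

Applied to latents encoded by different autoencoders, a score like this gives a single number for comparing how evenly each space spreads its variance, under the assumption that covariance structure is what matters.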
Method
Drift-RAE distills pretrained flow models in RAE latent spaces using the drifting paradigm, with modifications that align the drifting fields and improve training stability.
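To make the idea of a drifting field concrete, here is a toy sketch of an attraction/repulsion field: generated samples are pulled toward reference (data or teacher) samples and pushed away from other generated samples. The kernel, the temperature `tau`, the normalization, and the function names are all assumptions chosen for illustration. The sketch does not include the Drift-RAE modifications, which the summary does not specify.

```python
import numpy as np

def mean_shift(x, targets, tau=1.0):
    """Kernel-weighted average displacement from each row of x toward targets.

    x: (n, d) query points, targets: (m, d). Weights use exp(-distance / tau)
    and are normalized per query point (an assumed, illustrative choice).
    """
    diff = targets[None, :, :] - x[:, None, :]       # (n, m, d)
    dist = np.linalg.norm(diff, axis=-1)             # (n, m)
    logits = -dist / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                # each row sums to 1
    return (w[:, :, None] * diff).sum(axis=1)        # (n, d)

def drifting_field(x, reference, generated, tau=1.0):
    """Attraction toward reference samples minus attraction toward generated ones."""
    return mean_shift(x, reference, tau) - mean_shift(x, generated, tau)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    generated = 0.1 * rng.normal(size=(16, 2))
    reference = np.array([5.0, 0.0]) + 0.1 * rng.normal(size=(16, 2))
    v = drifting_field(generated, reference, generated)
    print("mean drift:", v.mean(axis=0))  # points toward the reference cluster
```

By construction the field vanishes when the reference and generated sets coincide, which is the equilibrium a drifting-style objective aims for. In a training loop one would typically regress the generator output toward `x + drifting_field(x, ...)` with that target held fixed; this is stated here as an assumption about drifting-style training, not as a detail confirmed by the summary.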
In practice
- Achieves an FID of 1.77 on ImageNet 256 with 10k distillation steps.
- Eliminates the need for an auxiliary MAE feature extractor.
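For readers less familiar with the metric behind the 1.77 figure, FID is the Fréchet distance between two Gaussians fitted to image features: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)). The sketch below implements that formula with NumPy only. In a real evaluation the features come from an Inception network applied to many real and generated images; here random arrays stand in for them, and the helper names are hypothetical.

```python
import numpy as np

def feature_stats(features):
    """Mean vector and covariance matrix of features with shape (n, d)."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).

    Tr((sigma1 sigma2)^(1/2)) is computed as the sum of square roots of the
    eigenvalues of sigma1 @ sigma2, which are real and non-negative for
    positive semi-definite inputs (up to numerical round-off).
    """
    diff = mu1 - mu2
    eig = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(2000, 4))          # stand-in for real-image features
    fake = real + np.array([3.0, 0, 0, 0])     # same shape, shifted mean
    print(frechet_distance(*feature_stats(real), *feature_stats(fake)))
```

Lower is better: identical feature statistics give a distance of zero, and the value grows as the means or covariances of the two feature sets diverge.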
Topics
- Model Distillation
- Representation Autoencoders
- Drifting Models
- Flow Models
- ImageNet
- Latent Space
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.