Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana)
Summary
Sander Dieleman, a research scientist at Google DeepMind, provides a behind-the-scenes look at training large-scale generative models, focusing on diffusion models for audiovisual data. He details eight key areas: data curation, representation, modeling, architecture, training at scale, sampling, distillation, and control. Dieleman emphasizes the critical role of data curation for high-quality results and explains the use of learned compressed latent representations via autoencoders to manage large audiovisual data, reducing tensor sizes by up to two orders of magnitude. He describes diffusion as a process of iterative refinement, where a denoiser learns to reverse a gradual noise corruption process, effectively performing "spectral auto-regression" by generating images from coarse to fine frequencies. The discussion also covers network architectures like U-Nets and Transformers, the importance of parallelism in training, and advanced sampling techniques such as guidance, which significantly improves sample quality by amplifying the difference between conditional and unconditional predictions.
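The "spectral auto-regression" idea can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the signal's power falls off as 1/f^2 with frequency while Gaussian noise has a flat spectrum. Neither the `decay` exponent nor the helper name `cutoff_frequency` comes from the talk. Under these assumptions, higher noise levels drown out all but the coarsest frequencies, so a denoiser stepping from high noise to low noise recovers coarse structure first and fine detail last.

```python
import numpy as np

def cutoff_frequency(sigma, freqs, decay=2.0):
    """Lowest frequency where flat noise power exceeds an assumed 1/f**decay signal spectrum."""
    signal_power = freqs ** (-decay)   # assumed power-law spectrum (illustrative)
    noise_power = sigma ** 2           # white Gaussian noise: same power at every frequency
    drowned = signal_power < noise_power
    return freqs[drowned][0] if drowned.any() else None

freqs = np.arange(1, 257, dtype=float)
# Cutoffs for increasing noise levels: more noise hides progressively lower frequencies.
cutoffs = [cutoff_frequency(s, freqs) for s in (0.01, 0.1, 0.5)]
print(cutoffs)
```

Read in reverse, from the largest noise level to the smallest, the cutoff moves upward, which is the coarse-to-fine ordering the summary describes.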
Key takeaway
For research scientists and computer vision engineers developing generative models, prioritize robust data curation and consider learned latent representations to manage computational load for high-resolution audiovisual data. When deploying diffusion models, leverage guidance during sampling to achieve superior output quality, understanding that this will reduce sample diversity but is crucial for production-grade results. Explore advanced conditioning signals beyond text prompts to enable more precise control over generated content.
Key insights
Diffusion models generate high-quality audiovisual data through iterative denoising of learned compressed latent representations.
Principles
- Data curation is paramount for high-quality generative model results.
- Diffusion models perform spectral auto-regression, generating coarse-to-fine features.
- Guidance trades sample diversity for significantly improved quality.
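The guidance principle above can be sketched in a few lines. This is a minimal illustration, assuming the common formulation in which the final prediction extrapolates from the unconditional prediction toward the conditional one. The function name `guided_prediction` and the parameter `scale` are hypothetical, not taken from the talk.

```python
import numpy as np

def guided_prediction(cond_pred, uncond_pred, scale):
    """Amplify the difference between conditional and unconditional predictions."""
    cond_pred = np.asarray(cond_pred, dtype=float)
    uncond_pred = np.asarray(uncond_pred, dtype=float)
    # scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    # and scale > 1 pushes further in the direction the condition suggests.
    return uncond_pred + scale * (cond_pred - uncond_pred)

cond = np.array([1.0, 2.0])
uncond = np.array([0.0, 0.0])
print(guided_prediction(cond, uncond, 3.0))  # [3. 6.]
```

Scales above 1 concentrate samples on outputs that match the condition strongly, which is one way to read the quality-for-diversity trade-off.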
Method
Train an autoencoder to learn compressed latent representations of audiovisual data. Subsequently, train a diffusion model on these latents, using a denoiser to reverse a gradual noise corruption process for iterative sample generation.
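The diffusion half of this method can be sketched as follows. This is a toy illustration under stated assumptions, not the training recipe from the talk: the cosine noise schedule, the deterministic sampler, and all function names are choices made for this example, and `denoiser` stands in for a trained network that predicts the clean latent from a noisy one.

```python
import numpy as np

def schedule(t):
    """Assumed cosine schedule: signal and noise weights at noise level t in [0, 1]."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def corrupt(x0, t, rng):
    """Forward process: blend clean latents with Gaussian noise at level t."""
    alpha, sigma = schedule(t)
    eps = rng.standard_normal(x0.shape)
    return alpha * x0 + sigma * eps, eps

def denoising_loss(denoiser, x0, t, rng):
    """Training objective sketch: mean squared error between prediction and clean latent."""
    xt, _ = corrupt(x0, t, rng)
    return float(np.mean((denoiser(xt, t) - x0) ** 2))

def sample(denoiser, shape, num_steps, rng):
    """Iterative refinement: start from pure noise and lower the noise level step by step."""
    x = rng.standard_normal(shape)
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        alpha, sigma = schedule(t)
        x0_hat = denoiser(x, t)                   # current estimate of the clean latent
        eps_hat = (x - alpha * x0_hat) / sigma    # noise implied by that estimate
        alpha_n, sigma_n = schedule(t_next)
        x = alpha_n * x0_hat + sigma_n * eps_hat  # move to the lower noise level
    return x

# Toy check with an "oracle" denoiser for a dataset containing a single latent.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])
oracle = lambda x, t: target
result = sample(oracle, target.shape, num_steps=10, rng=rng)
```

With the oracle denoiser the loop converges to `target`. In a real system the denoiser would be a trained network operating on autoencoder latents, and the output of `sample` would be passed through the decoder.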
In practice
- Use learned latent representations to handle large audiovisual data efficiently.
- Apply guidance during sampling to improve output quality and prompt adherence, at the cost of sample diversity.
- Consider post-training to tune the model's "opinionatedness" and to add control signals.
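To make the efficiency argument for latents concrete, here is a back-of-the-envelope calculation. All numbers are hypothetical and chosen only for illustration (a 64-frame 512x512 RGB clip, 4x temporal and 8x spatial downsampling, 8 latent channels). The talk gives only the headline figure of up to two orders of magnitude.

```python
def compression_ratio(frames, height, width, channels,
                      t_down, s_down, latent_channels):
    """Ratio of raw tensor size to latent tensor size for an assumed autoencoder."""
    raw = frames * height * width * channels
    latent = (frames // t_down) * (height // s_down) * (width // s_down) * latent_channels
    return raw / latent

# Hypothetical clip and autoencoder settings (not from the talk).
ratio = compression_ratio(64, 512, 512, 3, t_down=4, s_down=8, latent_channels=8)
print(ratio)  # 96.0
```

Under these assumed settings the diffusion model sees roughly 100x fewer values per example than it would in pixel space.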
Topics
- Generative Diffusion Models
- Data Curation
- Latent Representations
- Spectral Auto-regression
- Model Guidance
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.