Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana)

· Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

Sander Dieleman, a research scientist at Google DeepMind, gives a behind-the-scenes look at training large-scale generative models, focusing on diffusion models for audiovisual data. He covers eight areas: data curation, representation, modeling, architecture, training at scale, sampling, distillation, and control. Dieleman stresses that data curation is critical for high-quality results, and explains how learned compressed latent representations, produced by autoencoders, make large audiovisual data manageable by reducing tensor sizes by up to two orders of magnitude. He describes diffusion as iterative refinement: a denoiser learns to reverse a gradual noise corruption process, and sampling amounts to a form of "spectral autoregression" in which images are generated from coarse (low) to fine (high) frequencies. The discussion also covers network architectures such as U-Nets and Transformers, the importance of parallelism in training, and sampling techniques such as guidance, which significantly improves sample quality by amplifying the difference between conditional and unconditional predictions.
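The coarse-to-fine behaviour can be illustrated with a small numerical sketch. It rests on two assumptions that are not spelled out in this summary: that the signal's power spectrum decays as a power law (taken here as 1/f^2, a common idealization for natural images) and that the corrupting noise is white, so its spectrum is flat. The function name and parameters are hypothetical. Under those assumptions, each noise level hides every frequency whose signal power falls below the noise floor, so lowering the noise reveals progressively finer frequencies.

```python
import numpy as np

def resolvable_frequencies(sigma, num_freqs=256, decay=2.0):
    """Count frequencies whose signal power exceeds the white-noise floor.

    Assumes an idealized power-law signal spectrum (power ~ f^-decay);
    this is an illustrative model, not a measurement of real data.
    """
    freqs = np.arange(1, num_freqs + 1, dtype=float)
    signal_power = freqs ** (-decay)             # decays with frequency
    noise_power = np.full(num_freqs, sigma ** 2)  # flat across frequencies
    return int(np.sum(signal_power > noise_power))

# Walking the noise level down mimics the reverse (denoising) direction:
# more and more high-frequency content rises above the noise floor.
for sigma in (0.5, 0.1, 0.01):
    print(f"sigma={sigma}: {resolvable_frequencies(sigma)} frequencies above the noise")
```

At high noise only the lowest frequencies are distinguishable, which is one way to read the "coarse first, fine later" description of sampling.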

Key takeaway

Research scientists and computer vision engineers developing generative models should prioritize robust data curation and consider learned latent representations to manage the computational load of high-resolution audiovisual data. When deploying diffusion models, use guidance during sampling to improve output quality; it reduces sample diversity, but the trade-off is crucial for production-grade results. Explore conditioning signals beyond text prompts for more precise control over generated content.
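A minimal sketch of guidance as described above, written in the commonly used classifier-free form: start from the unconditional prediction and move along the difference to the conditional one, scaled by a guidance weight. The function name, argument names, and example values are hypothetical; the summary does not state which exact formulation or scale the speaker uses.

```python
import numpy as np

def guided_prediction(pred_cond, pred_uncond, guidance_scale):
    """Amplify the difference between conditional and unconditional predictions.

    guidance_scale = 0 gives the unconditional prediction,
    guidance_scale = 1 gives the plain conditional prediction,
    guidance_scale > 1 extrapolates past the conditional prediction.
    """
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Toy denoiser outputs standing in for real network predictions.
pred_cond = np.array([1.0, 2.0])
pred_uncond = np.array([0.0, 1.0])

for scale in (0.0, 1.0, 3.0):
    print(scale, guided_prediction(pred_cond, pred_uncond, scale))
```

Scales above 1 push samples toward what the conditioning signal implies, which is where the quality gain and the loss of diversity both come from.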

Key insights

Diffusion models generate high-quality audiovisual data through iterative denoising of learned compressed latent representations.

Principles

Method

Train an autoencoder to learn compressed latent representations of audiovisual data. Subsequently, train a diffusion model on these latents, using a denoiser to reverse a gradual noise corruption process for iterative sample generation.
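The two-stage method can be sketched end to end with toy stand-ins. Everything below is an assumption made for illustration: the denoiser is a placeholder that always predicts one fixed latent (as if the dataset held a single example), the decoder is plain nearest-neighbour upsampling rather than a trained autoencoder, and the update rule is a deterministic DDIM-style step, since the summary does not name a specific sampler or noise schedule.

```python
import numpy as np

def corrupt(latent, sigma, rng):
    """Forward process: add Gaussian noise with standard deviation sigma."""
    return latent + sigma * rng.standard_normal(latent.shape)

def sample(denoiser, shape, sigmas, rng):
    """Iterative refinement from pure noise down to the last noise level."""
    x = sigmas[0] * rng.standard_normal(shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x0_hat = denoiser(x, sigma)  # predicted clean latent
        # Keep the prediction, shrink the remaining noise to sigma_next.
        x = x0_hat + (sigma_next / sigma) * (x - x0_hat)
    return x

# Hypothetical stand-ins for the trained networks.
target_latent = np.array([[1.0, -2.0], [0.5, 3.0]])

def toy_denoiser(x, sigma):
    return target_latent  # ideal denoiser for a one-example dataset

def toy_decoder(z):
    return np.repeat(np.repeat(z, 4, axis=0), 4, axis=1)  # 2x2 latent -> 8x8 output

rng = np.random.default_rng(0)
noisy = corrupt(target_latent, sigma=5.0, rng=rng)   # what training would see
sigmas = np.linspace(10.0, 0.0, 11)                  # decreasing noise levels, ending at 0
latent = sample(toy_denoiser, target_latent.shape, sigmas, rng)
image = toy_decoder(latent)
print(latent)
print(image.shape)
```

In a real system the denoiser would be a trained network operating on autoencoder latents, and the decoder would map the final latent back to pixels or audio; the loop structure is the part this sketch is meant to show.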

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.