Let's go Bananas with GenMedia — Guillaume Vernade, Google DeepMind

2026-05-18 · Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, extended

Summary

DeepMind's GenMedia models, including Nano Banana 2, Veo 3.1 Lite, and Lyria, generate images, video, and music. Nano Banana 2 supports new aspect ratios and image grounding for enhanced search, while Veo 3.1 Lite offers lower-cost video generation at $0.05 per second. Lyria, the music generation model, can create 30-second clips or full 3-minute songs, and its real-time variant lets the music change dynamically while it plays. The presentation walked through a practical application: illustrating an open-source book by using Gemini to generate prompts and the GenMedia models to produce the matching images, video, and audio. It also explained the difference between the Gemini API in Google AI Studio and Vertex AI: the former is the developer-friendly route, the latter gives enterprise-grade control.
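As a rough illustration of the per-second video price quoted above, the following sketch estimates the cost of a batch of clips. Only the $0.05 per second figure comes from the summary; the function name, the clip length, and the number of clips are hypothetical, and real billing may differ by region or resolution.

```python
from decimal import Decimal

# Price per second of generated video, as quoted in the summary above.
PRICE_PER_SECOND_USD = Decimal("0.05")


def estimate_video_cost(
    clip_seconds: int,
    num_clips: int = 1,
    price_per_second: Decimal = PRICE_PER_SECOND_USD,
) -> Decimal:
    """Return the estimated cost in USD for num_clips clips of clip_seconds each."""
    if clip_seconds < 0 or num_clips < 0:
        raise ValueError("clip_seconds and num_clips must be non-negative")
    return price_per_second * clip_seconds * num_clips


if __name__ == "__main__":
    # Hypothetical workload: 20 scenes, one 8-second clip per scene.
    print(estimate_video_cost(clip_seconds=8, num_clips=20))  # prints 8.00
```

Using Decimal rather than float keeps the currency arithmetic exact, which matters once many small per-second charges are summed.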

Key takeaway

AI engineers building multimodal applications should explore DeepMind's GenMedia models, using Gemini to generate the prompts so that output stays consistent and high quality. Consider the Interactions API for stateful context management to reduce latency and cost, especially with large inputs such as an entire book. Check regional model availability and pricing, and iterate on the cheaper "lite" models before scaling up.

Key insights

DeepMind's GenMedia models offer multimodal AI for creative content generation, leveraging Gemini for intelligent prompting.

Principles

Multimodal input and output are central to the "world model" vision.
Developer advocacy ensures real-world product utility.
Iterative prompt refinement improves generative AI output.

Method

Use Gemini to generate structured prompts for the characters and scenes of a book, then feed those prompts to the GenMedia models (Nano Banana 2, Veo, Lyria) to create consistent images, videos, and music. Character references can optionally be included to keep the visuals consistent.
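The method above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the speaker's code: the Character and Scene types, build_image_prompt, illustrate, and the generate_image callable are all hypothetical names. In the talk, Gemini writes the character descriptions and scene prompts; here they are written by hand, and the image model is replaced by a stub so the example runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Character:
    name: str
    description: str  # stable visual description reused in every prompt


@dataclass(frozen=True)
class Scene:
    title: str
    action: str
    characters: List[str]  # names that must exist in the character sheet


def build_image_prompt(scene: Scene, sheet: Dict[str, Character], style: str) -> str:
    """Combine a scene with the fixed character descriptions into one prompt."""
    cast = "; ".join(
        f"{sheet[name].name}: {sheet[name].description}" for name in scene.characters
    )
    return f"Style: {style}. Scene: {scene.action}. Characters: {cast}."


def illustrate(
    scenes: List[Scene],
    sheet: Dict[str, Character],
    style: str,
    generate_image: Callable[[str], bytes],  # hypothetical stand-in for an image model call
) -> Dict[str, bytes]:
    """Generate one image per scene, keyed by scene title."""
    return {
        scene.title: generate_image(build_image_prompt(scene, sheet, style))
        for scene in scenes
    }


if __name__ == "__main__":
    sheet = {"Mira": Character("Mira", "a tall girl with a red scarf and short black hair")}
    scenes = [Scene("Arrival", "Mira steps off a train at dusk", ["Mira"])]
    # Stub generator: returns the prompt bytes instead of calling a real model.
    images = illustrate(scenes, sheet, "watercolor", lambda prompt: prompt.encode("utf-8"))
    print(images["Arrival"].decode("utf-8"))
```

Repeating the same character description in every prompt is one simple way to aim for visual consistency; passing reference images, as the method mentions, would be a complementary option.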

In practice

Use Gemini to generate prompts for GenMedia models.
Employ chat mode for context-aware content generation.
Experiment with Lyria RealTime for dynamic, adaptive music.
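To show what "chat mode" buys in the list above, here is a minimal sketch of a session that keeps the conversation history and hands all of it to the model on every turn, so later requests can build on earlier ones. The ChatSession class and the respond callable are hypothetical; a real SDK's chat interface will look different, and the stub backend here only reports how many turns it received.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text), where role is "user" or "model"


class ChatSession:
    """Accumulates turns so each request carries the earlier context."""

    def __init__(self, respond: Callable[[List[Turn]], str]) -> None:
        self._respond = respond  # hypothetical stand-in for a model call
        self.history: List[Turn] = []

    def send(self, message: str) -> str:
        self.history.append(("user", message))
        # Pass a copy so the backend cannot mutate the session history.
        reply = self._respond(list(self.history))
        self.history.append(("model", reply))
        return reply


def count_turns(history: List[Turn]) -> str:
    """Stub backend: reports how much context it was given."""
    return f"seen {len(history)} turns"


if __name__ == "__main__":
    chat = ChatSession(count_turns)
    print(chat.send("Describe the main character."))    # prints: seen 1 turns
    print(chat.send("Now draw her in the next scene."))  # prints: seen 3 turns
```

The second request sees three turns (the first question, the first reply, and the new question), which is what lets a follow-up like "draw her" resolve against the earlier description.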

Topics

GenMedia Models
Multimodal AI
Developer Advocacy
Gemini API
Lyria Music Generation

Best for: AI Engineer, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.