Gemini 3.1 Flash TTS

Source: Simon Willison's Weblog · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Intermediate, quick

Summary

Google released Gemini 3.1 Flash TTS on April 15, 2026. The new text-to-speech model is available through the standard Gemini API under the model ID `gemini-3.1-flash-tts-preview`. Its distinguishing feature is that audio generation can be directed by detailed prompts with sections for "AUDIO PROFILE," "THE SCENE," "DIRECTOR'S NOTES" (covering style, dynamics, pace, and accent), "SAMPLE CONTEXT," and the "TRANSCRIPT." The prompting guide includes an example that specifies vocal characteristics such as a "Vocal Smile," high projection, and an energetic pace, along with regional accents such as Brixton, Newcastle, or Exeter. The model outputs audio files, and a UI for experimenting with it was built using Gemini 3.1 Pro.
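As a rough illustration of what "accessible via the standard Gemini API" could look like, the sketch below builds (but does not send) a `generateContent` request for the model ID above. The request body follows the shape documented for earlier Gemini TTS preview models; whether it applies unchanged to this model is an assumption, as are the `build_tts_request` helper name and the `Kore` voice name.

```python
import json
import urllib.request

# Model ID taken from the summary; the endpoint is the standard Gemini API route.
MODEL_ID = "gemini-3.1-flash-tts-preview"
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL_ID}:generateContent"
)


def build_tts_request(
    prompt: str, api_key: str, voice_name: str = "Kore"
) -> urllib.request.Request:
    """Build a POST request asking the model for audio output.

    Assumption: the body mirrors the shape used by earlier Gemini TTS
    preview models (audio response modality plus a prebuilt voice).
    """
    body = {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice_name}
                }
            },
        },
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-goog-api-key": api_key,
        },
        method="POST",
    )


# Placeholder key; nothing is sent over the network here.
request = build_tts_request("Say cheerfully: Good morning!", api_key="YOUR_API_KEY")
```

Sending the request with `urllib.request.urlopen(request)` and decoding the returned audio are left out, since the response format for this model is not described in the summary.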

Key takeaway

For AI Product Managers or Machine Learning Engineers building audio experiences, Gemini 3.1 Flash TTS offers fine-grained, prompt-level control over generated speech. Explore its detailed prompting capabilities to create customized, expressive voiceovers with specific regional accents and nuanced vocal styles, reducing the need for extensive post-processing.

Key insights

Gemini 3.1 Flash TTS offers highly granular, prompt-driven control over speech generation, including accents and vocal styles.

Principles

Method

Users define audio profiles, scene context, director's notes (style, pace, accent), and sample context within a structured prompt to guide the Gemini 3.1 Flash TTS model's audio generation.
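The structured prompt described above can be sketched as a small helper that assembles the five sections in order. The section names come from the summary; the `#` heading layout, the `TTSPrompt` and `DirectorNotes` class names, and the sample persona and transcript are hypothetical, so check the official prompting guide for the exact format the model expects.

```python
from dataclasses import dataclass


@dataclass
class DirectorNotes:
    """The four aspects the summary lists under DIRECTOR'S NOTES."""
    style: str
    dynamics: str
    pace: str
    accent: str

    def render(self) -> str:
        return "\n".join([
            f"Style: {self.style}",
            f"Dynamics: {self.dynamics}",
            f"Pace: {self.pace}",
            f"Accent: {self.accent}",
        ])


@dataclass
class TTSPrompt:
    audio_profile: str
    scene: str
    notes: DirectorNotes
    sample_context: str
    transcript: str

    def render(self) -> str:
        # Assumed layout: one "# TITLE" heading per section, blank line between.
        sections = [
            ("AUDIO PROFILE", self.audio_profile),
            ("THE SCENE", self.scene),
            ("DIRECTOR'S NOTES", self.notes.render()),
            ("SAMPLE CONTEXT", self.sample_context),
            ("TRANSCRIPT", self.transcript),
        ]
        return "\n\n".join(f"# {title}\n{body}" for title, body in sections)


# Hypothetical values, loosely echoing the characteristics named in the summary.
prompt = TTSPrompt(
    audio_profile="An upbeat breakfast radio host (hypothetical persona).",
    scene="A busy studio at the start of a morning show.",
    notes=DirectorNotes(
        style='"Vocal Smile" throughout',
        dynamics="High projection",
        pace="Energetic",
        accent="Brixton",
    ),
    sample_context="Opening link of the show, straight after the news.",
    transcript="Good morning! You're listening to the breakfast show.",
)
text = prompt.render()
```

Keeping the director's notes as separate fields makes it easy to swap only the accent (for example to Newcastle or Exeter) while holding the rest of the prompt constant for comparison.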

Topics

Best for: Machine Learning Engineer, AI Product Manager, AI Engineer, NLP Engineer, Prompt Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.