microsoft / VibeVoice

2025-08-25 · Source: Github Trending: All languages · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech Technology · Depth: Expert, short

Summary

VibeVoice is an open-source family of voice AI models from Microsoft covering Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). Its key innovation is the use of continuous speech tokenizers operating at an ultra-low 7.5 Hz frame rate, which keeps token sequences short and makes long audio computationally tractable. The VibeVoice-ASR model, released on 2026-01-21, processes up to 60 minutes of audio in a single pass and returns structured transcriptions with speaker identification, timestamps, and content; it supports over 50 languages and customized hotwords. The VibeVoice-Realtime-0.5B model, open-sourced on 2025-12-03, offers real-time streaming TTS for up to 10 minutes of speech with 0.5B parameters and roughly 300 ms latency to first audible output. VibeVoice-TTS, initially released on 2025-08-25, supported long-form multi-speaker speech generation of up to 90 minutes, but its code was later removed over misuse concerns.

Key takeaway

For AI Engineers and Research Scientists developing speech applications, VibeVoice offers open-source models for long-form ASR and real-time TTS. Consider VibeVoice-ASR for transcription that needs speaker diarization, timestamps, and custom hotwords, and VibeVoice-Realtime-0.5B for low-latency speech synthesis with a small deployment footprint. Be mindful of the stated risks of misuse and bias, and deploy responsibly.

Key insights

VibeVoice offers open-source, long-form, and real-time voice AI models for advanced speech recognition and synthesis.

Principles

Ultra-low frame rate tokenizers boost efficiency for long sequences.
Unified models can perform ASR, diarization, and timestamping jointly.
Contextual guidance improves ASR accuracy for domain-specific content.
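
To make the first principle concrete, the arithmetic below counts how many tokenizer frames each stated audio duration needs at 7.5 Hz. The 50 Hz comparison rate is an assumption chosen purely for illustration; it is not a figure from the VibeVoice repository, and the function and variable names are hypothetical.

```python
VIBEVOICE_HZ = 7.5   # frame rate stated in the summary
BASELINE_HZ = 50.0   # assumed comparison rate, for illustration only


def frame_count(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


# Durations quoted in the summary for each model.
DURATIONS_MIN = {
    "VibeVoice-Realtime-0.5B (streaming TTS)": 10,
    "VibeVoice-ASR (single pass)": 60,
    "VibeVoice-TTS (long-form)": 90,
}

for name, minutes in DURATIONS_MIN.items():
    low = frame_count(minutes, VIBEVOICE_HZ)
    high = frame_count(minutes, BASELINE_HZ)
    print(f"{name}: {minutes} min -> {low:,} frames at 7.5 Hz, {high:,} at 50 Hz")
```

Sixty minutes of audio comes to 27,000 frames at 7.5 Hz, which is why an hour-long recording can plausibly fit in a single model context.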

Method

VibeVoice employs continuous speech tokenizers at 7.5 Hz and a next-token diffusion framework, using an LLM for textual context and a diffusion head for high-fidelity acoustic detail generation.
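
The control flow of next-token diffusion can be sketched as a toy loop: a context model produces a hidden state, and a diffusion head conditioned on that state denoises Gaussian noise into the next continuous latent, which is then fed back as context. This is not the VibeVoice implementation. All names and sizes are hypothetical, the weights are random and untrained, text conditioning is omitted for brevity, and the sampler is a generic deterministic DDIM-style update chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, LATENT, STEPS = 16, 4, 8  # hypothetical sizes

# Random, untrained stand-ins for the LLM and the diffusion head.
W_ctx = rng.normal(scale=0.3, size=(LATENT, HIDDEN))
W_eps = rng.normal(scale=0.3, size=(LATENT + HIDDEN + 1, LATENT))

# alpha_bar[t] is the fraction of signal kept at noise level t (decreasing in t).
alpha_bar = np.linspace(0.999, 0.05, STEPS)


def llm_hidden(history):
    """Toy context model: summarize past latents into one hidden state."""
    summary = np.mean(history, axis=0) if history else np.zeros(LATENT)
    return np.tanh(summary @ W_ctx)


def predict_noise(x_t, h, t):
    """Toy diffusion head: predict the noise in x_t given context h and step t."""
    feats = np.concatenate([x_t, h, [t / STEPS]])
    return np.tanh(feats @ W_eps)


def sample_next_latent(h):
    """Denoise from pure Gaussian noise to one latent, conditioned on h."""
    x = rng.normal(size=LATENT)
    for t in range(STEPS - 1, -1, -1):
        eps = predict_noise(x, h, t)
        # Estimate the clean latent implied by the current noise prediction.
        x0 = (x - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        if t == 0:
            x = x0
        else:
            # Deterministic step to the next, lower noise level.
            x = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1 - alpha_bar[t - 1]) * eps
    return x


def generate(num_frames):
    """Autoregressive loop: each new latent is appended to the context."""
    history = []
    for _ in range(num_frames):
        h = llm_hidden(history)
        history.append(sample_next_latent(h))
    return np.stack(history)


latents = generate(15)  # 15 frames correspond to 2 seconds at 7.5 Hz
print(latents.shape)
```

Because the weights are random, the output is meaningless as audio; the sketch only shows where the autoregressive step and the per-frame denoising loop sit relative to each other.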

In practice

Integrate VibeVoice-ASR via Hugging Face Transformers v5.3.0+.
Use VibeVoice-ASR's Playground for quick speech-to-text testing.
Explore VibeVoice-Realtime-0.5B for streaming TTS applications.
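
Downstream code that consumes a transcript with speaker, timestamps, and content usually benefits from a small typed record. The sketch below is one possible shape for such a record, with a renderer and a per-speaker talk-time summary. The `Segment` schema, its field names, and the sample data are hypothetical; they are not the actual VibeVoice-ASR output format, which should be checked against the repository.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript segment (hypothetical schema, not the model's own)."""
    speaker: str
    start_s: float
    end_s: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, truncating fractional seconds."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def render(segments) -> str:
    """One line per segment: [start - end] speaker: text."""
    return "\n".join(
        f"[{format_timestamp(s.start_s)} - {format_timestamp(s.end_s)}] "
        f"{s.speaker}: {s.text}"
        for s in segments
    )


def talk_time(segments) -> dict:
    """Total speaking time in seconds for each speaker."""
    totals = defaultdict(float)
    for s in segments:
        totals[s.speaker] += s.end_s - s.start_s
    return dict(totals)


# Made-up sample data for illustration.
sample = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the meeting."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks, let's get started."),
    Segment("Speaker 1", 9.0, 12.0, "First item on the agenda."),
]

print(render(sample))
print(talk_time(sample))
```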

Topics

VibeVoice
Open-Source Voice AI
Automatic Speech Recognition
Text-to-Speech
Long-form Audio Processing

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Github Trending: All languages.