Introducing VibeVoice ASR: Longform, Structured Speech Recognition At Scale

2026-03-12 · Source: Microsoft Foundry Blog articles · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, quick

Summary

Microsoft Research has released VibeVoice ASR, a unified speech-to-text model that transcribes up to 60 minutes of continuous audio in a single pass. It is available through the Foundry Model Catalog and Hugging Face. The model performs transcription, speaker diarization, and timestamping in one inference pass, producing structured output that identifies each speaker and timestamps each utterance. Processing a long recording as a whole preserves global context. The model targets the difficulties of real-world audio, such as long meetings, multi-speaker conversations, and domain-specific terminology, and supports production-scale scenarios where accuracy and structure are critical.

Key takeaway

For NLP engineers and AI architects building long-form audio solutions, VibeVoice ASR offers a streamlined approach. Because transcription, diarization, and timestamping happen in a single pass, it can simplify pipelines and improve accuracy on complex, real-world audio such as meetings and podcasts. Consider integrating it through Hugging Face Transformers or Microsoft Foundry for production-scale deployments.

Key insights

VibeVoice ASR unifies long-form speech transcription, diarization, and timestamping into a single, context-preserving model.

Principles

Process long audio holistically.
Unify speech tasks into one model.

Method

VibeVoice ASR integrates transcription, speaker diarization, and timestamping into a single model and inference pass, processing up to 60 minutes of continuous audio to maintain global context.
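
As an illustration of why a single structured output simplifies downstream code, here is a minimal Python sketch. The Segment record and its field names (speaker, start, end, text) are assumptions made for this example, not the model's documented output schema. Once speaker labels, timestamps, and text arrive together, tasks like per-speaker talk time or a readable transcript need no alignment step between separate systems.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One utterance in a hypothetical structured transcript (assumed schema)."""
    speaker: str   # speaker label, e.g. "Speaker 1" (assumed format)
    start: float   # utterance start, in seconds
    end: float     # utterance end, in seconds
    text: str      # transcribed words


def talk_time_by_speaker(segments):
    """Sum the seconds each speaker talks across all segments."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def format_transcript(segments):
    """Render segments as '[MM:SS-MM:SS] Speaker: text' lines, in time order."""
    def mmss(seconds):
        minutes, secs = divmod(int(seconds), 60)
        return f"{minutes:02d}:{secs:02d}"

    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{mmss(s.start)}-{mmss(s.end)}] {s.speaker}: {s.text}" for s in ordered
    )
```

Whatever schema the model actually emits would need to be mapped onto a record like this before these helpers apply.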

In practice

Transcribe hour-long meetings in a single pass.
Integrate into agentic pipelines.
Use customized hotwords to help with domain-specific terminology.
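
Before sending a recording for single-pass transcription, it can help to confirm that it fits within the 60-minute window mentioned above. The following is a small sketch using only Python's standard wave module. It assumes an uncompressed PCM WAV file, and the function names and the MAX_SINGLE_PASS_SECONDS constant are hypothetical helpers for this example, not part of any VibeVoice ASR API.

```python
import wave

# 60-minute single-pass limit taken from the summary above.
MAX_SINGLE_PASS_SECONDS = 60 * 60


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_single_pass(path, limit=MAX_SINGLE_PASS_SECONDS):
    """True if the recording is no longer than the single-pass limit."""
    return wav_duration_seconds(path) <= limit
```

Recordings that exceed the limit would need to be split or handled another way; how best to do that is outside what the source describes.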

Topics

VibeVoice ASR
Speech Recognition
Speaker Diarization
Longform Audio
Microsoft Foundry

Best for: NLP Engineer, AI Architect, AI Product Manager, Machine Learning Engineer, AI Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.