microsoft/VibeVoice

Source: Simon Willison's Weblog · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Intermediate, quick

Summary

Microsoft released VibeVoice, an MIT-licensed, Whisper-style speech-to-text model with built-in speaker diarization, on January 21st, 2026. The article's author ran a 4-bit quantized MLX conversion of the 17.3GB VibeVoice-ASR model (mlx-community/VibeVoice-ASR-4bit, 5.71GB) on a 128GB M5 Max MacBook Pro, using `uv` and `mlx-audio`. The source recording was 99.8 minutes long; because of a one-hour limit it was trimmed to 59 minutes, which the model transcribed in 524.79 seconds (about 8 minutes 45 seconds). The tool reported peak memory usage of 30.44GB, though Activity Monitor showed 61.5GB during prefill. The output was a JSON array of text segments, each with start and end times, a duration, and a `speaker_id`; it correctly distinguished the three speakers in the podcast.
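To make the output shape concrete, here is a minimal sketch of post-processing such a JSON array in Python. The sample data is invented for illustration, and the key names `text`, `start`, `end`, and `duration` are assumptions based on the description above; only `speaker_id` is named in the summary, so check the keys against a real output file.

```python
import json
from collections import defaultdict

# Hypothetical sample in the shape the summary describes. Key names other
# than `speaker_id` are assumptions, and the content is invented.
SAMPLE = """[
  {"text": "Welcome to the show.", "start": 0.0, "end": 2.5, "duration": 2.5, "speaker_id": 0},
  {"text": "Thanks for having me.", "start": 2.5, "end": 4.0, "duration": 1.5, "speaker_id": 1},
  {"text": "Let's get started.", "start": 4.0, "end": 5.25, "duration": 1.25, "speaker_id": 0}
]"""


def talk_time_by_speaker(segments):
    """Sum the seconds spoken by each speaker_id."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker_id"]] += seg["end"] - seg["start"]
    return dict(totals)


def format_transcript(segments):
    """Render one line per segment: start time, speaker, text."""
    return "\n".join(
        f"[{seg['start']:7.2f}s] Speaker {seg['speaker_id']}: {seg['text']}"
        for seg in segments
    )


segments = json.loads(SAMPLE)
totals = talk_time_by_speaker(segments)
transcript = format_transcript(segments)
print(transcript)
print(totals)
```

Because the speaker IDs are plain integers, mapping them to real names is left as a manual step after reading a few lines of the transcript.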

Key takeaway

For ML engineers transcribing audio on Apple Silicon, VibeVoice is a locally runnable option with built-in speaker diarization. Consider the `mlx-community/VibeVoice-ASR-4bit` model for memory-efficient inference. Mind the one-hour audio limit: pre-split longer files, with some overlap between segments so that transcription quality holds up across the boundaries.
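The pre-splitting advice comes down to simple window arithmetic. The sketch below plans overlapping chunk boundaries in seconds. The 59-minute cap mirrors the trim described in the summary, while the 30-second overlap and the function name `plan_chunks` are illustrative assumptions, not values from the original article.

```python
def plan_chunks(total_s, max_chunk_s=59 * 60, overlap_s=30):
    """Return (start, end) windows in seconds that cover total_s.

    Each window is at most max_chunk_s long, and consecutive windows
    share overlap_s seconds so no speech is cut at a hard boundary.
    """
    if overlap_s < 0 or overlap_s >= max_chunk_s:
        raise ValueError("overlap_s must be >= 0 and smaller than max_chunk_s")
    if total_s <= 0:
        return []

    chunks = []
    start = 0
    step = max_chunk_s - overlap_s  # how far each new window advances
    while True:
        end = min(start + max_chunk_s, total_s)
        chunks.append((start, end))
        if end >= total_s:
            break
        start += step
    return chunks


# A 99.8-minute recording is 5988 seconds.
chunks = plan_chunks(5988)
print(chunks)  # [(0, 3540), (3510, 5988)]
```

The resulting windows can be handed to any audio tool to cut the file. Text that falls inside the overlap will appear in both transcripts and needs de-duplicating afterwards, and it is safest to assume that `speaker_id` values are not consistent from one chunk to the next.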

Key insights

VibeVoice offers efficient, diarized speech-to-text on local hardware, but has a one-hour audio processing limit.

Method

Use `uv` with `mlx-audio` to run VibeVoice-ASR-4bit, specifying the audio path, JSON as the output format, and a `--max-tokens` value large enough to cover audio segments up to the one-hour limit.
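As a rough sketch of what that invocation could look like, the Python helper below assembles the command as an argument list without running it. Only `uv`, `mlx-audio`, the model name, the JSON format, and `--max-tokens` come from the summary. The entry point `mlx_audio.stt.generate`, the flags `--model`, `--audio`, `--output-path`, and `--format`, and the token value 32768 are all assumptions that should be checked against the tool's `--help` output.

```python
import shlex


def build_command(audio_path, output_path, max_tokens=32768):
    """Assemble a uv + mlx-audio transcription command as an argv list.

    ASSUMPTIONS: the entry point name and every flag except --max-tokens
    are guesses at the mlx-audio CLI; the default max_tokens is illustrative.
    """
    return [
        "uv", "run", "--with", "mlx-audio",
        "mlx_audio.stt.generate",                      # assumed entry point
        "--model", "mlx-community/VibeVoice-ASR-4bit",  # assumed flag name
        "--audio", audio_path,                          # assumed flag name
        "--output-path", output_path,                   # assumed flag name
        "--format", "json",                             # assumed flag name
        "--max-tokens", str(max_tokens),                # flag named in the summary
    ]


# Hypothetical file names, for illustration only.
cmd = build_command("podcast.mp3", "podcast")
print(shlex.join(cmd))
```

Building the command as a list keeps file names with spaces safe if it is later passed to `subprocess.run`, and `shlex.join` gives a copy-pasteable shell line for inspection first.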

Best for: Machine Learning Engineer, AI Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.