OpenBMB / VoxCPM

2025-09-16 · Source: Github Trending: All languages · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

OpenBMB has released VoxCPM2, a 2B-parameter, tokenizer-free text-to-speech (TTS) system for multilingual speech generation, creative voice design, and true-to-life voice cloning. Trained on more than 2 million hours of multilingual speech, the model supports 30 languages and outputs 48kHz studio-quality audio through AudioVAE V2's asymmetric encode/decode design. Voice Design creates a voice from a natural-language description, Controllable Voice Cloning clones a timbre with adjustable style guidance, and Ultimate Cloning targets precise reproduction of vocal nuance. Streaming runs in real time, with a real-time factor (RTF) as low as ~0.13 on an NVIDIA RTX 4090 when accelerated by Nano-vLLM or vLLM-Omni. Released under the Apache-2.0 license, VoxCPM2 is fully open-source and commercial-ready, and it shows competitive performance across multilingual TTS benchmarks.
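To make the RTF figure concrete: real-time factor is conventionally synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The sketch below only illustrates that arithmetic; the function names and the 10-second clip are assumptions for illustration, not part of VoxCPM2.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent synthesizing / duration of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(rtf: float, audio_seconds: float) -> float:
    """Estimate how long it takes to generate a clip at a given RTF."""
    return rtf * audio_seconds


# Illustrative numbers: at the quoted best-case RTF of ~0.13,
# a 10-second clip would take roughly 1.3 seconds to generate.
quoted_rtf = 0.13
clip_seconds = 10.0
estimate = estimated_synthesis_seconds(quoted_rtf, clip_seconds)
print(f"~{estimate:.1f} s to synthesize {clip_seconds:.0f} s of audio")
print("faster than real time:", quoted_rtf < 1.0)
```

The quoted ~0.13 is a best-case figure on specific hardware, so measure RTF on your own deployment target before sizing capacity.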

Key takeaway

For AI engineers building multilingual speech applications, VoxCPM2 is an open-source option for high-quality, controllable TTS. Its tokenizer-free architecture is aimed at naturalness and expressiveness across 30 languages. Voice Design and Controllable Voice Cloning support rapid prototyping and personalized audio, and Nano-vLLM or vLLM-Omni integration supports efficient real-time deployment. Evaluate it against your own language coverage and cloning requirements before adopting it.

Key insights

VoxCPM2's tokenizer-free, diffusion-autoregressive architecture delivers natural, multilingual, and controllable TTS with advanced voice cloning.

Principles

Tokenizer-free TTS enhances naturalness.
Diffusion-autoregressive models generate continuous speech.
Asymmetric VAE designs enable 48kHz audio.

Method

VoxCPM2 runs a four-stage pipeline (LocEnc → TSLM → RALM → LocDiT) in AudioVAE V2's latent space and generates continuous speech directly. It supports supervised fine-tuning (SFT) and LoRA fine-tuning from small amounts of audio.
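As background on why LoRA suits small-data fine-tuning: LoRA freezes a weight matrix and trains two low-rank factors instead, which shrinks the trainable parameter count sharply. The sketch below is generic LoRA bookkeeping, not VoxCPM2's fine-tuning code; the 1024 × 1024 layer and rank 8 are illustrative assumptions, not values from the project.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight matrix W of shape (d_out, d_in)."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in LoRA factors B (d_out x rank) and A (rank x d_in)."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only to show the ratio.
d_out, d_in, rank = 1024, 1024, 8
full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, rank)
print(f"full fine-tuning: {full} trainable parameters")
print(f"LoRA (rank {rank}): {lora} trainable parameters")
print(f"trainable fraction: {lora / full:.4%}")
```

Fewer trainable parameters generally means less data and memory are needed to adapt the model, which is the usual motivation for choosing LoRA over full SFT.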

In practice

Generate new voices from natural language descriptions.
Clone voices from short audio clips with style control.
Deploy with Nano-vLLM or vLLM-Omni for high throughput.

Topics

VoxCPM2
Multilingual TTS
Voice Cloning
Voice Design
Diffusion Models
Real-time Inference

Best for: NLP Engineer, Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Github Trending: All languages.