CN-NewsTTS Bench: a target-level automatic benchmark for raw-input Chinese news TTS pronunciation

2026-06-23 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

CN-NewsTTS Bench v0.1 is an open target-level benchmark designed to evaluate the pronunciation accuracy of Chinese news text-to-speech (TTS) systems when processing raw input containing complex written forms. These forms include scores, hyphenated model names, ranges, unit symbols, percentages, English abbreviations, and mixed Chinese-Latin-digit names, which are common in real-world listening scenarios. The benchmark aims to assess systems without reliance on user-side rules, LLM rewriting, SSML hints, or manual edits. The release comprises a 200-record development set, an 800-record public test set, 992 public auto-evaluable targets, fixed transcripts from a three-ASR ensemble, and an automatic target scorer. Initial results for seven product TTS systems show the best system achieving 0.879 strict accuracy, while several others perform below 0.60. The benchmark also provides ASR-route diagnostics, ASR-subset ablations, category-level results, confidence intervals, and provider configuration metadata.

Key takeaway

NLP engineers developing or deploying Chinese news TTS systems should prioritize robust handling of raw text containing complex written forms. CN-NewsTTS Bench v0.1 exposes large performance gaps: the best of the seven evaluated product systems reaches 0.879 strict accuracy, while several others fall below 0.60 on targets such as scores and abbreviations. Evaluate your models against this benchmark to identify weaknesses, and check that the TTS output conveys the intended spoken form of unedited input rather than depending on pre-processing or SSML.

Key insights

CN-NewsTTS Bench evaluates raw-input Chinese news TTS systems on complex written forms without external aids.

Principles

Complex written forms challenge TTS systems significantly.
Raw text input evaluation reveals true system robustness.
ASR ensembles enhance transcript accuracy for TTS benchmarks.

Method

The benchmark uses a 200-record dev set and 800-record public test set with 992 auto-evaluable targets. It employs a three-ASR ensemble for fixed transcripts and an automatic target scorer.
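
The summary does not say how the three ASR transcripts are combined or how "strict" is defined, so the following is only a minimal sketch of what a target-level scorer could look like. The `Target` fields, the substring match, and the `min_votes` threshold are all assumptions made for illustration, not the benchmark's actual scorer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    record_id: str
    written: str                # raw written form as it appears in the news text
    accepted: tuple[str, ...]   # spoken renderings counted as correct (hypothetical field)


def target_hit(transcript: str, target: Target) -> bool:
    """Return True if any accepted rendering occurs in one ASR transcript."""
    text = transcript.replace(" ", "")
    return any(form.replace(" ", "") in text for form in target.accepted)


def accuracy(targets, transcripts_by_record, min_votes):
    """Fraction of targets confirmed by at least `min_votes` ASR transcripts."""
    if not targets:
        raise ValueError("no targets to score")
    correct = 0
    for target in targets:
        transcripts = transcripts_by_record[target.record_id]
        votes = sum(target_hit(t, target) for t in transcripts)
        correct += votes >= min_votes
    return correct / len(targets)


# Made-up records for illustration only; these are not benchmark data.
targets = [
    Target("r1", "3-1", ("三比一",)),
    Target("r2", "5%", ("百分之五",)),
]
transcripts_by_record = {
    "r1": ["球队三比一获胜", "球队三比一获胜", "球队三杠一获胜"],
    "r2": ["增长百分之五", "增长百分之五", "增长百分之五"],
}

# Assumed reading of "strict": all three ASR systems must confirm the target.
strict = accuracy(targets, transcripts_by_record, min_votes=3)
# A looser majority-vote variant, shown for contrast.
majority = accuracy(targets, transcripts_by_record, min_votes=2)
```

Scoring per target rather than per sentence keeps an ASR error elsewhere in the transcript from penalizing a correctly spoken score or percentage, which is presumably why the benchmark is described as target-level.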

In practice

Use CN-NewsTTS Bench v0.1 to evaluate Chinese news TTS.
Focus TTS development on complex written forms.
Compare system performance against the best reported strict accuracy of 0.879.
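
When comparing a system against a reported figure such as 0.879, the uncertainty of the estimate matters. The benchmark is said to report confidence intervals, but the summary does not state how they are computed; the sketch below uses a Wilson score interval for a binomial proportion as one common choice, and the function name and the counts in the usage line are hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical counts, not benchmark results: 80 correct targets out of 100.
low, high = wilson_interval(80, 100)
```

If the interval for your system overlaps the interval of the system you are comparing against, the difference in strict accuracy may not be meaningful at that sample size.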

Topics

Chinese TTS
News Text-to-Speech
TTS Benchmarking
Pronunciation Evaluation
Raw Text Processing
ASR Ensemble

Code references

lugan113/SynTTS-Commands-Official

Best for: AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.