JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR
Summary
Contextual Automatic Speech Recognition (ASR) struggles with large keyword dictionaries because irrelevant candidates degrade accuracy. The problem is sharper in Chinese ASR, where homophonic errors mislead semantic retrievers. Researchers from Soochow University propose JSPG, a filtering framework that integrates Semantic, Pinyin, and Glyph features to filter keyword dictionaries dynamically. JSPG addresses the limitations of semantic-only approaches by using pinyin for phonetic similarity and glyph features for structural cues, which are crucial for distinguishing homophones in Chinese. The framework introduces an extended Smith-Waterman algorithm to compute sequence-level similarity between N-best ASR hypotheses and keywords. Experiments on the Aishell-1 and RWCS-NER datasets show that JSPG significantly outperforms single-feature baselines and substantially improves keyword recognition accuracy for downstream contextual ASR models such as CopyNE, Multi-grained, and LLM-Refine.
Key takeaway
Research scientists developing Chinese contextual ASR systems should integrate multi-modal features beyond semantics to overcome homophonic errors. Specifically, consider adopting the JSPG framework's joint semantic-pinyin-glyph approach to filter large keyword dictionaries dynamically. In the reported experiments it improves keyword recognition accuracy and overall system robustness, especially in noisy, real-world scenarios, which can reduce Keyword-CER and boost Keyword Recall.
Key insights
Jointly integrating semantic, pinyin, and glyph features significantly improves Chinese contextual ASR keyword retrieval accuracy.
Principles
- Homophonic errors in Chinese ASR distort semantics but preserve phonetic cues.
- Glyph features provide unique structural discrimination for homophones.
- Feature fusion enhances retrieval robustness over single modalities.
Method
JSPG uses a two-stage process: a base ASR generates N-best hypotheses, then an extended Smith-Waterman algorithm computes joint semantic, pinyin, and glyph similarity scores to filter keywords for downstream contextual ASR models.
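The two-stage flow can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it runs a standard Smith-Waterman local alignment in which the usual match/mismatch score is replaced by a continuous character similarity, then keeps the best-scoring keywords across the N-best list. The function names, the `gap` and `offset` values, the `top_k` cut-off, and the example sentences are all assumptions made for this sketch, and the exact-match similarity stands in for the joint semantic-pinyin-glyph score.

```python
from typing import Callable, List, Sequence

Similarity = Callable[[str, str], float]


def local_alignment_score(hyp: str, keyword: str, sim: Similarity,
                          gap: float = 0.5, offset: float = 0.5) -> float:
    """Smith-Waterman local alignment with a continuous substitution score.

    The substitution score is sim(a, b) - offset, so dissimilar characters
    are penalized. The result is divided by the score of a perfect keyword
    match, which keeps it in [0, 1] when sim() returns values in [0, 1].
    """
    if not keyword:
        return 0.0
    prev = [0.0] * (len(keyword) + 1)
    best = 0.0
    for h_char in hyp:
        cur = [0.0]
        for j, k_char in enumerate(keyword, 1):
            sub = sim(h_char, k_char) - offset
            cell = max(0.0,                # restart the local alignment
                       prev[j - 1] + sub,  # align the two characters
                       prev[j] - gap,      # skip a hypothesis character
                       cur[j - 1] - gap)   # skip a keyword character
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best / (len(keyword) * (1.0 - offset))


def filter_keywords(nbest: Sequence[str], dictionary: Sequence[str],
                    sim: Similarity, top_k: int = 2) -> List[str]:
    """Keep the top_k keywords by their best score over all hypotheses."""
    scored = []
    for kw in dictionary:
        score = max((local_alignment_score(h, kw, sim) for h in nbest),
                    default=0.0)
        scored.append((score, kw))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [kw for _, kw in scored[:top_k]]


def exact_match(a: str, b: str) -> float:
    """Placeholder similarity; a joint score would replace this."""
    return 1.0 if a == b else 0.0


# Made-up N-best list and dictionary, for illustration only.
nbest = ["我想去苏州大学", "我想去苏洲大学"]
dictionary = ["清华", "北京大学", "苏州大学"]
shortlist = filter_keywords(nbest, dictionary, exact_match, top_k=2)
```

Passing the similarity as a function keeps the alignment code independent of how the per-character score is computed, so a fused semantic, pinyin, and glyph score can be swapped in without touching the dynamic program.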
In practice
- Use Qwen3-Embedding for semantic scoring.
- Apply normalized Levenshtein Distance for pinyin similarity.
- Average four sub-metrics for glyph similarity.
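The pinyin and glyph scores listed above can be sketched as follows. This is a hedged illustration under stated assumptions: the summary does not name the four glyph sub-metrics, so `glyph_similarity` simply takes four precomputed values in [0, 1]; the semantic score is passed in as a number rather than computed with Qwen3-Embedding; and the fusion weights in `joint_score` are hypothetical, not values from the paper. Pinyin strings are supplied directly here; in practice they would come from a pinyin converter.

```python
from typing import Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def pinyin_similarity(p1: str, p2: str) -> float:
    """1 minus the edit distance normalized by the longer string."""
    longest = max(len(p1), len(p2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(p1, p2) / longest


def glyph_similarity(sub_scores: Sequence[float]) -> float:
    """Average of four glyph sub-metrics (which ones is left open here)."""
    if len(sub_scores) != 4:
        raise ValueError("expected exactly four sub-metric scores")
    return sum(sub_scores) / 4


def joint_score(semantic: float, pinyin: float, glyph: float,
                weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)
                ) -> float:
    """Weighted fusion of the three scores; the weights are assumptions."""
    w_sem, w_pin, w_gly = weights
    return w_sem * semantic + w_pin * pinyin + w_gly * glyph


# 州 and 洲 are homophones (both "zhou"), so their pinyin similarity is 1.0
# even though the characters differ.
same_sound = pinyin_similarity("zhou", "zhou")
near_sound = pinyin_similarity("zhou", "zhao")
```

The homophone case shows why the phonetic channel matters: a recognizer that writes 洲 for 州 changes the character, and possibly the semantic embedding, while the pinyin score still signals a match.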
Topics
- Contextual ASR
- Dynamic Dictionary Filtering
- Chinese Language Processing
- Semantic-Pinyin-Glyph Retrieval
- Extended Smith-Waterman Algorithm
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.