JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR

2026-05-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

Contextual automatic speech recognition (ASR) degrades when the keyword dictionary is large, because irrelevant candidates crowd out the correct ones. The problem is sharper in Chinese, where homophone errors in the transcript mislead semantic retrievers. Researchers from Soochow University propose JSPG, a filtering framework that combines Semantic, Pinyin, and Glyph features to filter the keyword dictionary dynamically. Where semantic-only approaches fail, JSPG uses pinyin to capture phonetic similarity and glyph features to capture structural cues, both of which help distinguish homophones in Chinese. The framework introduces an extended Smith-Waterman algorithm that computes sequence-level similarity between N-best ASR hypotheses and keywords. In experiments on the Aishell-1 and RWCS-NER datasets, JSPG significantly outperforms single-feature baselines and substantially improves keyword recognition accuracy for downstream contextual ASR models such as CopyNE, Multi-grained, and LLM-Refine.

Key takeaway

Research scientists developing Chinese contextual ASR systems should integrate features beyond semantics to overcome homophone errors. Specifically, consider adopting JSPG's joint semantic-pinyin-glyph approach to filter large keyword dictionaries dynamically: it is reported to improve keyword recognition accuracy and overall system robustness, especially in noisy, real-world scenarios, with lower Keyword-CER and higher Keyword Recall.

Key insights

Jointly integrating semantic, pinyin, and glyph features significantly improves Chinese contextual ASR keyword retrieval accuracy.

Principles

Homophonic errors in Chinese ASR distort semantics but preserve phonetic cues.
Glyph features provide unique structural discrimination for homophones.
Feature fusion enhances retrieval robustness over single modalities.

Method

JSPG uses a two-stage process: a base ASR generates N-best hypotheses, then an extended Smith-Waterman algorithm computes joint semantic, pinyin, and glyph similarity scores to filter keywords for downstream contextual ASR models.
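
The filtering stage can be illustrated with a minimal sketch of Smith-Waterman local alignment in which the usual match/mismatch score is replaced by a pluggable character similarity. This is not the paper's implementation: the function names, the `gap` and `offset` parameters, the normalization, and the `threshold` are all assumptions made for illustration, and the `exact` similarity stands in for the joint semantic-pinyin-glyph score.

```python
from typing import Callable, List

# Hypothetical type: a character-level similarity returning a value in [0, 1].
Sim = Callable[[str, str], float]


def local_alignment_score(hyp: str, kw: str, sim: Sim,
                          gap: float = 0.5, offset: float = 0.5) -> float:
    """Smith-Waterman local alignment of keyword `kw` against hypothesis `hyp`.

    Each aligned pair contributes sim(h, k) - offset, so dissimilar characters
    are penalized. The result is normalized so that a perfect contiguous match
    of the whole keyword scores 1.0 (assumes sim peaks at 1.0 and offset < 1).
    """
    if not hyp or not kw:
        return 0.0
    prev = [0.0] * (len(kw) + 1)
    best = 0.0
    for h in hyp:
        cur = [0.0]
        for j, k in enumerate(kw, 1):
            diag = prev[j - 1] + sim(h, k) - offset
            # Local alignment: scores never drop below zero.
            cur.append(max(0.0, diag, prev[j] - gap, cur[j - 1] - gap))
            best = max(best, cur[j])
        prev = cur
    return best / (len(kw) * (1.0 - offset))


def filter_keywords(nbest: List[str], dictionary: List[str], sim: Sim,
                    threshold: float = 0.6) -> List[str]:
    """Keep keywords whose best score over all N-best hypotheses passes the threshold."""
    kept = []
    for kw in dictionary:
        score = max((local_alignment_score(h, kw, sim) for h in nbest), default=0.0)
        if score >= threshold:
            kept.append(kw)
    return kept


def exact(a: str, b: str) -> float:
    """Placeholder similarity; a joint feature score would be plugged in here."""
    return 1.0 if a == b else 0.0


nbest = ["我想去苏州大学", "我想去苏洲大学"]
dictionary = ["苏州大学", "清华大学"]
print(filter_keywords(nbest, dictionary, exact))
```

Taking the maximum over the N-best list means a keyword survives if any hypothesis supports it, which is one plausible way to use the extra hypotheses; the paper may aggregate them differently.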

In practice

Use Qwen3-Embedding for semantic scoring.
Apply normalized Levenshtein Distance for pinyin similarity.
Average four sub-metrics for glyph similarity.
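
Two of the points above can be sketched in a few lines of standard-library Python. The sketch assumes pinyin strings are already available (in practice a converter would produce them), normalizes edit distance by the longer string's length, which is one common convention and not necessarily the paper's, and treats the four glyph sub-metrics as opaque inputs because this summary does not name them. All function names are hypothetical.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute) using two rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def pinyin_similarity(p1: str, p2: str) -> float:
    """1 - normalized Levenshtein distance; 1.0 means identical pinyin.

    Assumption: normalize by the longer string's length.
    """
    longest = max(len(p1), len(p2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(p1, p2) / longest


def glyph_similarity(sub_scores: Sequence[float]) -> float:
    """Unweighted mean of exactly four glyph sub-metric scores (metrics not specified here)."""
    if len(sub_scores) != 4:
        raise ValueError("expected four glyph sub-metric scores")
    return sum(sub_scores) / 4.0


# Near-homophones differ by a single letter in pinyin, so similarity stays high.
print(pinyin_similarity("suzhou", "shuzhou"))
print(glyph_similarity([1.0, 0.5, 0.5, 0.0]))
```

Working on pinyin strings rather than characters is what lets a misrecognized homophone still score close to the intended keyword even when the written characters differ entirely.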

Topics

Contextual ASR
Dynamic Dictionary Filtering
Chinese Language Processing
Semantic-Pinyin-Glyph Retrieval
Extended Smith-Waterman Algorithm

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.