Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, medium

Summary

A Point-of-Interest (POI)-aware contrastive training framework improves the robustness of Automatic Speech Recognition (ASR) on code-switching (CS) speech. The method targets CS-critical regions by training the model against "near-miss" hypotheses. It first detects CS spans as POIs, then perturbs those POIs within the ASR N-best outputs and expands the candidate set with a large language model (LLM). A filter applying acoustic, phonemic, and textual constraints retains only negatives that are hard but acoustically plausible. The framework fine-tunes Whisper-small with LoRA, combining a POI-weighted cross-entropy anchor objective with a multi-negative contrastive ranking loss. On the CS-FLEURS (cmn-eng) and ViMedCSS (vie-eng) datasets, it consistently reduces error rates by more than 2% on both general and CS-aware metrics and outperforms standard LoRA fine-tuning.

Key takeaway

Machine learning engineers building ASR systems for code-switched speech should consider POI-aware contrastive training. Generating "near-miss" hypotheses with an LLM and filtering them for acoustic plausibility lets you target errors in language-alternation regions directly. The approach reduced error rates by more than 2% on CS-FLEURS and ViMedCSS, beating standard LoRA fine-tuning, and is most relevant when the switch crosses distinct scripts or occurs in a specialized domain.

Key insights

Contrastive training with LLM-generated near-misses robustly improves code-switching ASR at critical language-alternation points.

Principles

CS errors cluster around language-switch regions.
Explicitly targeting confusable spans improves ASR.
Acoustic plausibility is key for hard negative generation.

Method

Identify code-switching POIs, generate N-best hypotheses, then perturb the POIs within those hypotheses and expand the candidate set using an LLM. Filter the resulting "near-misses" with acoustic, phonemic, and textual constraints. Fine-tune the ASR model with a POI-weighted cross-entropy anchor loss combined with a multi-negative contrastive ranking loss.
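The summary names the two loss terms but not their exact form. The following is a minimal sketch of one plausible combination, assuming a weighted mean negative log-likelihood for the anchor term and a softmax (InfoNCE-style) ranking over sequence-level scores for the contrastive term. The names `poi_weight`, `lam`, and `temperature` are hypothetical hyperparameters, not values from the paper.

```python
import math

def poi_weighted_ce(token_logprobs, poi_mask, poi_weight=2.0):
    """Weighted mean negative log-likelihood of the reference tokens.

    Tokens flagged in poi_mask count poi_weight times; all others count once.
    (Assumed form: the source only says the cross-entropy is POI-weighted.)
    """
    weights = [poi_weight if in_poi else 1.0 for in_poi in poi_mask]
    total = sum(w * -lp for w, lp in zip(weights, token_logprobs))
    return total / sum(weights)

def contrastive_ranking_loss(pos_score, neg_scores, temperature=1.0):
    """Softmax ranking loss over one positive and several negatives.

    Computes -log(exp(s_pos/t) / (exp(s_pos/t) + sum_k exp(s_neg_k/t)))
    using a log-sum-exp shift for numerical stability.
    """
    scaled = [pos_score / temperature] + [s / temperature for s in neg_scores]
    shift = max(scaled)
    log_z = shift + math.log(sum(math.exp(s - shift) for s in scaled))
    return log_z - scaled[0]

def combined_loss(token_logprobs, poi_mask, pos_score, neg_scores,
                  lam=0.5, poi_weight=2.0, temperature=1.0):
    """Anchor term plus lam times the contrastive term (lam is an assumption)."""
    anchor = poi_weighted_ce(token_logprobs, poi_mask, poi_weight)
    ranking = contrastive_ranking_loss(pos_score, neg_scores, temperature)
    return anchor + lam * ranking

# Toy usage: four reference tokens, the third lies inside a POI.
loss = combined_loss(
    token_logprobs=[-0.2, -0.1, -1.5, -0.3],
    poi_mask=[False, False, True, False],
    pos_score=-2.1,               # e.g. sequence log-likelihood of the reference
    neg_scores=[-2.4, -3.0, -2.2],  # the same score for each near-miss
)
print(f"combined loss: {loss:.4f}")
```

In this sketch the ranking term shrinks as the reference outscores its near-misses, while the anchor term keeps the model tied to the reference transcript, with extra weight on the switch region.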

In practice

Use Whisper-small with LoRA for CS-ASR.
Apply POI detection to identify CS spans.
Generate synthetic hard negatives with LLMs.
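The filtering step for LLM-generated negatives can be illustrated with a small sketch. It assumes each candidate arrives with an acoustic score from rescoring with the ASR model, and that a phonemizer is available as a callable. The thresholds, the `Candidate` type, and the character-level stand-in for phonemes are all hypothetical choices for illustration, not details from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (words or phonemes)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

@dataclass
class Candidate:
    text: str
    acoustic_logprob: float  # assumed to come from rescoring with the ASR model

def filter_near_misses(
    reference: str,
    ref_logprob: float,
    candidates: List[Candidate],
    to_phonemes: Callable[[str], Sequence],
    max_word_edits: int = 2,         # hypothetical threshold
    max_phoneme_ratio: float = 0.2,  # hypothetical threshold
    max_logprob_gap: float = 5.0,    # hypothetical threshold
) -> List[Candidate]:
    """Keep candidates that differ from the reference but stay plausible."""
    ref_words = reference.split()
    ref_phonemes = to_phonemes(reference)
    kept = []
    for cand in candidates:
        # Textual constraint: must differ, but only by a local edit.
        word_edits = edit_distance(ref_words, cand.text.split())
        if word_edits == 0 or word_edits > max_word_edits:
            continue
        # Phonemic constraint: must sound close to the reference.
        phoneme_edits = edit_distance(ref_phonemes, to_phonemes(cand.text))
        if phoneme_edits / max(len(ref_phonemes), 1) > max_phoneme_ratio:
            continue
        # Acoustic constraint: the model must not find it far less likely.
        if ref_logprob - cand.acoustic_logprob > max_logprob_gap:
            continue
        kept.append(cand)
    return kept

# Toy stand-in for a phonemizer: lowercase characters without spaces.
def toy_phonemes(text: str) -> List[str]:
    return list(text.lower().replace(" ", ""))

reference = "bệnh nhân cần chụp MRI ngay"
pool = [
    Candidate("bệnh nhân cần chụp MRI ngay", -10.0),         # identical
    Candidate("bệnh nhân cần chụp MRA ngay", -11.5),         # near-miss
    Candidate("bệnh nhân cần chụp ultrasound ngay", -12.0),  # sounds too different
    Candidate("bệnh nhân cần chụp MRIs ngay", -30.0),        # acoustically unlikely
]
for cand in filter_near_misses(reference, -10.0, pool, toy_phonemes):
    print(cand.text)
```

A real pipeline would replace `toy_phonemes` with a proper grapheme-to-phoneme tool for each language and tune the three thresholds on held-out data.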

Topics

Code-Switching ASR
Contrastive Learning
LLM-generated Data
Near-Miss Hypotheses
Whisper Fine-tuning
LoRA

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.