Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data
Summary
This work introduces an audio-language model (Audio-LLM) that filters noisy training data for end-to-end speech-to-speech translation (S2ST). Large-scale corpora often contain noise, misalignment, and semantic errors, all of which degrade S2ST performance. The model makes keep/drop decisions on paired speech directly from audio. It is trained with a scalable two-stage Rank-to-Distill strategy: a lightweight ranker first generates keep/drop pseudo-labels from noisy speech pairs, and an Audio-LLM then learns to predict those decisions from raw paired speech. The resulting model captures both acoustic fidelity and cross-lingual semantic consistency for data selection. Experiments on the CVSS-C and SpeechMatrix datasets show consistent improvements over unfiltered training, with gains of up to +1.4 ASR-BLEU on S2ST.
Key takeaway
If you are an ML engineer or NLP specialist working on speech-to-speech translation, this research offers a way to improve model performance by cleaning the training data rather than changing the model. The two-stage Rank-to-Distill strategy lets you filter noisy large-scale training datasets directly from audio, selecting pairs for both acoustic fidelity and semantic consistency. The reported gains over unfiltered training reach up to +1.4 ASR-BLEU on S2ST tasks.
Key insights
An Audio-LLM trained with the Rank-to-Distill strategy can filter noisy speech-to-speech translation data directly from audio, improving S2ST performance.
Principles
- Filtering noisy data is crucial for robust S2ST.
- Audio-LLMs can capture acoustic fidelity and semantic consistency.
- Pseudo-labeling enables scalable supervision for data filtering.
Method
A two-stage Rank-to-Distill strategy: a lightweight ranker first generates keep/drop pseudo-labels from noisy speech pairs, then an Audio-LLM is trained to predict these decisions directly from raw paired speech.
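The two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how the ranker scores a pair or where the keep/drop cutoff sits, so `score_fn`, `keep_fraction`, the `SpeechPair` type, and the example file names are all hypothetical. Stage 1 ranks pairs by score and labels the top fraction as keep; the result is then reshaped into training examples that a stage 2 model could learn from.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SpeechPair:
    """A source/target utterance pair (hypothetical type; paths are placeholders)."""
    source_wav: str
    target_wav: str


def pseudo_label(
    pairs: List[SpeechPair],
    score_fn: Callable[[SpeechPair], float],
    keep_fraction: float = 0.7,
) -> List[Tuple[SpeechPair, int]]:
    """Stage 1 (assumed form): rank pairs by score, label the top fraction 1 (keep), the rest 0 (drop)."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    scores = [score_fn(p) for p in pairs]
    # Indices sorted from highest to lowest score.
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    n_keep = int(keep_fraction * len(pairs))  # truncates toward zero
    kept = set(order[:n_keep])
    # Labels are returned in the original input order.
    return [(p, 1 if i in kept else 0) for i, p in enumerate(pairs)]


def to_distillation_examples(
    labeled: List[Tuple[SpeechPair, int]],
) -> List[Dict[str, str]]:
    """Stage 2 input (assumed form): raw audio paths plus the keep/drop target to predict."""
    return [
        {
            "source_wav": p.source_wav,
            "target_wav": p.target_wav,
            "label": "keep" if y == 1 else "drop",
        }
        for p, y in labeled
    ]


# Toy usage: the scores stand in for whatever a lightweight ranker would output.
pairs = [SpeechPair(f"s{i}.wav", f"t{i}.wav") for i in range(4)]
toy_scores = {"s0.wav": 0.9, "s1.wav": 0.1, "s2.wav": 0.6, "s3.wav": 0.3}
labeled = pseudo_label(pairs, lambda p: toy_scores[p.source_wav], keep_fraction=0.5)
examples = to_distillation_examples(labeled)
```

The point of the split is that the student model sees only the two audio files and the label, so at inference time it can decide keep/drop from raw paired speech without whatever signals the ranker relied on.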
In practice
- Apply Rank-to-Distill for S2ST data curation.
- Train Audio-LLMs for direct audio-based filtering.
- Improve S2ST models with filtered CVSS-C/SpeechMatrix data.
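Once a filter exists, applying it to a corpus amounts to a pass over the data manifest. The sketch below assumes a manifest of rows with `source_wav` and `target_wav` keys and a `predict_keep` callable standing in for the trained Audio-LLM; both are hypothetical, since the summary does not describe the datasets' file layout or the model's interface. It also reports the retention rate, which is worth checking before retraining.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, str]


def filter_manifest(
    rows: Iterable[Row],
    predict_keep: Callable[[str, str], bool],
) -> Tuple[List[Row], float]:
    """Return the rows the filter keeps and the fraction of rows retained."""
    all_rows = list(rows)
    kept = [r for r in all_rows if predict_keep(r["source_wav"], r["target_wav"])]
    # Guard against an empty manifest to avoid dividing by zero.
    retention = len(kept) / len(all_rows) if all_rows else 0.0
    return kept, retention


# Toy usage: a stand-in predicate that drops any pair whose path mentions "noisy".
manifest = [
    {"source_wav": "a_src.wav", "target_wav": "a_tgt.wav"},
    {"source_wav": "noisy_b_src.wav", "target_wav": "b_tgt.wav"},
    {"source_wav": "c_src.wav", "target_wav": "c_tgt.wav"},
    {"source_wav": "d_src.wav", "target_wav": "noisy_d_tgt.wav"},
]
kept_rows, retention = filter_manifest(
    manifest, lambda src, tgt: "noisy" not in src and "noisy" not in tgt
)
```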
Topics
- Speech-to-Speech Translation
- Audio-Language Models
- Data Filtering
- Training Data Curation
- Rank-to-Distill
- ASR-BLEU
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.