Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data

2026-06-11 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Large-scale corpora for end-to-end speech-to-speech translation (S2ST) often contain noise, misalignment, and semantic errors, all of which degrade S2ST performance. This work introduces an audio-language model (Audio-LLM) that filters such data by making keep/drop decisions on paired speech directly from audio. It is trained with a scalable two-stage Rank-to-Distill strategy: a lightweight ranker first generates keep/drop pseudo-labels from noisy speech pairs, then an Audio-LLM learns to predict these decisions from raw paired speech. The resulting model accounts for both acoustic fidelity and cross-lingual semantic consistency when selecting data. Experiments on the CVSS-C and SpeechMatrix datasets show consistent improvements over unfiltered training, with gains of up to +1.4 ASR-BLEU for S2ST.

Key takeaway

If you are an ML engineer or NLP specialist working on speech-to-speech translation, this research offers a way to filter noisy large-scale training data directly from audio, without relying on transcripts. The two-stage Rank-to-Distill strategy selects speech pairs for both acoustic fidelity and cross-lingual semantic consistency. Reported gains over unfiltered training reach up to +1.4 ASR-BLEU on CVSS-C and SpeechMatrix.

Key insights

An Audio-LLM trained with the Rank-to-Distill strategy can filter noisy speech-to-speech translation data directly from audio, improving S2ST performance over unfiltered training.

Principles

Filtering noisy data is crucial for robust S2ST.
Audio-LLMs can capture acoustic fidelity and semantic consistency.
Pseudo-labeling enables scalable supervision for data filtering.

Method

A two-stage Rank-to-Distill strategy. In the first stage, a lightweight ranker generates keep/drop pseudo-labels from noisy speech pairs. In the second stage, an Audio-LLM is trained to predict these decisions directly from raw paired speech.
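
The two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how the ranker scores pairs or how pseudo-labels are cut off, so the `ranker_score` callable, the `keep_fraction` threshold, the prompt text, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SpeechPair:
    """One source/target utterance pair (file names are hypothetical)."""
    source_wav: str
    target_wav: str


# Hypothetical instruction text; the actual prompt is not given in the summary.
PROMPT = "Decide whether this speech pair should be kept for S2ST training."


def pseudo_label(
    pairs: List[SpeechPair],
    ranker_score: Callable[[SpeechPair], float],
    keep_fraction: float = 0.7,  # assumed cutoff, not a value from the paper
) -> List[Tuple[SpeechPair, str]]:
    """Stage 1: rank pairs by a cheap quality score; label the top fraction 'keep'."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    ranked = sorted(pairs, key=ranker_score, reverse=True)
    n_keep = int(len(ranked) * keep_fraction)
    return [(p, "keep" if i < n_keep else "drop") for i, p in enumerate(ranked)]


def to_distillation_examples(
    labeled: List[Tuple[SpeechPair, str]],
) -> List[Dict[str, object]]:
    """Stage 2 input: pair raw audio with the pseudo-label as the training target."""
    return [
        {"audio": [p.source_wav, p.target_wav], "prompt": PROMPT, "target": label}
        for p, label in labeled
    ]
```

The examples produced by `to_distillation_examples` would then feed whatever fine-tuning setup the Audio-LLM uses; that training loop is outside the scope of this sketch.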

In practice

Apply Rank-to-Distill for S2ST data curation.
Train Audio-LLMs for direct audio-based filtering.
Improve S2ST models with filtered CVSS-C/SpeechMatrix data.
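
Once a filter model exists, applying it to a corpus is a pass over the pair manifest. The sketch below assumes a manifest of rows with `source_wav` and `target_wav` fields and a `predict_keep` callable that wraps the trained Audio-LLM; both are hypothetical stand-ins, since the summary does not describe the data format or the inference interface.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, str]  # assumed manifest row with "source_wav" and "target_wav" keys


def filter_manifest(
    rows: Iterable[Row],
    predict_keep: Callable[[str, str], bool],  # hypothetical wrapper around the filter model
) -> Tuple[List[Row], float]:
    """Keep only rows the filter accepts; also return the fraction retained."""
    kept: List[Row] = []
    total = 0
    for row in rows:
        total += 1
        if predict_keep(row["source_wav"], row["target_wav"]):
            kept.append(row)
    retention = len(kept) / total if total else 0.0
    return kept, retention
```

Tracking the retention rate alongside the kept rows makes it easier to compare filtered and unfiltered training runs at a known data budget.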

Topics

Speech-to-Speech Translation
Audio-Language Models
Data Filtering
Training Data Curation
Rank-to-Distill
ASR-BLEU

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.