Closing the Gap Between Text and Speech Understanding in LLMs
Summary
Large Language Models (LLMs) adapted for speech inputs consistently underperform both their text-based counterparts and cascaded pipelines on language understanding tasks, a phenomenon termed the "text-speech understanding gap." Existing methods to close this gap often rely on expensive large-scale speech synthesis or proprietary datasets, which creates a need for more data-efficient solutions. This research identifies two primary drivers of the gap: forgetting of text capabilities during adaptation, and cross-modal misalignment between speech and text. To address both, the authors introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation), a method that combines cross-modal distillation with targeted synthetic data to improve alignment and mitigate forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance across broad-domain benchmarks in knowledge, language understanding, and reasoning while using significantly less public speech data.
Key takeaway
Research scientists developing speech-adapted LLMs should consider methods like SALAD that address both text capability forgetting and cross-modal misalignment. Focusing on sample-efficient alignment with targeted synthetic data can significantly improve performance on language understanding tasks while reducing reliance on extensive, costly speech datasets.
Key insights
Speech-adapted LLMs underperform their text-based counterparts because of two factors: forgetting of text capabilities and cross-modal misalignment.
Principles
- Forgetting text capabilities during speech adaptation degrades language understanding.
- Cross-modal misalignment between speech and text degrades speech understanding.
Method
SALAD combines cross-modal distillation with targeted synthetic data to improve alignment and reduce forgetting in speech-adapted LLMs.
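To make the distillation idea concrete, here is a minimal sketch of one common form of cross-modal distillation: the model's output distribution on a text input acts as the teacher, and the same model's output on the matching speech input is pulled toward it with a KL divergence. This is an illustration under stated assumptions, not the authors' implementation. The function names, the choice of KL(teacher || student), the temperature parameter, and the toy logits are all hypothetical; the summary does not specify the exact objective SALAD uses.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modal_distillation_loss(text_logits, speech_logits, temperature=1.0):
    """Mean KL(teacher || student) over token positions.

    text_logits:   per-position logits from the text input (teacher, assumed fixed).
    speech_logits: per-position logits from the paired speech input (student).
    Both are lists of equal length, each entry a list over the vocabulary.
    """
    if not text_logits or len(text_logits) != len(speech_logits):
        raise ValueError("expected two non-empty, equally long logit sequences")
    per_position = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(text_logits, speech_logits)
    ]
    return sum(per_position) / len(per_position)


# Toy example (hypothetical values): 2 token positions, vocabulary of size 3.
toy_text_logits = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]]
toy_speech_logits = [[1.0, 1.0, -0.5], [0.4, 0.2, 0.9]]
print(cross_modal_distillation_loss(toy_text_logits, toy_speech_logits))
```

The loss is zero when the speech-conditioned distribution matches the text-conditioned one and grows as they diverge, which is why minimizing it can both tighten alignment and anchor the model to its original text behavior.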
In practice
- Use targeted synthetic data for LLM adaptation.
- Apply cross-modal distillation for better alignment.
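One way to read "targeted" synthetic data is as a budgeted selection problem: synthesize speech only for the text samples where the model seems least aligned. The sketch below assumes each candidate prompt already has a misalignment score (for example, a divergence between the model's text and speech outputs) and picks the highest-scoring prompts up to a synthesis budget. The scoring criterion, the function name `select_for_synthesis`, and the sample data are hypothetical; the summary does not describe how SALAD's active selection actually ranks samples.

```python
def select_for_synthesis(gap_by_prompt, budget):
    """Return up to `budget` prompts with the largest misalignment score.

    gap_by_prompt: dict mapping a text prompt to a non-negative score
                   (assumed precomputed; higher means worse alignment).
    budget:        maximum number of prompts to send to speech synthesis.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Sort by score, largest first; ties keep their original insertion order.
    ranked = sorted(gap_by_prompt.items(), key=lambda item: item[1], reverse=True)
    return [prompt for prompt, _ in ranked[:budget]]


# Hypothetical scores for three candidate prompts.
candidate_gaps = {
    "What is the boiling point of water?": 0.12,
    "Explain why the sky is blue.": 0.87,
    "Solve 17 * 23 step by step.": 0.45,
}
print(select_for_synthesis(candidate_gaps, budget=2))
```

Spending the synthesis budget on the worst-aligned samples is what would make such a scheme sample-efficient compared with synthesizing speech for an entire text corpus.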
Topics
- Large Language Models
- Speech Understanding
- Cross-modal Alignment
- Data-efficient Learning
- SALAD
Best for: Research Scientist, AI Researcher, AI Scientist, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.