Closing the Gap Between Text and Speech Understanding in LLMs

2026-02-25 · Source: Apple Machine Learning Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Large Language Models (LLMs) adapted for speech inputs consistently underperform both their text-only counterparts and cascaded pipelines (speech recognition followed by a text LLM) on language understanding tasks, a phenomenon termed the "text-speech understanding gap." Existing methods to close this gap often rely on expensive large-scale speech synthesis or on proprietary datasets, which creates a need for more data-efficient solutions. This research identifies two primary drivers of the gap: forgetting of text capabilities during adaptation, and cross-modal misalignment between speech and text. To address both, the authors introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation), a method that combines cross-modal distillation with targeted synthetic data. SALAD aims to improve alignment and mitigate forgetting. Applied to 3B and 7B LLMs, it achieves competitive performance across broad-domain benchmarks in knowledge, language understanding, and reasoning while training on significantly less speech data, drawn from public sources.

Key takeaway

Research scientists developing speech-adapted LLMs should consider methods like SALAD that address both forgetting of text capabilities and cross-modal misalignment. Sample-efficient alignment with targeted synthetic data can improve performance on language understanding tasks while reducing reliance on large, costly speech datasets.

Key insights

Speech-adapted LLMs underperform text-based counterparts due to forgetting and cross-modal misalignment.

Principles

Forgetting of text capabilities during speech adaptation degrades the adapted LLM.
Cross-modal misalignment between speech and text limits speech understanding.

Method

SALAD combines cross-modal distillation with targeted synthetic data to improve alignment and reduce forgetting in speech-adapted LLMs.
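
As a rough illustration of what cross-modal distillation can look like, the sketch below treats the model's output on a text prompt as the teacher and its output on the spoken version of the same prompt as the student, then penalizes the KL divergence between the two next-token distributions. This is a minimal sketch under stated assumptions, not SALAD's actual objective: the summary does not specify the loss, and the function names, the temperature parameter, and the choice of forward KL are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def cross_modal_distillation_loss(text_logits, speech_logits, temperature=1.0):
    """Mean per-position KL(teacher || student).

    text_logits:   teacher outputs for the text input, one logit list per position.
    speech_logits: student outputs for the spoken input, same shape (assumed).
    """
    if not text_logits or len(text_logits) != len(speech_logits):
        raise ValueError("teacher and student must cover the same non-empty positions")
    losses = []
    for teacher_row, student_row in zip(text_logits, speech_logits):
        teacher = softmax(teacher_row, temperature)
        student = softmax(student_row, temperature)
        losses.append(kl_divergence(teacher, student))
    return sum(losses) / len(losses)
```

The loss is zero when the speech-conditioned distribution matches the text-conditioned one and grows as the two drift apart, which is why a term of this kind can target misalignment and forgetting at the same time.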

In practice

Use targeted synthetic data, rather than large-scale synthesis, when adapting an LLM to speech.
Apply cross-modal distillation for better speech-text alignment.
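
One way to make "targeted" concrete is to spend a fixed synthesis budget on the prompts where the speech-adapted model diverges most from its text behavior. The sketch below assumes a per-prompt gap score has already been computed (for example, a divergence between text-conditioned and speech-conditioned outputs). The summary does not describe how SALAD's active selection scores or picks samples, so the scoring input, the function name, and the top-k rule are all hypothetical.

```python
def select_for_synthesis(gap_by_prompt, budget):
    """Return up to `budget` prompt ids with the largest text-speech gap.

    gap_by_prompt: dict mapping a prompt id to a non-negative gap score
                   (hypothetical; higher means worse speech-text agreement).
    budget:        maximum number of prompts to synthesize speech for.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Rank prompts from largest gap to smallest and keep the top `budget`.
    ranked = sorted(gap_by_prompt.items(), key=lambda item: item[1], reverse=True)
    return [prompt_id for prompt_id, _ in ranked[:budget]]

# Usage: only the selected prompts would be sent to a speech synthesizer.
gaps = {"geography-01": 0.12, "math-07": 0.91, "trivia-33": 0.48}
selected = select_for_synthesis(gaps, budget=2)
```

Ranking by gap keeps the amount of synthesized speech small while concentrating it where alignment is weakest, which is the general idea behind sample-efficient selection.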

Topics

Large Language Models
Speech Understanding
Cross-modal Alignment
Data-efficient Learning
SALAD

Best for: Research Scientist, AI Researcher, AI Scientist, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.