Continuous Audio Thinking for Large Audio Language Models

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Continuous Audio Thinking (CoAT) is a framework that gives Large Audio Language Models (LALMs) a continuous latent workspace for organizing acoustic information before they generate a textual response. LALMs often lose rich acoustic detail, such as phonetic nuance, prosody, and sound events, because their internal states prioritize text generation. CoAT addresses this by inserting a "thinking block" between the audio input and the response and supervising it through distillation from five audio experts covering reconstruction, speech content, sound events, paralinguistic features, and pitch. The block is processed in a single prefill, so it incurs no additional autoregressive decoding cost. Evaluated on Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, CoAT consistently improved performance across diverse benchmarks, including audio reasoning, understanding, music classification, speech emotion, and transcription, with lower latency than discrete text chain-of-thought methods.

Key takeaway

Machine learning engineers developing Large Audio Language Models should consider integrating Continuous Audio Thinking (CoAT) to improve acoustic understanding and reasoning. CoAT improves performance on tasks such as speech emotion and music classification while reducing inference latency relative to text chain-of-thought. Its two-stage expert distillation grounds the latent states in diverse acoustic dimensions, which helps models retain critical non-textual audio information.

Key insights

CoAT enables LALMs to retain and organize acoustic information in a continuous latent space, improving understanding without added autoregressive decoding cost.
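
The decoding-cost claim can be made concrete with a small counting sketch. It is not taken from the paper: `RequestShape`, the function names, and every token count are hypothetical, and "decode steps" is simplified to one step per generated token. The sketch only contrasts where the extra work lands. Text chain-of-thought adds generated tokens, while a latent block adds positions to the prefill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestShape:
    """Hypothetical token counts for one audio question-answering request."""
    audio_tokens: int    # encoded audio positions
    prompt_tokens: int   # text instruction positions
    answer_tokens: int   # tokens in the final response


def decode_steps_text_cot(shape: RequestShape, cot_tokens: int) -> int:
    """Text chain-of-thought: reasoning tokens are generated one by one
    before the answer, so they add sequential decode steps."""
    return cot_tokens + shape.answer_tokens


def decode_steps_latent_block(shape: RequestShape, block_slots: int) -> int:
    """Latent block: the slots are consumed during prefill, so the number
    of decode steps does not depend on block_slots."""
    return shape.answer_tokens


def prefill_length(shape: RequestShape, block_slots: int = 0) -> int:
    """Positions processed in the single prefill pass; a latent block
    makes this pass longer rather than adding decode steps."""
    return shape.audio_tokens + shape.prompt_tokens + block_slots
```

Under these assumptions, a request with a 30-token answer needs 230 decode steps with a 200-token textual rationale and 30 with a latent block of any size. The block's cost shows up only as a longer prefill, which runs in parallel over positions.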

Method

CoAT inserts a continuous latent "thinking block" between audio input and response. This block's hidden states are supervised by distilling frame-level features from five audio experts using a two-stage training schedule.
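
A minimal NumPy sketch of what multi-expert distillation onto the thinking block could look like. Nearly everything here is an assumption made for illustration: the mean-squared-error objective, the linear projection heads, the average pooling that aligns expert frames to block slots, and all names (`EXPERTS`, `pool_frames`, `distillation_loss`). This summary does not specify the paper's losses, heads, or alignment, and it does not describe what the two training stages contain, so the schedule is not modeled.

```python
import numpy as np

# Expert categories named in the summary; the identifiers are hypothetical.
EXPERTS = ("reconstruction", "speech_content", "sound_events",
           "paralinguistic", "pitch")


def pool_frames(features: np.ndarray, num_slots: int) -> np.ndarray:
    """Average-pool (T, d_e) frame-level features into (num_slots, d_e).

    Assumed alignment scheme: split the T frames into num_slots contiguous
    chunks and average each chunk.
    """
    if features.shape[0] < num_slots:
        raise ValueError("need at least one frame per slot")
    chunks = np.array_split(features, num_slots, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])


def distillation_loss(block_states, expert_features, heads, weights=None):
    """Weighted sum of per-expert MSE between projected block states and
    pooled expert targets.

    block_states:    (K, d) hidden states of the thinking block
    expert_features: dict name -> (T_e, d_e) frame-level expert features
    heads:           dict name -> (d, d_e) linear projection per expert
    """
    num_slots = block_states.shape[0]
    per_expert = {}
    total = 0.0
    for name in EXPERTS:
        target = pool_frames(expert_features[name], num_slots)
        pred = block_states @ heads[name]
        per_expert[name] = float(np.mean((pred - target) ** 2))
        weight = 1.0 if weights is None else weights[name]
        total += weight * per_expert[name]
    return total, per_expert
```

In a real training loop this term would be computed in an autodiff framework and combined with the usual response loss. The per-expert breakdown is returned so that individual acoustic dimensions can be monitored or reweighted.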

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.