Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings
Summary
A study investigated large language models' (LLMs) abilities in predicting turn-taking dynamics within multimodal multi-party conversations, specifically focusing on addressee detection, turn-change prediction, and next speaker prediction. Experiments conducted on the AMI corpus compared text-based LLMs, multimodal LLMs (MM-LLMs), supervised models, and human subjects. The findings revealed that text-based LLMs surprisingly surpassed both supervised models and human performance in next speaker prediction, despite lacking domain-specific training and access to audio or visual information. While MM-LLMs improved upon text-based LLMs for addressee detection and turn-change prediction, they did not reach human-level accuracy, indicating challenges in effectively utilizing raw audio-visual signals. Ablation analyses highlighted the critical role of conversational context, particularly for accurate next speaker prediction, and noted similar prediction patterns and shared difficulties during frequent turn changes between humans and LLMs.
Key takeaway
NLP engineers building meeting transcription or assistant tools should prioritize text-based LLMs for next speaker prediction: in this study they outperformed both supervised models and human subjects without any audio or visual input. For addressee detection or turn-change prediction, consider multimodal LLMs, but expect them to fall short of human accuracy, since they do not yet make effective use of raw audio-visual signals. Invest in robust handling of conversational context, which the ablation analyses found critical, particularly for next speaker prediction.
Key insights
Text-based LLMs excel at next speaker prediction in meetings, even without multimodal input.
Principles
- Conversational context is critical for turn-taking.
- MM-LLMs struggle with raw audio-visual signal integration.
- Frequent turn changes challenge both humans and LLMs.
Method
An evaluation framework was constructed for addressee detection, turn-change, and next speaker prediction, comparing supervised models, text-based LLMs, MM-LLMs, and humans on the AMI corpus.
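As a rough illustration of how such a framework might score two of the tasks, the sketch below derives turn-change and next speaker labels from a speaker-labelled transcript and computes accuracy against them. The `Utterance` type, the function names, the label definition (a turn change is any point where the following utterance has a different speaker), and the toy meeting are all assumptions made for illustration; the summary does not give the study's exact task definitions or metrics.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str  # participant label, e.g. "A"
    text: str


def derive_labels(utterances):
    """Return one (turn_change, next_speaker) pair per utterance except the last.

    Assumed definition: a turn change occurs when the following
    utterance is spoken by a different participant.
    """
    labels = []
    for current, following in zip(utterances, utterances[1:]):
        labels.append((current.speaker != following.speaker, following.speaker))
    return labels


def accuracy(predictions, references):
    """Fraction of predictions that exactly match the references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy meeting, invented for illustration (not taken from the AMI corpus).
meeting = [
    Utterance("A", "Shall we start with the remote design?"),
    Utterance("B", "Sure, I have the sketches."),
    Utterance("B", "Here is the first one."),
    Utterance("C", "Looks good to me."),
]

labels = derive_labels(meeting)
turn_change_refs = [changed for changed, _ in labels]
next_speaker_refs = [speaker for _, speaker in labels]

# Hypothetical outputs from some model under evaluation.
turn_change_preds = [True, True, True]
next_speaker_preds = ["B", "A", "C"]

print("turn-change accuracy:", accuracy(turn_change_preds, turn_change_refs))
print("next-speaker accuracy:", accuracy(next_speaker_preds, next_speaker_refs))
```

The same `accuracy` helper could be applied to the outputs of supervised models, text-based LLMs, MM-LLMs, and human annotators, so that all four are compared against identical reference labels.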
In practice
- Prioritize text context for next speaker prediction.
- Evaluate MM-LLMs carefully for multimodal signal use.
- Anticipate difficulty in rapid turn-taking scenarios.
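One way to act on the first point is to pass a text-only model a window of recent transcript turns and ask who speaks next. The sketch below only builds such a prompt; the function name `build_next_speaker_prompt`, the `context_size` parameter, and the prompt wording are hypothetical, and the prompts actually used in the study are not described in this summary.

```python
def build_next_speaker_prompt(history, participants, context_size=5):
    """Build a text-only prompt from the most recent transcript turns.

    history: list of (speaker, text) tuples in chronological order.
    participants: list of participant labels the model may answer with.
    context_size: number of most recent turns to include (an assumed knob).
    """
    if context_size < 1:
        raise ValueError("context_size must be at least 1")
    recent = history[-context_size:]
    parts = ["Participants: " + ", ".join(participants), "Transcript so far:"]
    parts.extend(f"{speaker}: {text}" for speaker, text in recent)
    parts.append("Who speaks next? Answer with one participant label.")
    return "\n".join(parts)


# Toy history, invented for illustration.
history = [
    ("A", "Shall we start with the remote design?"),
    ("B", "Sure, I have the sketches."),
    ("C", "Can you share them on the screen?"),
]

prompt = build_next_speaker_prompt(history, ["A", "B", "C"], context_size=2)
print(prompt)
```

Varying `context_size` and re-scoring the predictions gives a simple way to check how much conversational context a given model needs, in the spirit of the ablation analyses mentioned in the summary.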
Topics
- Large Language Models
- Turn-taking Prediction
- Multimodal LLMs
- Conversational AI
- Meeting Transcription
- Addressee Detection
- Next Speaker Prediction
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.