Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

2026-06-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

A study evaluated how well large language models (LLMs) predict turn-taking dynamics in multimodal multi-party conversations, focusing on three tasks: addressee detection, turn-change prediction, and next speaker prediction. Experiments on the AMI corpus compared text-based LLMs, multimodal LLMs (MM-LLMs), supervised models, and human subjects. Text-based LLMs surpassed both supervised models and human subjects in next speaker prediction, despite having no domain-specific training and no access to audio or visual information. MM-LLMs improved on text-based LLMs for addressee detection and turn-change prediction but did not reach human-level accuracy, which suggests they do not yet make effective use of raw audio-visual signals. Ablation analyses showed that conversational context is critical, particularly for next speaker prediction. Humans and LLMs also showed similar prediction patterns, and both struggled when turn changes were frequent.

Key takeaway

NLP engineers building meeting transcription or assistant tools should prioritize text-based LLMs for next speaker prediction: in this study they outperformed both supervised models and human subjects without any audio or visual input. For addressee detection or turn-change prediction, consider multimodal LLMs, but expect them to fall short of human accuracy, since they do not yet use raw audio-visual signals effectively. Invest in robust handling of conversational context to improve turn-taking predictions overall.

Key insights

Text-based LLMs excel at next speaker prediction in meetings, even without multimodal input.

Principles

Conversational context is critical for turn-taking.
MM-LLMs struggle with raw audio-visual signal integration.
Frequent turn changes challenge both humans and LLMs.

Method

An evaluation framework was constructed for addressee detection, turn-change, and next speaker prediction, comparing supervised models, text-based LLMs, MM-LLMs, and humans on the AMI corpus.
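
As a rough illustration of what such a framework has to compute, the sketch below derives turn-change and next-speaker labels from a speaker-attributed transcript and scores predictions by accuracy. The `Utterance` type, the function names, and the toy meeting are hypothetical; the study's actual label definitions, data format, and metrics may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    speaker: str  # hypothetical participant label, e.g. "A"
    text: str


def derive_labels(utterances: List[Utterance]) -> List[Tuple[bool, str]]:
    """For each utterance except the last, return (turn_changed, next_speaker).

    Assumption: a turn change is simply "the following utterance has a
    different speaker"; the study may define it differently.
    """
    labels = []
    for current, following in zip(utterances, utterances[1:]):
        labels.append((current.speaker != following.speaker, following.speaker))
    return labels


def accuracy(predictions: list, gold: list) -> float:
    """Fraction of predictions that match the gold labels exactly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy meeting excerpt (invented for illustration, not taken from AMI).
meeting = [
    Utterance("A", "Shall we start with the design?"),
    Utterance("B", "Sure, I have the slides."),
    Utterance("B", "The first slide covers the budget."),
    Utterance("C", "What is the price target?"),
]

labels = derive_labels(meeting)
gold_next = [next_speaker for _, next_speaker in labels]
gold_change = [changed for changed, _ in labels]

# Hypothetical model outputs for the three prediction points.
next_speaker_acc = accuracy(["B", "B", "A"], gold_next)
turn_change_acc = accuracy([True, False, False], gold_change)
print(labels)
print(next_speaker_acc, turn_change_acc)
```

The same `accuracy` function can score any system that emits one label per prediction point, which is what makes a side-by-side comparison of supervised models, LLMs, and human annotators possible.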

In practice

Prioritize text context for next speaker prediction.
Evaluate MM-LLMs carefully for multimodal signal use.
Anticipate difficulty in rapid turn-taking scenarios.
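
One way to act on the first point is to build a text-only prompt from a window of recent utterances. The sketch below is a minimal example under stated assumptions: the function name, the prompt wording, and the `context_size` parameter are hypothetical and are not the prompts used in the study. Varying `context_size` gives a simple way to check how much context a given model needs.

```python
from typing import List, Tuple


def build_next_speaker_prompt(
    history: List[Tuple[str, str]],
    participants: List[str],
    context_size: int = 10,
) -> str:
    """Format the last `context_size` utterances into a text-only prompt.

    `history` holds (speaker, text) pairs in chronological order.
    The prompt wording is illustrative, not taken from the study.
    """
    if context_size < 1:
        raise ValueError("context_size must be at least 1")
    recent = history[-context_size:]  # keep only the most recent utterances
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in recent)
    options = ", ".join(participants)
    return (
        "Below is an excerpt from a meeting transcript.\n\n"
        f"{transcript}\n\n"
        f"Participants: {options}\n"
        "Who speaks next? Answer with one participant label only."
    )


# Toy history (invented for illustration).
history = [
    ("A", "Shall we start with the design?"),
    ("B", "Sure, I have the slides."),
    ("C", "What is the price target?"),
]
prompt = build_next_speaker_prompt(history, ["A", "B", "C", "D"], context_size=2)
print(prompt)
```

The resulting string can be sent to whichever LLM client is in use; no model call is shown here because the choice of API is outside what the summary describes.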

Topics

Large Language Models
Turn-taking Prediction
Multimodal LLMs
Conversational AI
Meeting Transcription
Addressee Detection
Next Speaker Prediction

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.