Listening Between the Lines: Joint Learning of ASR Embeddings and LLM-Augmented Linguistics for Dementia Detection

2026-06-26 · Source: Artificial Intelligence · Field: Health & Wellbeing — Medical Devices & Health Technology, Health & Medical Research, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A novel multimodal framework has been developed for the early detection of dementia through speech analysis, addressing challenges in capturing both acoustic and linguistic biomarkers. This system utilizes Whisper for dual-purpose extraction, generating acoustic representations from its encoder outputs and transcripts via automatic speech recognition (ASR). The acoustic pathway employs temporal networks with attention pooling to aggregate variable-length sequences into fixed-dimensional embeddings. Concurrently, a large language model (LLM) is prompted to extract interpretable linguistic features, including lexical diversity, syntactic complexity, semantic coherence, and discourse patterns. A gated fusion network then integrates these two distinct modalities. The method achieved F1-scores of 89.47% on the ADReSS dataset and 90.14% on ADReSSo, with ablation studies confirming that multimodal fusion consistently surpasses the performance of either modality used in isolation.

Key takeaway

For AI Scientists developing diagnostic tools, this research suggests integrating multimodal speech analysis can significantly boost dementia detection accuracy. If you are designing systems for early screening, consider combining acoustic features from ASR encoders with LLM-derived linguistic markers. Your models will benefit from the complementary strengths of both data types, achieving F1-scores above 89% on established benchmarks like ADReSS and ADReSSo.

Key insights

Jointly learning ASR embeddings and LLM-augmented linguistics significantly improves dementia detection from speech.

Principles

Multimodal fusion enhances diagnostic accuracy.
LLMs can extract nuanced linguistic biomarkers.
Acoustic and linguistic features are complementary.

Method

The framework uses Whisper for acoustic embeddings and ASR transcripts. An LLM extracts linguistic features, which are then integrated with acoustic data via a gated fusion network for dementia detection.

In practice

Apply Whisper for dual acoustic/transcript extraction.
Prompt LLMs for diverse linguistic feature sets.
Integrate modalities with gated fusion networks.

Topics

Dementia Detection
Speech Analysis
Multimodal AI
ASR Embeddings
Large Language Models
Linguistic Biomarkers

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.