Segment-Level Mandarin Chinese Speech-Based Cognitive Impairment Detection via an Autoencoder with Contrastive Learning

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Science & Research — Health & Medical Research, Mathematics & Computational Sciences, Life Sciences & Biology · Depth: Expert, extended

Summary

This paper presents a segment-level representation learning framework for speech-based cognitive impairment detection in Mandarin Chinese. To cope with limited labeled data and cross-dataset variability, it divides each speech recording into 5-second segments and converts them into spectrograms. The model integrates a GRU-based autoencoder with supervised contrastive learning and combines offline and online spectrogram augmentation strategies. Experiments on four independent Mandarin Chinese speech datasets showed stable and competitive performance, with overall accuracy exceeding 96%. The highest accuracy, 98.61%, was obtained on the Ye dataset; on the more challenging three-class classification task of the NCMMSC2021 dataset, accuracy reached 96.83%. Ablation studies confirmed that supervised contrastive learning is crucial for performance, with offline augmentation providing complementary benefits.

Key takeaway

For AI scientists developing speech-based diagnostic tools, this framework offers a robust approach to cognitive impairment detection, especially in low-resource Mandarin Chinese contexts. Consider segment-level modeling with GRU autoencoders and supervised contrastive learning. In the reported ablations, supervised contrastive learning accounts for most of the gain, and offline spectrogram augmentation adds a complementary benefit, so combining offline and online augmentation can improve model stability and discriminative power, including in the harder three-class setting.

Key insights

Segment-level speech representation with autoencoders and contrastive learning improves cognitive impairment detection in low-resource settings.

Principles

Combine reconstruction and contrastive objectives.
Augment spectrograms both offline and online.
Segment speech to increase training data.
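
The first principle can be illustrated with a minimal NumPy sketch. It assumes the two objectives are summed with a scalar weight and that reconstruction is scored with mean squared error. The summary does not state the weighting, the temperature, or the exact loss form, so `weight`, `temperature`, and both function names are hypothetical. The contrastive term follows the general supervised contrastive formulation, in which same-label samples in a batch are treated as positives.

```python
import numpy as np


def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of latent vectors.

    z: (n, d) embeddings; labels: (n,) class labels.
    Anchors with no same-label partner in the batch are skipped.
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0
    if not valid.any():
        return 0.0

    # Cosine similarity between unit-normalized embeddings, scaled by temperature.
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    sim = z @ z.T / temperature

    # Exclude each anchor from its own denominator, then take a stable log-softmax.
    sim = np.where(not_self, sim, -np.inf)
    sim = sim - sim.max(axis=1, keepdims=True)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    # Average log-probability of the positives for each valid anchor.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / n_pos[valid]).mean())


def combined_loss(x, x_rec, z, labels, weight=1.0, temperature=0.1):
    """Reconstruction (MSE) plus weighted supervised contrastive loss (assumed form)."""
    reconstruction = float(np.mean((np.asarray(x) - np.asarray(x_rec)) ** 2))
    contrastive = supervised_contrastive_loss(z, labels, temperature)
    return reconstruction + weight * contrastive
```

Here `x` and `x_rec` stand for a batch of input spectrogram segments and their autoencoder reconstructions, and `z` for the corresponding latent vectors. In a real training loop the same computation would be written in an autodiff framework so gradients can flow to the encoder.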

Method

The framework splits each recording into 5-second segments and converts every segment into a log-Mel spectrogram. A GRU autoencoder is trained to reconstruct these spectrograms, while supervised contrastive learning, applied over views produced by combined offline and online spectrogram augmentation, makes the latent representations more discriminative.
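
The segmentation step can be sketched as follows. This is an illustration under assumptions, not the authors' code: the summary does not say whether segments overlap or how trailing audio shorter than 5 seconds is handled, so the sketch uses non-overlapping windows and makes zero-padding of the remainder optional. The function name and the `pad_last` flag are hypothetical, and the log-Mel conversion itself is left to an audio library.

```python
import numpy as np


def split_into_segments(waveform, sample_rate, segment_seconds=5.0, pad_last=False):
    """Split a 1-D waveform into non-overlapping fixed-length segments.

    Returns an array of shape (n_segments, segment_length_in_samples).
    Trailing audio shorter than one segment is dropped unless pad_last=True,
    in which case it is zero-padded to full length (an assumed policy).
    """
    waveform = np.asarray(waveform)
    seg_len = int(round(segment_seconds * sample_rate))
    n_full = len(waveform) // seg_len

    segments = [waveform[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]

    remainder = waveform[n_full * seg_len:]
    if pad_last and len(remainder) > 0:
        padded = np.zeros(seg_len, dtype=waveform.dtype)
        padded[:len(remainder)] = remainder
        segments.append(padded)

    if not segments:
        return np.empty((0, seg_len), dtype=waveform.dtype)
    return np.stack(segments)
```

For example, a 12-second recording sampled at 16 kHz yields two full segments of 80,000 samples each, or three if the last 2 seconds are padded. Every segment inherits the label of its source recording, which is how segmentation multiplies the number of training examples.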

In practice

Apply 5-second segmentation for low-resource speech.
Use SpecAugment for both offline and online data views.
Implement GRU autoencoders with supervised contrastive loss.
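
The augmentation advice above can be sketched with a simplified SpecAugment-style mask in NumPy. This is an assumption-laden illustration: the mask widths, the choice to fill masked regions with the spectrogram mean, and all function names are hypothetical, and time warping is omitted. "Offline" is read here in its usual sense of generating augmented copies once before training, and "online" as drawing fresh random views each time a sample is used; the summary does not spell out either procedure.

```python
import numpy as np


def mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Return a copy of spec (n_mels, n_frames) with one frequency band
    and one time span replaced by the spectrogram mean."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(spec, dtype=float, copy=True)
    n_mels, n_frames = out.shape
    fill = out.mean()

    # Frequency mask: a random band of up to max_freq_width mel bins.
    f = int(rng.integers(0, min(max_freq_width, n_mels) + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[f0:f0 + f, :] = fill

    # Time mask: a random span of up to max_time_width frames.
    t = int(rng.integers(0, min(max_time_width, n_frames) + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[:, t0:t0 + t] = fill
    return out


def offline_augment(specs, copies=2, seed=0):
    """Offline: build a fixed, enlarged training pool once, before training."""
    rng = np.random.default_rng(seed)
    augmented = [mask_spectrogram(s, rng=rng) for s in specs for _ in range(copies)]
    return list(specs) + augmented


def online_views(spec, rng):
    """Online: draw two fresh random views of one segment for the contrastive loss."""
    return mask_spectrogram(spec, rng=rng), mask_spectrogram(spec, rng=rng)
```

In this reading, `offline_augment` runs once over the training segments, and `online_views` is called inside the data loader so that every epoch sees different masks.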

Topics

Cognitive Impairment Detection
Speech Processing
Autoencoders
Contrastive Learning
Spectrogram Augmentation
Mandarin Chinese

Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.