Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Mental Health & Psychological Support) · Depth: Expert, extended

Summary

A study of the ScAN dataset, built over MIMIC-III clinical notes, shows how dataset construction shapes suicidality detection in clinical natural language processing (NLP). The authors argue that suicidality datasets based on electronic health records (EHRs) do not provide neutral ground truth. Instead, they encode a specific "documentation-mediated, episodic, and intent-resolved" operationalization of suicidality. ScAN comprises 12,759 clinical notes from 697 hospital stays, with 19,690 span-level annotations. Taking it as a case study, the analysis shows that governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation yield labels that reflect clinician judgments, bound suicidality to discrete episodes, and infer intent from documentation. A linguistic examination of 15,585 annotated spans further shows that identical labels often cover heterogeneous clinical framings that differ in temporality, negation, and uncertainty: 27.8% of present-SI spans contain historical markers. The dataset's demographic profile is also narrow, at 70% White and 92% English-speaking.

Key takeaway

NLP engineers developing suicidality detection models should critically examine the construction assumptions behind the underlying dataset. Models trained on documentation-mediated, episodic, and intent-resolved labels may make systematic errors, especially on retrospective language and ambiguous cases. Evaluate performance separately across clinical note sections, and consider retaining "unsure" as a distinct category to preserve clinically meaningful uncertainty. Multi-annotator labeling and patient-authored sources can also improve transparency and robustness.
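As a rough illustration of the section-wise evaluation suggested above, the sketch below reports accuracy per note section and tracks recall on the "unsure" class separately, so that it is not folded into the other labels. The label names other than "unsure", the section names, and the `per_section_report` helper are all hypothetical and are not taken from the study or from ScAN.

```python
from collections import defaultdict

# Hypothetical label set; only "unsure" is named in the summary above.
LABELS = ("positive", "negative", "unsure")


def per_section_report(records):
    """Summarize predictions per note section.

    records: iterable of (section, gold_label, predicted_label) tuples.
    Returns {section: {"accuracy": float, "unsure_recall": float or None}}.
    """
    stats = defaultdict(lambda: {"n": 0, "correct": 0, "unsure_gold": 0, "unsure_hit": 0})
    for section, gold, pred in records:
        if gold not in LABELS or pred not in LABELS:
            raise ValueError(f"unknown label in record: {(section, gold, pred)}")
        s = stats[section]
        s["n"] += 1
        s["correct"] += int(gold == pred)
        # Track the "unsure" class on its own instead of merging it away.
        if gold == "unsure":
            s["unsure_gold"] += 1
            s["unsure_hit"] += int(pred == "unsure")

    report = {}
    for section, s in stats.items():
        report[section] = {
            "accuracy": s["correct"] / s["n"],
            # None when the section has no gold "unsure" examples.
            "unsure_recall": (s["unsure_hit"] / s["unsure_gold"]) if s["unsure_gold"] else None,
        }
    return report


if __name__ == "__main__":
    # Made-up records for demonstration only; section names are assumptions.
    demo = [
        ("history_of_present_illness", "positive", "positive"),
        ("history_of_present_illness", "unsure", "negative"),
        ("discharge_summary", "unsure", "unsure"),
        ("discharge_summary", "negative", "negative"),
    ]
    for section, metrics in per_section_report(demo).items():
        print(section, metrics)
```

A large gap between sections, or low "unsure" recall in one of them, would be one signal of the systematic errors described above, though it would not by itself identify their cause.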

Key insights

Dataset construction choices operationalize suicidality in EHRs, impacting NLP model reliability.

Method

The study ran a linguistic framing analysis on 15,585 annotated spans from the ScAN dataset, flagging negation, historical reference, and uncertainty with lexical indicators and medspaCy's ConText algorithm.
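To make the idea of lexical framing indicators concrete, here is a minimal sketch that flags negation, historical reference, and uncertainty with regular expressions and reports how often each framing occurs. It is a simplified stand-in, not the ConText algorithm and not the study's pipeline: the cue lists, the `frame_span` and `framing_rates` helpers, and the example spans are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical cue lists for illustration only; not the study's lexicon.
CUES = {
    "negation": [r"\bdenies\b", r"\bdenied\b", r"\bno\b", r"\bnot\b", r"\bwithout\b"],
    "historical": [r"\bhistory of\b", r"\bh/o\b", r"\bprevious(ly)?\b", r"\bprior\b", r"\bin the past\b"],
    "uncertainty": [r"\bpossibl[ey]\b", r"\bunclear\b", r"\bquestionable\b", r"\bmay have\b", r"\bsuspected\b"],
}

# One compiled, case-insensitive pattern per framing category.
PATTERNS = {name: re.compile("|".join(cues), re.IGNORECASE) for name, cues in CUES.items()}


def frame_span(text):
    """Return the set of framing categories whose cues appear in the span."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


def framing_rates(spans):
    """Return the fraction of spans carrying each framing category."""
    spans = list(spans)
    if not spans:
        return {}
    counts = Counter()
    for text in spans:
        counts.update(frame_span(text))
    return {name: counts[name] / len(spans) for name in PATTERNS}


if __name__ == "__main__":
    # Synthetic spans written for this example; not drawn from ScAN or MIMIC-III.
    demo_spans = [
        "Patient denies suicidal ideation",
        "history of overdose, possibly intentional",
        "expressed a wish to die",
        "prior attempt in the past",
    ]
    for span in demo_spans:
        print(sorted(frame_span(span)), "<-", span)
    print(framing_rates(demo_spans))
```

Unlike this sketch, ConText limits each cue to a scope around a target concept, so whole-span keyword matching like this will over-flag spans where the cue modifies something else.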

Best for: Research Scientist, AI Scientist, NLP Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.