Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Mental Health & Psychological Support · Depth: Expert, extended

Summary

A study of the ScAN dataset, built over MIMIC-III clinical notes, examines how dataset construction shapes suicidality detection in clinical natural language processing (NLP). The research argues that electronic health record (EHR)-based suicidality datasets encode a specific "documentation-mediated, episodic, and intent-resolved" operationalization of suicidality, rather than providing neutral ground truth. Using ScAN as a case study (12,759 clinical notes from 697 hospital stays, with 19,690 span-level annotations), the analysis shows that governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation yield labels that reflect clinician judgments, bound suicidality to discrete episodes, and infer intent from documentation. A linguistic examination of 15,585 annotated spans further shows that identical labels often cover heterogeneous clinical framings that vary in temporality, negation, and uncertainty: 27.8% of present-SI spans contain historical markers. The dataset also has a narrow demographic profile, being 70% White and 92% English-speaking.

Key takeaway

NLP engineers developing suicidality detection models should critically examine the construction assumptions of the underlying dataset. Models trained on documentation-mediated, episodic, and intent-resolved labels may exhibit systematic errors, especially on retrospective language or ambiguous cases. Evaluate performance separately across clinical note sections, and consider retaining "unsure" as a distinct category to preserve clinically meaningful uncertainty. Multi-annotator labeling and patient-authored sources can also improve transparency and robustness.

Key insights

Dataset construction choices operationalize suicidality in EHRs, impacting NLP model reliability.

Principles

EHR-based labels reflect documentation, not unmediated patient states.
Suicidality operationalization can flatten temporality and resolve ambiguity.
Dataset construction choices are often invisible defaults.

Method

The study conducted a linguistic framing analysis on 15,585 annotated spans from the ScAN dataset, identifying negation, historical reference, and uncertainty using lexical indicators and MedSpaCy's ConText algorithm.
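The lexical-indicator half of this method can be illustrated with a minimal sketch. It is not the study's pipeline: the cue lists below are hypothetical, the ConText step is omitted entirely, and the function names `frame_span` and `framing_rates` are invented for this example. It only shows how a single span can carry several framing markers at once.

```python
import re
from collections import Counter

# Hypothetical cue lists for illustration only. The study's actual lexicon
# and its ConText configuration are not given in this summary.
CUES = {
    "negation": [r"\bdenies\b", r"\bdenied\b", r"\bno\b", r"\bnot\b", r"\bwithout\b"],
    "historical": [r"\bhistory of\b", r"\bh/o\b", r"\bprior\b", r"\bprevious(ly)?\b", r"\bpast\b"],
    "uncertainty": [r"\bpossibl[ey]\b", r"\bunclear\b", r"\bmay\b", r"\bquestionable\b", r"\bsuspected\b"],
}

# One case-insensitive alternation pattern per framing category.
PATTERNS = {
    name: re.compile("|".join(pats), re.IGNORECASE) for name, pats in CUES.items()
}


def frame_span(text):
    """Return the set of framing categories whose cues appear in the span."""
    return {name for name, pat in PATTERNS.items() if pat.search(text)}


def framing_rates(spans):
    """Return the fraction of spans that carry each framing marker."""
    counts = Counter()
    for span in spans:
        counts.update(frame_span(span))
    n = len(spans)
    return {name: (counts[name] / n if n else 0.0) for name in PATTERNS}


if __name__ == "__main__":
    # Invented example spans, not taken from ScAN or MIMIC-III.
    demo = [
        "Patient denies suicidal ideation",
        "History of prior overdose, intent unclear",
        "Endorses thoughts of self-harm",
    ]
    for span in demo:
        print(span, "->", sorted(frame_span(span)))
    print(framing_rates(demo))
```

Grouping such rates by annotation label would give a figure of the same kind as the reported share of present-SI spans with historical markers, though a pure keyword match is cruder than a scope-aware algorithm and will over-trigger on cues that fall outside the relevant clause.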

In practice

Evaluate model performance separately by clinical note section.
Retain "unsure" as a distinct modeling category.
Incorporate patient-authored sources alongside clinician notes.
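The first two recommendations can be combined in a small evaluation helper, sketched here under stated assumptions: the label names in `LABELS`, the section names, and the function `per_section_report` are all hypothetical, and accuracy plus per-class recall stand in for whatever metrics a real evaluation would use.

```python
from collections import defaultdict

# Hypothetical label set; "unsure" is kept as its own class instead of
# being merged into "negative" or dropped.
LABELS = ("positive", "negative", "unsure")


def per_section_report(records):
    """Summarize predictions separately for each clinical note section.

    records: iterable of (section, gold_label, predicted_label) tuples.
    Returns {section: {"n": int, "accuracy": float,
                       "recall": {label: float or None}}},
    where recall is None for labels with no gold examples in that section.
    """
    by_section = defaultdict(list)
    for section, gold, pred in records:
        by_section[section].append((gold, pred))

    report = {}
    for section, pairs in by_section.items():
        correct = sum(gold == pred for gold, pred in pairs)
        recall = {}
        for label in LABELS:
            # Predictions made for examples whose gold label is `label`.
            preds_for_label = [pred for gold, pred in pairs if gold == label]
            if preds_for_label:
                hits = sum(pred == label for pred in preds_for_label)
                recall[label] = hits / len(preds_for_label)
            else:
                recall[label] = None
        report[section] = {
            "n": len(pairs),
            "accuracy": correct / len(pairs),
            "recall": recall,
        }
    return report


if __name__ == "__main__":
    # Invented records; the section names are placeholders.
    records = [
        ("history_of_present_illness", "positive", "positive"),
        ("history_of_present_illness", "unsure", "negative"),
        ("discharge_summary", "negative", "negative"),
        ("discharge_summary", "unsure", "unsure"),
    ]
    for section, stats in per_section_report(records).items():
        print(section, stats)
```

Reporting recall on "unsure" per section makes it visible when a model silently resolves ambiguous documentation to a confident label in one part of the note but not another, which a single pooled score would hide.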

Topics

Clinical NLP
Suicidality Detection
Dataset Operationalization
Electronic Health Records
ScAN Dataset
Annotation Bias
Linguistic Framing Analysis

Best for: Research Scientist, AI Scientist, NLP Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.