Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text
Summary
A study of the ScAN dataset, built over MIMIC-III clinical notes, examines how dataset construction shapes suicidality detection in clinical natural language processing (NLP). It argues that suicidality datasets derived from electronic health records (EHRs) do not provide neutral ground truth; they encode a specific "documentation-mediated, episodic, and intent-resolved" operationalization of suicidality. ScAN comprises 12,759 clinical notes from 697 hospital stays, with 19,690 span-level annotations. Using it as a case study, the analysis shows that governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician judgments, bound suicidality to discrete episodes, and infer intent from documentation. A linguistic examination of 15,585 annotated spans further shows that identical labels often cover heterogeneous clinical framings that vary in temporality, negation, and uncertainty: 27.8% of present-SI spans contain historical markers. The dataset's demographic profile is also narrow: 70% White and 92% English-speaking.
Key takeaway
NLP engineers developing suicidality detection models should critically examine the construction assumptions of the underlying dataset. Models trained on documentation-mediated, episodic, and intent-resolved labels may exhibit systematic errors, especially on retrospective language or ambiguous cases. Evaluate performance across clinical note sections, and consider retaining "unsure" as a distinct category to preserve clinically meaningful uncertainty. Multi-annotator labeling and patient-authored sources can also improve transparency and robustness.
Key insights
Dataset construction choices determine how suicidality is operationalized in EHR data, which in turn affects NLP model reliability.
Principles
- EHR-based labels reflect documentation, not unmediated patient states.
- Operationalizing suicidality can flatten temporality and force ambiguous cases into definite labels.
- Dataset construction choices are often invisible defaults.
Method
The study ran a linguistic framing analysis on 15,585 annotated spans from the ScAN dataset, identifying negation, historical reference, and uncertainty with lexical indicators and medspaCy's ConText algorithm.
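To make the lexical-indicator half of this method concrete, here is a minimal sketch in plain Python. It does not use medspaCy or ConText, and it is not the study's implementation: the cue lists in `CUES` and the function names `tag_framing` and `framing_rates` are hypothetical, chosen only to illustrate how spans might be tagged for negation, historical reference, and uncertainty.

```python
import re
from collections import Counter

# Hypothetical cue lists for illustration; the study's actual lexicon is not
# given in this summary, and a real system would need a far richer one.
CUES = {
    "negation": [r"\bdenies\b", r"\bdenied\b", r"\bno\b", r"\bnot\b", r"\bwithout\b"],
    "historical": [r"\bhistory of\b", r"\bh/o\b", r"\bprior\b",
                   r"\bprevious(ly)?\b", r"\bin the past\b", r"\byears? ago\b"],
    "uncertainty": [r"\bpossibl[ey]\b", r"\bunclear\b", r"\bmay\b",
                    r"\bquestionable\b", r"\bsuspected\b"],
}

# Compile once, case-insensitively, so clinical shorthand in any casing matches.
COMPILED = {
    name: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for name, patterns in CUES.items()
}


def tag_framing(span: str) -> set[str]:
    """Return the framing categories whose cues appear anywhere in the span."""
    return {
        name
        for name, patterns in COMPILED.items()
        if any(p.search(span) for p in patterns)
    }


def framing_rates(spans: list[str]) -> dict[str, float]:
    """Fraction of spans carrying each framing category (0.0 for empty input)."""
    counts = Counter()
    for span in spans:
        counts.update(tag_framing(span))
    n = len(spans)
    return {name: (counts[name] / n if n else 0.0) for name in CUES}
```

Running `framing_rates` over all spans that share one label gives a rough view of how mixed the framings under that label are. Plain keyword matching ignores cue scope, so it will over-trigger compared with a scope-aware approach such as ConText.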
In practice
- Evaluate model performance separately by clinical note section.
- Retain "unsure" as a distinct modeling category.
- Incorporate patient-authored sources alongside clinician notes.
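The first two recommendations can be combined in a small evaluation helper. The sketch below is illustrative only: the label set in `LABELS`, the record layout, and the function name `per_section_recall` are assumptions, not part of the study or of any ScAN tooling. It computes per-label recall separately for each note section and keeps "unsure" as its own class instead of folding it into another label.

```python
from collections import defaultdict

# Assumed label set for illustration; "unsure" is kept as its own class.
LABELS = ("positive", "negative", "unsure")


def per_section_recall(records):
    """Compute per-label recall within each clinical note section.

    records: iterable of (section, gold_label, predicted_label) tuples.
    Returns {section: {label: recall}}, with None where a section has no
    gold examples of that label (recall is undefined there).
    """
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for section, gold, pred in records:
        totals[section][gold] += 1
        if gold == pred:
            hits[section][gold] += 1
    return {
        section: {
            label: (hits[section][label] / totals[section][label])
            if totals[section][label]
            else None
            for label in LABELS
        }
        for section in totals
    }
```

A usage example, with hypothetical section names:

```python
records = [
    ("HPI", "positive", "positive"),
    ("HPI", "positive", "negative"),
    ("HPI", "unsure", "unsure"),
    ("Discharge", "negative", "negative"),
    ("Discharge", "unsure", "negative"),
]
report = per_section_recall(records)
# report["HPI"]["positive"] is 0.5; report["Discharge"]["unsure"] is 0.0
```

Reporting `None` instead of 0 for labels absent from a section keeps "no data" visibly separate from "always wrong", which matters when some sections rarely contain ambiguous cases.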
Topics
- Clinical NLP
- Suicidality Detection
- Dataset Operationalization
- Electronic Health Records
- ScAN Dataset
- Annotation Bias
- Linguistic Framing Analysis
Best for: Research Scientist, AI Scientist, NLP Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.