Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency
Summary
A study of large language models (LLMs) in medical triage found significant gender-dependent disparities in emergency room (ER) referral rates for identical neurological symptoms. Using Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini, researchers presented a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) for patients of varying age and gender. Young women were referred to the ER far less often than age-matched men (Gemini: 0% vs. 23.3%; Claude: 6.7% vs. 96.7%; GPT: 6.7% vs. 66.7%; all p < 0.001). The disparity vanished at age 65. The primary mechanism identified was diagnostic substitution: the models preferentially diagnosed young women with Idiopathic Intracranial Hypertension (IIH) and men with generic increased intracranial pressure, which routed women to lower-urgency outpatient care despite comparable symptom severity.
Key takeaway
For AI Scientists and Research Scientists developing clinical LLMs, this study highlights a critical need to mitigate gender-dependent diagnostic bias. Models such as Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini may replicate human biases and assign unequal triage urgency to identical symptoms. To support equitable patient care, decouple urgency assessment from probabilistic diagnostic priors and test rigorously for demographic disparities, especially for conditions that are epidemiologically linked to specific groups.
Key insights
All three LLMs tested showed systematic gender bias in medical triage, driven by diagnostic substitution that lowers urgency for young women.
Principles
- LLMs replicate human clinical biases.
- Epidemiological priors can suppress triage urgency.
- Decouple urgency assessment from diagnostic priors.
Method
Standardized symptom profiles were presented to Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini across age and gender conditions (630 trials) to assess triage recommendations.
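The summary does not say which statistical test produced the reported p-values, nor how many trials fell in each age and gender condition. As a minimal sketch of how a referral-rate gap between two groups could be checked, the block below implements a two-sided Fisher exact test using only the standard library. The function names and the counts in the usage line are illustrative assumptions, not the study's code or data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. women, men); columns are outcomes
    (referred to ER, not referred).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x referrals in row 1,
        # with all table margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    # The small relative tolerance guards against float rounding on ties.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


def compare_referrals(referred_a, n_a, referred_b, n_b):
    """Return (rate_a, rate_b, p_value) for two groups' ER referral counts."""
    p = fisher_exact_two_sided(
        referred_a, n_a - referred_a, referred_b, n_b - referred_b
    )
    return referred_a / n_a, referred_b / n_b, p


# Hypothetical counts for illustration only (not taken from the study).
rate_w, rate_m, p = compare_referrals(3, 40, 25, 40)
print(f"women {rate_w:.1%} vs men {rate_m:.1%}, p = {p:.2g}")
```

An exact test is used here because referral counts in a single condition can be very small (a 0% rate means a zero cell), where chi-square approximations are less reliable.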
In practice
- Test clinical LLMs for demographic biases.
- Review LLM diagnostic substitution patterns.
- Implement urgency-diagnosis decoupling.
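One way to read "urgency-diagnosis decoupling" is to compute urgency from the presenting symptoms alone, then check that swapping the patient's gender never changes the result. The sketch below illustrates that idea under stated assumptions: the `RED_FLAGS` set, the threshold, the `Case` type, and all function names are hypothetical, and the rule is a toy stand-in rather than clinical logic or the study's method.

```python
from dataclasses import dataclass, replace

# Hypothetical red-flag set, taken from the symptom profile in the summary.
# A real system would source this from clinical guidance.
RED_FLAGS = frozenset(
    {"persistent headache", "blurred vision", "morning nausea", "visual disturbances"}
)


@dataclass(frozen=True)
class Case:
    symptoms: frozenset
    age: int
    gender: str


def urgency(symptoms, threshold=3):
    """Urgency from symptoms only: no age, gender, or suspected diagnosis."""
    return "ER" if len(symptoms & RED_FLAGS) >= threshold else "outpatient"


def decoupled_triage(case):
    """Triage that ignores demographics by construction."""
    return urgency(case.symptoms)


def is_gender_invariant(case, triage_fn, genders=("female", "male")):
    """Counterfactual check: does triage_fn give the same answer for each gender?"""
    results = {triage_fn(replace(case, gender=g)) for g in genders}
    return len(results) == 1


case = Case(symptoms=RED_FLAGS, age=25, gender="female")
print(decoupled_triage(case), is_gender_invariant(case, decoupled_triage))
```

The same `is_gender_invariant` check can wrap any triage function, including a call to an LLM, so it doubles as a regression test for the disparity described above. For a stochastic model the check would need repeated trials and a rate comparison rather than a single equality.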
Topics
- LLM Bias
- Medical Triage
- Diagnostic Substitution
- Gender Disparity
- Clinical AI
- Healthcare AI
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.