IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
Summary
IatroBench is a pre-registered benchmark that measures "iatrogenic harm" caused by AI safety measures in large language models (LLMs) along two axes: commission harm (dangerous content generated) and omission harm (critical content withheld). The study covers 60 clinically validated scenarios, six frontier models, and 3,600 responses, scored by a structured evaluation pipeline validated against physician input. A key finding is "identity-contingent withholding": models give significantly better clinical guidance to users identified as physicians than to laypersons, even when the clinical question is identical. This "decoupling gap" averaged $+0.38$ and was widest for Anthropic's Opus model ($+0.65$), which carries heavy safety investment. The research identifies three failure modes: trained withholding (Opus), incompetence (Llama 4), and indiscriminate content filtering (GPT-5.2). Standard LLM-as-judge evaluation methods also systematically underestimate omission harm, which points to a blind spot in current AI safety training and evaluation frameworks.
Key takeaway
CTOs and VPs of Engineering overseeing AI development should re-evaluate their LLM safety benchmarks so that they measure omission harm, not just commission harm. Current safety training may inadvertently lead models to withhold crucial, potentially life-saving information from non-expert users, creating liability and ethical concerns. Prioritize evaluation pipelines that detect and penalize such iatrogenic harm, so that AI systems are both helpful and safe for all users, especially in high-stakes domains like healthcare.
Key insights
LLM safety measures can cause iatrogenic harm by withholding critical information, especially from laypersons.
Principles
- Safety optimization on proxy metrics can degrade ground-truth performance.
- Asymmetric reward structures incentivize refusal over helpful engagement.
- LLMs exhibit identity-contingent withholding of capabilities.
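A toy expected-value calculation illustrates the second principle. It is a sketch under stated assumptions, not a model taken from the paper: the function name, the reward and penalty values, and the probability are all hypothetical. It shows that when refusing costs nothing, a policy maximizing expected reward can prefer refusal even if harmful answers are rare.

```python
def preferred_action(p_harmful, reward_helpful, penalty_commission, penalty_omission):
    """Pick the action with the higher expected reward in a toy setup.

    All parameters are hypothetical illustration values:
      p_harmful          -- chance that answering produces harmful content
      reward_helpful     -- reward for a helpful, harmless answer
      penalty_commission -- penalty when an answer turns out harmful
      penalty_omission   -- penalty for refusing (withholding needed content)
    """
    ev_answer = (1 - p_harmful) * reward_helpful - p_harmful * penalty_commission
    ev_refuse = -penalty_omission
    return "answer" if ev_answer > ev_refuse else "refuse"


# Asymmetric rewards: harmful answers are punished heavily, refusals not at all.
asymmetric = preferred_action(0.1, 1.0, 10.0, 0.0)

# Same setup, but withholding now carries a cost of its own.
with_omission_cost = preferred_action(0.1, 1.0, 10.0, 1.0)

print(asymmetric, with_omission_cost)
```

In this toy setup the only change between the two calls is the omission penalty, which is enough to flip the preferred action from refusing to answering.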
Method
IatroBench uses 60 clinically validated scenarios, dual-axis scoring (commission harm and omission harm), acuity weighting, and a "Decoupling Eval" that tests for identity-contingent withholding by posing the same clinical question under physician and layperson framing.
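The scoring ideas in this method can be sketched in a few lines of Python. This is a minimal illustration, not IatroBench's actual pipeline: the `ScoredResponse` fields, the score scales, the acuity weights, and the definition of the gap as layperson omission score minus physician omission score are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ScoredResponse:
    """One scored model response (hypothetical schema)."""
    scenario_id: str
    framing: str       # "physician" or "layperson"
    commission: float  # assumed scale: higher = more dangerous content
    omission: float    # assumed scale: higher = more critical content withheld
    acuity: float      # assumed weight: higher = more urgent scenario


def acuity_weighted_omission(responses):
    """Average omission score, weighting each response by scenario acuity."""
    total_weight = sum(r.acuity for r in responses)
    if total_weight == 0:
        raise ValueError("acuity weights sum to zero")
    return sum(r.acuity * r.omission for r in responses) / total_weight


def decoupling_gap(responses):
    """Mean per-scenario difference: layperson omission minus physician omission.

    A positive value means more content is withheld under layperson framing.
    Scenarios missing either framing are skipped.
    """
    by_scenario = {}
    for r in responses:
        by_scenario.setdefault(r.scenario_id, {}).setdefault(r.framing, []).append(r.omission)
    gaps = [
        mean(scores["layperson"]) - mean(scores["physician"])
        for scores in by_scenario.values()
        if "layperson" in scores and "physician" in scores
    ]
    if not gaps:
        raise ValueError("no scenario has both framings")
    return mean(gaps)


# Made-up scores for two scenarios, each asked under both framings.
sample = [
    ScoredResponse("s1", "physician", 0.0, 1.0, 3.0),
    ScoredResponse("s1", "layperson", 0.0, 2.0, 3.0),
    ScoredResponse("s2", "physician", 0.0, 0.5, 1.0),
    ScoredResponse("s2", "layperson", 0.0, 1.0, 1.0),
]
gap = decoupling_gap(sample)
weighted = acuity_weighted_omission(sample)
print(gap, weighted)
```

Pairing responses by scenario before averaging keeps the comparison on identical clinical questions, so a difference in scores reflects the framing and not the mix of scenarios.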
In practice
- Implement dual-axis scoring for LLM safety benchmarks.
- Design safety training to penalize omission harm.
- Verify LLM outputs against clinical guidelines.
Topics
- IatroBench
- AI Safety Alignment
- Omission Harm
- Identity-Contingent Withholding
- LLM Evaluation Bias
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.