Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Health & Medical Research) · Depth: Expert, quick

Summary

A systematic study finds that large language models (LLMs) fail in a predictable way on electronic health record (EHR) question answering: accuracy declines significantly as the number of inferential steps a question requires, its "hop count," increases. The researchers introduced a pre-specified hop-count taxonomy to quantify reasoning depth and used it to annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels. All three evaluated models, Claude Sonnet 4-6, GPT-4o, and GPT-5.4-2026-03-05, showed a monotone decline in accuracy. Claude Sonnet (zero-shot) fell from 30.6% at hop=1 to 17.6% at hop=4, GPT-4o from 37.8% to 14.7%, and GPT-5.4-2026-03-05 from 37.8% to 23.5%. EHR truncation does not explain the decline, because higher-hop questions showed comparable context sufficiency. Extended thinking neither significantly improved accuracy nor flattened the accuracy-depth curve, even though thinking-token usage scaled with hop count (r=0.31, p<0.0001). The results support hop count as a cross-architecture predictor of LLM error in clinical AI.
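To make the size of these declines easier to compare, the short sketch below computes the absolute drop (in percentage points) and the relative drop for each model. It uses only the hop=1 and hop=4 accuracies quoted in the summary; the dictionary layout and the helper name `accuracy_drop` are illustrative choices, not part of the study.

```python
# Accuracy (%) at hop=1 and hop=4, as quoted in the summary.
# The Claude Sonnet figures are explicitly zero-shot.
reported = {
    "Claude Sonnet": (30.6, 17.6),
    "GPT-4o": (37.8, 14.7),
    "GPT-5.4": (37.8, 23.5),
}


def accuracy_drop(hop1: float, hop4: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = hop1 - hop4
    relative = 100.0 * absolute / hop1
    return round(absolute, 1), round(relative, 1)


drops = {model: accuracy_drop(*acc) for model, acc in reported.items()}

for model, (absolute, relative) in drops.items():
    print(f"{model}: -{absolute} points ({relative}% relative)")
```

On these figures, GPT-4o shows the largest drop in both absolute and relative terms, while GPT-5.4 retains the highest accuracy at hop=4.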

Key takeaway

Machine learning engineers deploying LLMs in clinical settings must account for compositional reasoning depth. Models fail predictably more often on EHR questions that require multiple inferential steps, and this held even for the most advanced model evaluated, GPT-5.4. Stratify deployment risk by the hop count of anticipated queries: prioritize simpler, low-hop questions for automation, and design systems that break complex clinical questions into less inferentially demanding sub-questions, with the aim of improving reliability.
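One way to act on this advice is a simple routing gate keyed on estimated hop count. The sketch below is a minimal illustration under stated assumptions: the `ClinicalQuery` type, the `route` function, and the threshold `MAX_AUTOMATED_HOPS = 2` are all hypothetical, and the hop estimate is assumed to come from a human annotator or an upstream classifier that the source does not describe.

```python
from dataclasses import dataclass

# Hypothetical cutoff: queries above this hop count are not automated.
# The source does not recommend a specific value.
MAX_AUTOMATED_HOPS = 2


@dataclass
class ClinicalQuery:
    text: str
    estimated_hops: int  # assumed to be supplied by an annotator or classifier


def route(query: ClinicalQuery, max_hops: int = MAX_AUTOMATED_HOPS) -> str:
    """Return 'automate' for low-hop queries and 'human_review' otherwise."""
    if query.estimated_hops < 1:
        raise ValueError("hop count must be at least 1")
    return "automate" if query.estimated_hops <= max_hops else "human_review"


low = ClinicalQuery("What was the most recent recorded weight?", estimated_hops=1)
high = ClinicalQuery(
    "Did renal function change after the last medication adjustment?",
    estimated_hops=4,
)
print(route(low), route(high))
```

In a real deployment the threshold would need to be chosen from per-hop accuracy measured on the target model and data, not fixed in advance.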

Key insights

LLM accuracy in EHR QA systematically declines with increased compositional reasoning depth, quantified by "hop count."

Method

Introduced a hop-count taxonomy for EHR questions, annotating 313 clinician-generated MedAlign pairs. Evaluated LLMs (Claude Sonnet, GPT-4o, GPT-5.4) zero-shot and with extended thinking.
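The stratified evaluation described here can be sketched as follows: group graded answers by annotated hop level, compute accuracy per level, and check whether accuracy is non-increasing with depth. The function names and the `(hop_level, is_correct)` record format are assumptions for illustration, and the toy records are invented, not the study's data.

```python
from collections import defaultdict


def accuracy_by_hop(records):
    """Map each hop level to accuracy, given (hop_level, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for hop, is_correct in records:
        totals[hop] += 1
        correct[hop] += int(is_correct)
    return {hop: correct[hop] / totals[hop] for hop in sorted(totals)}


def is_monotone_decline(acc_by_hop):
    """True if accuracy never increases as hop level increases."""
    values = [acc_by_hop[hop] for hop in sorted(acc_by_hop)]
    return all(a >= b for a, b in zip(values, values[1:]))


# Toy records for illustration only; these are not results from the study.
toy_records = [
    (1, True), (1, True), (1, False),
    (2, True), (2, False),
    (3, True), (3, False), (3, False),
    (4, False), (4, False),
]

per_hop = accuracy_by_hop(toy_records)
print(per_hop, is_monotone_decline(per_hop))
```

The same grouping can be run separately for each model and prompting condition (zero-shot versus extended thinking) to compare accuracy-depth curves.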

Best for: AI Architect, AI Engineer, NLP Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.