Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

2026-06-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, quick

Summary

A systematic study finds that large language models (LLMs) show a predictable failure pattern in electronic health record (EHR) question answering: accuracy declines as the number of inferential steps, or "hop count," increases. The researchers introduced a pre-specified hop-count taxonomy to quantify reasoning depth and annotated 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels. All three models evaluated (Claude Sonnet 4-6, GPT-4o, and GPT-5.4-2026-03-05) showed a monotone accuracy decline. Between hop=1 and hop=4, zero-shot Claude Sonnet dropped from 30.6% to 17.6%, GPT-4o from 37.8% to 14.7%, and GPT-5.4-2026-03-05 from 37.8% to 23.5%. The decline was not explained by EHR truncation, since higher-hop questions showed comparable context sufficiency. Extended thinking did not significantly improve performance or flatten the accuracy-depth curve, even though thinking-token usage scaled with hop count (r=0.31, p<0.0001). The results support hop count as a cross-architecture predictor of LLM error in clinical AI.
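To make the size of the decline easier to compare across models, the reported hop=1 and hop=4 accuracies can be turned into absolute and relative drops. The following is a minimal sketch that uses only the figures quoted in the summary; the names `REPORTED`, `accuracy_drop`, and `DROPS` are illustrative and do not come from the study.

```python
# Accuracy (%) at hop=1 and hop=4, copied from the summary text.
REPORTED = {
    "Claude Sonnet (zero-shot)": (30.6, 17.6),
    "GPT-4o": (37.8, 14.7),
    "GPT-5.4-2026-03-05": (37.8, 23.5),
}


def accuracy_drop(hop1, hop4):
    """Return (absolute drop in percentage points, relative drop in %)."""
    absolute = hop1 - hop4
    relative = 100.0 * absolute / hop1
    return round(absolute, 1), round(relative, 1)


# Illustrative arithmetic on the reported numbers, not an analysis from the paper.
DROPS = {model: accuracy_drop(*acc) for model, acc in REPORTED.items()}

for model, (absolute, relative) in DROPS.items():
    print(f"{model}: -{absolute} points ({relative}% relative)")
```

On these figures, GPT-4o loses the most between hop=1 and hop=4 (23.1 points, about 61% of its hop=1 accuracy), while GPT-5.4-2026-03-05 loses the least in relative terms (about 38%).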

Key takeaway

Machine learning engineers deploying LLMs in clinical settings should account for compositional reasoning depth. In this study, models failed more often on EHR questions requiring multiple inferential steps, and this held even for GPT-5.4, the most recent model evaluated. Stratify deployment risk by the hop count of anticipated queries. Prioritize simpler, low-hop questions for automation, and design systems to break complex clinical questions into less inferentially demanding sub-questions, which may improve reliability.

Key insights

LLM accuracy in EHR QA systematically declines with increased compositional reasoning depth, quantified by "hop count."

Principles

LLM performance degrades predictably with reasoning complexity.
Transformer compositionality limits may constrain clinical AI reliability.
Hop count serves as a robust failure predictor.

Method

The researchers introduced a pre-specified hop-count taxonomy for EHR questions and annotated 313 clinician-generated MedAlign question-answer pairs across four hop levels. They evaluated three LLMs (Claude Sonnet 4-6, GPT-4o, GPT-5.4-2026-03-05) zero-shot and with extended thinking.
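The core measurement described here, accuracy stratified by hop level plus a check that it never rises with depth, can be sketched as follows. This is an assumption-laden illustration, not the authors' evaluation code: the record format `(hop_count, is_correct)`, the function names, and the toy data are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_hop(records):
    """Map each hop level to its accuracy, given (hop_count, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for hop, is_correct in records:
        totals[hop] += 1
        correct[hop] += int(is_correct)
    return {hop: correct[hop] / totals[hop] for hop in sorted(totals)}


def is_monotone_decline(acc_by_hop):
    """True if accuracy never increases as hop count increases."""
    values = [acc_by_hop[hop] for hop in sorted(acc_by_hop)]
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


# Toy data for illustration only; NOT taken from the study.
toy_records = [
    (1, True), (1, True), (1, False), (1, True),
    (2, True), (2, False), (2, True), (2, False),
    (3, True), (3, False), (3, False), (3, False),
    (4, False), (4, False), (4, False), (4, False),
]
toy_accuracy = accuracy_by_hop(toy_records)
print(toy_accuracy, is_monotone_decline(toy_accuracy))
```

In a real evaluation the `is_correct` flag would come from grading model answers against the clinician-generated references, and the hop level from the annotation taxonomy.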

In practice

Stratify clinical AI deployment risks by question hop count.
Prioritize simpler EHR queries for LLM automation.
Design prompts to minimize inferential steps.
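One way to act on these recommendations is a simple routing rule keyed on hop count. The sketch below is hypothetical: it assumes a hop count is already available for each incoming question (from an annotator or some estimator, neither of which is specified in the source), and the thresholds, the `RoutingPolicy` class, and the route labels are illustrative choices rather than values from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPolicy:
    # Hypothetical thresholds; tune them against your own validation data.
    max_auto_hops: int = 1        # answer directly with the LLM
    max_decompose_hops: int = 3   # split into lower-hop sub-questions first


def route_question(hop_count: int, policy: RoutingPolicy = RoutingPolicy()) -> str:
    """Pick a handling route for a question based on its hop count."""
    if hop_count < 1:
        raise ValueError("hop_count must be >= 1")
    if hop_count <= policy.max_auto_hops:
        return "automate"
    if hop_count <= policy.max_decompose_hops:
        return "decompose"
    return "clinician_review"


for hops in (1, 2, 4):
    print(hops, route_question(hops))
```

Keeping the thresholds in a small policy object makes it straightforward to tighten or relax them per deployment as local error rates by hop level become known.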

Topics

Large Language Models
Electronic Health Records
Clinical AI
Compositional Reasoning
Model Failure Prediction
Hop Count Taxonomy

Best for: AI Architect, AI Engineer, NLP Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.