Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance
Summary
PhysAssistBench is a new benchmark for evaluating large language models (LLMs) in interactive physician assistance scenarios rather than on isolated capability tests. Built from real MIMIC-IV cases, it uses a scalable multi-agent pipeline to create "agentic patients" that turn static EHR records into multi-turn clinical interactions while preserving clinical factuality. The benchmark comprises 324 multi-turn sessions and 1,296 manually reviewed, physician-validated turns. It spans 4 clinical scenarios (Diagnostic Workup, Med Safety, Treatment Response, Discharge Planning), 4 tasks (Information Lookup, Data Gathering, Clinical Reasoning, Write/Update), and 3 physician-query implicitness subtypes (Nominal Anaphora, Predicate Ellipsis, Abstract Event Anaphora). Experiments with 14 leading LLMs, including GPT-5.4, Claude-Opus-4.7, and GLM-5, show that current models remain unreliable: Claude-Opus-4.7, for instance, achieved 23.5% (EN) and 26.9% (ZH) Pass@Session at a 0.60 rubric score threshold. The results point to a key bottleneck in coordinating knowledge, communication, and system interaction.
Key takeaway
If you are an AI scientist or machine learning engineer developing clinical LLMs, shift your evaluation focus from isolated capabilities to integrated, multi-turn physician assistance. Current models struggle to coordinate knowledge, patient communication, and EHR tool use, especially when queries are implicit or several tools must be composed. Prioritize robust interaction capabilities and address the language-conditioned biases that inhibit tool invocation, because current models are not yet reliable enough for consistent, session-level clinical support.
Key insights
LLMs struggle with interactive physician assistance requiring coordinated knowledge, communication, and EHR system interaction.
Principles
- Physician assistance needs integrated capabilities, not isolated gains.
- Multi-turn interaction and implicit queries significantly degrade LLM performance.
- Sustaining quality across turns is a distinct, harder capability than turn-level accuracy.
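To illustrate why session-level quality is harder than turn-level accuracy, here is a minimal Python sketch that contrasts the two. It assumes, as a guess at the metric's definition rather than something stated in this summary, that a session counts toward Pass@Session only when every one of its turns reaches the 0.60 rubric score threshold. The function names, session IDs, and scores are hypothetical; four turns per session matches the average implied by 1,296 turns across 324 sessions.

```python
from typing import Dict, List

THRESHOLD = 0.60  # rubric score threshold quoted in the summary


def turn_pass_rate(sessions: Dict[str, List[float]], threshold: float = THRESHOLD) -> float:
    """Fraction of individual turns whose rubric score meets the threshold."""
    scores = [score for turns in sessions.values() for score in turns]
    return sum(score >= threshold for score in scores) / len(scores)


def pass_at_session(sessions: Dict[str, List[float]], threshold: float = THRESHOLD) -> float:
    """Fraction of sessions in which EVERY turn meets the threshold.

    This all-turns-must-pass rule is an assumed definition for illustration.
    """
    passed = sum(all(score >= threshold for score in turns) for turns in sessions.values())
    return passed / len(sessions)


# Hypothetical rubric scores: four sessions, four turns each.
sessions = {
    "s1": [0.80, 0.70, 0.90, 0.65],
    "s2": [0.90, 0.55, 0.80, 0.75],  # one weak turn fails the whole session
    "s3": [0.70, 0.85, 0.40, 0.90],  # one weak turn fails the whole session
    "s4": [0.95, 0.60, 0.70, 0.80],
}

print(f"turn-level pass rate: {turn_pass_rate(sessions):.3f}")
print(f"Pass@Session:         {pass_at_session(sessions):.3f}")
```

In this toy data, 14 of 16 turns pass but only 2 of 4 sessions do, so a single weak turn is enough to sink an otherwise strong session under the assumed rule.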
Method
PhysAssistBench uses a multi-agent pipeline to transform MIMIC-IV records into interactive, record-grounded "agentic patients" for multi-turn doctor-patient-EHR scenarios, ensuring clinical factuality and scalability.
In practice
- Evaluate LLMs on integrated, multi-turn clinical workflows.
- Focus on improving multi-tool composition and handling implicit requests.
- Address language-conditioned safety priors that suppress EHR tool calls.
Topics
- Medical LLMs
- Physician Assistance
- EHR Interaction
- PhysAssistBench
- Multi-agent Systems
- Clinical Benchmarking
- FHIR R4
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.