Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

2026-06-17 · Source: Artificial Intelligence · Field: Health & Wellbeing — Medical Devices & Health Technology, Health & Medical Research, Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

PhysAssistBench is a new benchmark for evaluating Large Language Models (LLMs) on interactive doctor-patient-EHR assistance. It addresses a gap in current evaluations, which test only isolated capabilities such as clinical knowledge or patient communication. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct "agentic patients" that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. The benchmark includes a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments show that leading LLMs remain unreliable in this interactive setting: reliable clinical assistance requires robust coordination across knowledge, communication, and system interaction, not isolated improvements in any single area.

Key takeaway

AI scientists and machine learning engineers developing medical LLMs should recognize that current models are unreliable for interactive physician assistance because they lack coordinated capabilities. Prioritize integrating clinical knowledge, patient communication, and EHR system interaction in both model design and evaluation strategy. Focus on handling multi-turn, underspecified requests to build clinical AI that genuinely assists physicians.

Key insights

Medical LLM assistance requires coordinated capabilities across knowledge, communication, and systems, not isolated gains.

Principles

Physician assistance needs coordinated LLM capabilities.
Real clinical requests are often underspecified.
Clinical factuality is paramount in simulations.

Method

PhysAssistBench constructs interactive, record-grounded "agentic patients" from MIMIC-IV cases via a scalable pipeline, generating multi-turn clinical scenarios for evaluation.
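
As a rough illustration of the idea, the sketch below shows one way a static record could back a simulated patient that answers only from recorded facts. It is not the PhysAssistBench pipeline: the `AgenticPatient` class, its keyword-matching logic, and the synthetic record fields are all hypothetical, and a real system would presumably use an LLM rather than substring matching.

```python
from dataclasses import dataclass, field


@dataclass
class AgenticPatient:
    """Hypothetical record-grounded patient simulator (illustrative only)."""

    # Static facts taken from a record; synthetic values here, not MIMIC-IV data.
    record: dict[str, str]
    # Transcript of (question, reply) pairs accumulated over the conversation.
    history: list[tuple[str, str]] = field(default_factory=list)

    def answer(self, question: str) -> str:
        """Reply with a recorded fact if the question names its topic, else decline."""
        q = question.lower()
        for topic, fact in self.record.items():
            if topic in q:
                reply = fact
                break
        else:
            # No matching fact: decline rather than invent one,
            # so the simulation stays faithful to the record.
            reply = "I'm not sure."
        self.history.append((question, reply))
        return reply


def run_scenario(patient: AgenticPatient, questions: list[str]) -> list[tuple[str, str]]:
    """Ask each question in turn and return the resulting multi-turn transcript."""
    for question in questions:
        patient.answer(question)
    return patient.history
```

The design choice worth noting is the fallback branch: by declining instead of guessing, the simulated patient never states anything the record does not contain, which is the property the summary calls clinical factuality.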

In practice

Test LLMs on integrated clinical workflows.
Design LLMs for ambiguous, multi-turn interactions.
Ensure factual grounding in medical AI agents.

Topics

Medical LLMs
Physician Assistance
AI Benchmarking
Interactive AI
EHR Systems
MIMIC-IV Dataset

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.