Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring
Summary
A study evaluated multimodal large language models (LLMs) as automated raters for the Clock Drawing Test (CDT), scored on a six-level ordinal clinical scale (0-5), and compared them against supervised deep learning models. The researchers benchmarked four models from three LLM families (GPT-5, GPT-5.4, Gemini-2.5-Pro, Claude-4-Sonnet) against Vision Transformers (ViT) and ResNet-101 on two public datasets. Fully fine-tuned ViT models achieved the best calibration (MAE 0.52, within-1 accuracy 91%). Zero-shot LLMs such as GPT-5 were competitive in tolerance-based agreement (MAE 0.67, within-1 accuracy 92%) but exhibited a significant "central tendency effect": predictions are systematically compressed toward the middle of the scale, so low scores are over-predicted (0→1) and high scores are under-predicted (5→4), which disproportionately affects the clinically critical extremes. Ablation studies showed that neither few-shot exemplars nor removing clinical terminology eliminated this intrinsic LLM scoring bias.
Key takeaway
AI Scientists and Research Scientists developing clinical assessment tools should be aware that multimodal LLMs, despite strong aggregate performance, exhibit a central tendency bias that systematically misrepresents extreme scores. Prompt engineering does not easily mitigate this bias, and it can have significant clinical consequences. In high-stakes screening workflows, use calibration-aware evaluation protocols and consider post-hoc calibration or supervised models for final scoring, so that critical scale endpoints are identified reliably.
Key insights
Multimodal LLMs exhibit a central tendency bias in clinical ordinal scoring, compressing predictions towards the scale's middle.
Principles
- Aggregate metrics can mask critical failure modes.
- LLM scoring bias is intrinsic, not merely a prompt artifact.
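To illustrate the first principle, here is a minimal sketch of how MAE and within-1 accuracy can look healthy while every extreme score is wrong. The function names and the toy ratings are hypothetical and made up for illustration; they are not data from the study.

```python
def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ordinal scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one(y_true, y_pred):
    """Fraction of predictions within one scale point of the true score."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def exact_at(y_true, y_pred, score):
    """Exact-match accuracy restricted to items whose true score is `score`."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == score]
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy ratings (illustrative only): a rater that never uses 0 or 5.
y_true = [0, 0, 1, 2, 2, 3, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4]

overall_mae = mae(y_true, y_pred)
overall_within_one = within_one(y_true, y_pred)
exact_at_zero = exact_at(y_true, y_pred, 0)
exact_at_five = exact_at(y_true, y_pred, 5)

print(overall_mae, overall_within_one, exact_at_zero, exact_at_five)
```

On this toy data every prediction is within one point of the truth, yet no item with a true score of 0 or 5 is scored correctly, which is the pattern the aggregate numbers hide.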
Method
The study used an audit protocol combining per-score error decomposition, calibration-slope analysis, and prompt-ablation suites to distinguish prompt-engineering artifacts from intrinsic model behavior in clinical ordinal scoring.
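Two of the audit components named above can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's implementation: per-score error decomposition is taken here to mean the mean signed error grouped by true score, and the calibration slope is taken to be the ordinary least-squares slope of predicted score on true score, where a slope below 1 suggests compression toward the middle. The function names and toy data are hypothetical.

```python
from collections import defaultdict


def per_score_bias(y_true, y_pred):
    """Mean signed error (predicted - true) for each true score level."""
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(e) / len(e) for score, e in sorted(errors.items())}


def calibration_slope(y_true, y_pred):
    """OLS slope of predicted on true scores; < 1 indicates compression."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    return cov / var_t


# Toy ratings (illustrative only, not study data).
y_true = [0, 0, 1, 2, 2, 3, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4]

bias = per_score_bias(y_true, y_pred)
slope = calibration_slope(y_true, y_pred)

print(bias)   # positive at the low end, negative at the high end
print(slope)  # below 1 for a compressed rater
```

A positive bias at score 0 together with a negative bias at score 5 is the signature of central tendency; a single overall bias figure would average these out to zero.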
In practice
- Evaluate LLM raters with per-score error analysis.
- Consider post-hoc calibration for LLM-based clinical tools.
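The summary does not say which post-hoc calibration method to use. As one possible sketch, assuming a small held-out set of items with both LLM scores and reference scores is available, a linear map can be fitted from predicted to reference scores and then rounded and clipped back onto the 0-5 scale. The helper names and data below are hypothetical.

```python
def fit_linear_recalibrator(cal_pred, cal_true, lo=0, hi=5):
    """Fit true ~ a * pred + b on a calibration set; return a scoring function."""
    n = len(cal_pred)
    mean_p = sum(cal_pred) / n
    mean_t = sum(cal_true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(cal_pred, cal_true))
    var_p = sum((p - mean_p) ** 2 for p in cal_pred)
    a = cov / var_p
    b = mean_t - a * mean_p

    def recalibrate(pred):
        # Stretch the raw score, then snap back to a valid ordinal level.
        return min(hi, max(lo, round(a * pred + b)))

    return recalibrate


# Hypothetical held-out calibration set (illustrative only).
cal_pred = [1, 1, 1, 2, 2, 3, 3, 4, 4, 4]
cal_true = [0, 0, 1, 2, 2, 3, 3, 4, 5, 5]

recalibrate = fit_linear_recalibrator(cal_pred, cal_true)
remapped = {raw: recalibrate(raw) for raw in range(6)}
print(remapped)
```

A monotone remapping like this can restore use of the scale endpoints, but it cannot separate cases the rater scored identically: in the toy data every raw 4 becomes a 5, including the item whose true score is 4. Per-score error analysis should therefore be rerun after calibration.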
Topics
- Multimodal LLMs
- Clinical Ordinal Scoring
- Clock Drawing Test
- Central Tendency Bias
- Cognitive Impairment Screening
Best for: AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.