Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Medical Devices & Health Technology · Depth: Expert, extended

Summary

A study evaluated multimodal large language models (LLMs) as automated raters for the Clock Drawing Test (CDT) on a six-level ordinal clinical scale (0-5), comparing them against supervised deep learning models. Benchmarking four LLMs from three families (GPT-5, GPT-5.4, Gemini-2.5-Pro, Claude-4-Sonnet) against Vision Transformers (ViT) and ResNet-101 on two public datasets, the researchers found that fully fine-tuned ViT models achieved the best calibration (MAE 0.52, within-1 accuracy 91%). Zero-shot LLMs such as GPT-5 were competitive in tolerance-based agreement (MAE 0.67, within-1 accuracy 92%) but exhibited a significant "central tendency effect." This bias systematically compresses predictions toward the middle of the scale: the lowest scores are over-predicted (a true 0 rated as 1) and the highest are under-predicted (a true 5 rated as 4), so the errors fall disproportionately on the clinically critical extremes. Ablation studies showed that neither few-shot exemplars nor removing clinical terminology eliminated the bias, which points to intrinsic LLM scoring behavior rather than a prompt artifact.
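The two headline metrics can be illustrated with a minimal sketch. It assumes integer scores on the 0-5 scale; the function name `ordinal_agreement` and the toy scores are hypothetical and are not taken from the study.

```python
import numpy as np

def ordinal_agreement(y_true, y_pred):
    """Return (MAE, within-1 accuracy) for integer ordinal scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_pred - y_true)
    mae = float(abs_err.mean())             # mean absolute error in scale points
    within_1 = float((abs_err <= 1).mean())  # share of predictions off by at most 1
    return mae, within_1

# Toy data invented for illustration: every 0 is rated 1 and every 5 is rated 4.
y_true = [0, 0, 1, 2, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 3, 4, 4, 4]

mae, within_1 = ordinal_agreement(y_true, y_pred)
print(f"MAE={mae:.2f}, within-1={within_1:.0%}")
```

In this toy case every extreme score is misrated, yet within-1 accuracy is perfect. A tolerance-based metric alone can therefore hide the compression described in the summary.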

Key takeaway

AI scientists and research scientists developing clinical assessment tools should be aware that multimodal LLMs, despite strong aggregate performance, exhibit a central tendency bias that systematically misrepresents extreme scores. Prompt engineering does not easily mitigate this bias, and it can have significant clinical consequences. High-stakes screening workflows should therefore use calibration-aware evaluation protocols and consider post-hoc calibration or supervised models for final scoring, so that the critical scale endpoints are identified reliably.
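One simple form of post-hoc calibration is a linear rescaling fitted on a held-out labeled set. The sketch below is an assumption made for illustration: the summary does not say which calibration method, if any, the study used, and the names `fit_linear_calibrator` and `apply_calibrator` are hypothetical.

```python
import numpy as np

def fit_linear_calibrator(pred_cal, true_cal):
    """Fit true ~ a * pred + b by least squares on a held-out calibration set."""
    a, b = np.polyfit(np.asarray(pred_cal, dtype=float),
                      np.asarray(true_cal, dtype=float), 1)
    return float(a), float(b)

def apply_calibrator(pred, a, b, lo=0, hi=5):
    """Rescale raw scores, round to the nearest level, and clip to the scale."""
    scaled = a * np.asarray(pred, dtype=float) + b
    return np.clip(np.rint(scaled), lo, hi).astype(int)

# Toy calibration data invented for illustration (compressed LLM-style ratings).
true_cal = [0, 0, 1, 2, 3, 4, 5, 5]
pred_cal = [1, 1, 1, 2, 3, 4, 4, 4]

a, b = fit_linear_calibrator(pred_cal, true_cal)
calibrated = apply_calibrator(pred_cal, a, b)
print(a, b, calibrated.tolist())
```

A rescaling like this can stretch compressed predictions back toward the endpoints, but it cannot recover information the rater discarded: if a true 0 and a true 1 both receive a raw score of 1, they map to the same calibrated score. In practice the calibrator would be fitted and evaluated on separate data.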

Key insights

Multimodal LLMs exhibit a central tendency bias in clinical ordinal scoring, compressing predictions towards the scale's middle.

Principles

Method

The study used an audit protocol combining per-score error decomposition, calibration-slope analysis, and prompt-ablation suites to distinguish prompt-engineering artifacts from intrinsic model behavior in clinical ordinal scoring.
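Two of the audit components named above can be sketched as follows. The study's exact definitions are not given in this summary, so this is one plausible reading: per-score error decomposition as the mean signed error at each true score level, and calibration slope as the least-squares slope of predicted score on true score. The function names and toy data are hypothetical.

```python
import numpy as np

def per_score_bias(y_true, y_pred, levels=range(6)):
    """Mean signed error (predicted - true) for each true score level present."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred, dtype=float)
    bias = {}
    for k in levels:
        mask = y_true == k
        if mask.any():
            bias[k] = float((y_pred[mask] - y_true[mask]).mean())
    return bias

def calibration_slope(y_true, y_pred):
    """Least-squares slope and intercept of predicted score on true score."""
    slope, intercept = np.polyfit(np.asarray(y_true, dtype=float),
                                  np.asarray(y_pred, dtype=float), 1)
    return float(slope), float(intercept)

# Toy data invented for illustration: extremes pulled one step toward the middle.
y_true = [0, 0, 1, 2, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 3, 4, 4, 4]

bias = per_score_bias(y_true, y_pred)
slope, intercept = calibration_slope(y_true, y_pred)
print(bias)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Under this reading, positive bias at the low end combined with negative bias at the high end, together with a slope below 1, is the signature of central tendency. A slope near 1 with per-level bias near zero would indicate no systematic compression.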

In practice

Topics

Best for: AI Scientist, Research Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.