Blinded Multi-Rater Comparative Evaluation of a Large Language Model and Clinician-Authored Responses in CGM-Informed Diabetes Counseling
Summary
A blinded multi-rater evaluation assessed a retrieval-grounded large language model (LLM) conversational agent (CA) for continuous glucose monitoring (CGM)-informed diabetes counseling. The study, conducted between October 2025 and February 2026, involved 12 CGM-informed cases and 6 senior UK diabetes clinicians. Each clinician reviewed 2 cases and answered 24 questions, and both the CA-generated and the clinician-authored responses were independently rated by 3 clinicians across 6 quality dimensions. The CA received significantly higher quality scores (mean 4.37) than the clinician responses (mean 3.58), with an estimated mean difference of 0.782 points (95% CI 0.692-0.872; P<.001). Notable differences in favor of the CA were observed in empathy (1.062 points) and actionability (0.992 points). Safety flag distributions were similar, and major concerns were rare (0.7% in both groups).
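The reported effect size can be sanity-checked with a little arithmetic. The sketch below uses only the figures quoted in the summary and assumes, without confirmation from the source, that the 95% CI is a symmetric normal-based interval; the variable names are illustrative. Under that assumption the interval implies a standard error and a z statistic consistent with the reported P<.001.

```python
from statistics import NormalDist

# Figures quoted in the summary above
ca_mean, clinician_mean = 4.37, 3.58
est_diff, ci_low, ci_high = 0.782, 0.692, 0.872

# Raw gap between the rounded group means; the study's estimated
# difference (0.782) comes from its own analysis and need not match exactly.
raw_diff = ca_mean - clinician_mean

# Assumption (not stated in the source): the CI is a symmetric,
# normal-based interval, so half-width = z * standard error.
z95 = NormalDist().inv_cdf(0.975)
std_err = (ci_high - ci_low) / (2 * z95)

# Implied z statistic and two-sided p-value under the same assumption
z_stat = est_diff / std_err
p_two_sided = 2 * (1 - NormalDist().cdf(z_stat))

print(f"raw difference of means: {raw_diff:.2f}")
print(f"implied standard error:  {std_err:.3f}")
print(f"implied z statistic:     {z_stat:.1f}")
print(f"p below 0.001:           {p_two_sided < 0.001}")
```

The interval is centered on the point estimate (the midpoint of 0.692 and 0.872 is 0.782), which is at least compatible with the symmetric-interval assumption used here.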
Key takeaway
For AI scientists developing healthcare applications, this study suggests that a retrieval-grounded LLM can produce CGM counseling responses that clinicians rate more highly than clinician-authored ones, which may support patient understanding and pre-consultation preparation in diabetes care. You should focus on integrating such models as adjunct tools for CGM review and patient education, and strictly avoid autonomous therapeutic decision-making or unsupervised real-world deployment because of safety considerations.
Key insights
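In blinded ratings, a retrieval-grounded LLM scored higher than clinicians on responses for CGM-informed diabetes counseling, with notable gaps in empathy and actionability and a similar rate of safety flags.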
Principles
- LLMs can enhance patient education.
- Blinded evaluation reduces bias.
Method
A retrieval-grounded LLM-based conversational agent was developed to generate plain-language responses for CGM interpretation, avoiding individualized therapeutic advice. Responses were evaluated by clinicians in a blinded multi-rater setup.
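As an illustration of what a blinded setup involves, the sketch below shuffles each CA/clinician response pair and hides the source behind neutral item labels, keeping the unblinding key separate from what raters see. This is a hypothetical sketch, not the study's actual procedure: the function name `blind_responses`, the record fields, and the A/B labeling scheme are all assumptions.

```python
import random

def blind_responses(pairs, seed=0):
    """Hypothetical blinding helper (not from the study).

    pairs: list of (question_id, ca_text, clinician_text) tuples.
    Returns (blinded_items, key): raters see only blinded_items,
    while key maps each item_id back to its source for analysis.
    """
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    blinded_items, key = [], {}
    for question_id, ca_text, clinician_text in pairs:
        candidates = [("CA", ca_text), ("clinician", clinician_text)]
        rng.shuffle(candidates)  # randomize which source appears as A or B
        for label, (source, text) in zip("AB", candidates):
            item_id = f"{question_id}-{label}"
            blinded_items.append({"item_id": item_id, "text": text})
            key[item_id] = source  # stored apart from the rater-facing items
    return blinded_items, key

# Usage with placeholder text; real responses would go here.
items, unblinding_key = blind_responses(
    [("q1", "CA answer text", "Clinician answer text")]
)
```

Keeping the key in a separate structure means rating forms can be generated from `items` alone, and sources are only rejoined once all ratings are collected.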
In practice
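- Use LLMs as adjunct tools for pre-consultation preparation, such as plain-language CGM review.
- Apply LLMs to patient education, without individualized therapeutic advice.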
Topics
- Large Language Models
- Continuous Glucose Monitoring
- Diabetes Counseling
- Retrieval-Grounded Systems
- Multi-Rater Evaluation
Best for: AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.