Class of LLMs: Benchmarking Large Language Models on the Brazilian National Medical Examination

Source: Paper Index on ACL Anthology · Field: Health & Wellbeing (Health & Medical Research, Medical Devices & Health Technology, Healthcare Systems & Policy) · Depth: Intermediate, quick

Summary

Researchers evaluated 22 proprietary and open-weight large language models (LLMs) on the 2025 Brazilian National Examination for the Evaluation of Medical Training (ENAMED), a high-stakes, government-standardized assessment for medical graduates in Brazil. The exam consists of 90 multiple-choice questions covering Brazilian public health policy, clinical practice, and Portuguese medical terminology. Performance was measured with both standard accuracy and the official Item Response Theory (IRT) framework, which allows direct comparison with human proficiency thresholds. Proprietary frontier models achieved the highest scores, while many open-weight and smaller domain-adapted models fell below the minimum proficiency threshold. Large generalist models consistently outperformed specialized medical fine-tunes, suggesting that general reasoning capacity matters more than narrow domain adaptation for this type of medical assessment. The ENAMED dataset has been openly released.

Key takeaway

AI engineers developing medical LLMs for non-English healthcare systems should evaluate models against high-stakes, language-specific benchmarks such as ENAMED. Focus on strengthening general reasoning in large models rather than relying solely on narrow domain adaptation: generalist LLMs outperformed specialized medical fine-tunes on this Brazilian medical examination.

Key insights

Large generalist LLMs outperformed specialized medical fine-tunes on ENAMED, a high-stakes, language-specific medical exam.

Principles

Method

The 22 LLMs were benchmarked on the 90 multiple-choice questions of the 2025 ENAMED. Performance was measured by standard accuracy and by Item Response Theory (IRT) scoring, the latter enabling comparison with human proficiency thresholds.
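To make the two metrics concrete, here is a minimal sketch of how accuracy and an IRT ability estimate can be computed from the same 0/1 response vector. The summary does not say which IRT model or item parameters ENAMED uses, so the three-parameter logistic (3PL) model, the item parameters, the grid-search estimator, and every function name below are illustrative assumptions, not the exam's official scoring procedure.

```python
import math

def accuracy(responses):
    """Fraction of items answered correctly (responses are 0/1)."""
    return sum(responses) / len(responses)

def p_correct(theta, a, b, c):
    """3PL probability of a correct answer at ability theta.
    a = discrimination, b = difficulty, c = guessing floor."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a response pattern given ability theta."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = p_correct(theta, a, b, c)
        total += math.log(p) if u else math.log(1.0 - p)
    return total

def estimate_theta(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood ability estimate on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical item bank: 90 items, difficulties spread over [-2, 2].
# These parameters are invented for illustration only.
items = [(1.0, -2.0 + 4.0 * i / 89, 0.2) for i in range(90)]

# Two hypothetical models with identical accuracy but different patterns:
# one misses the hardest 30 items, the other misses the easiest 30.
misses_hard = [1] * 60 + [0] * 30
misses_easy = [0] * 30 + [1] * 60

for name, resp in [("misses_hard", misses_hard), ("misses_easy", misses_easy)]:
    print(name, round(accuracy(resp), 3), round(estimate_theta(resp, items), 2))
```

Under a model like this, accuracy counts every item equally, while the ability estimate also depends on which items were answered correctly, which is why the two metrics can rank models differently. Comparing against a human proficiency threshold then amounts to checking the estimated ability against a cut score on the same scale.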

Topics

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.