Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA
Summary
The study investigates medical domain adaptation strategies for Large Language Models (LLMs) in French medical Question-Answering (QA). Researchers compared continual pretraining (CPT), supervised fine-tuning (SFT), and their combination (CPT+SFT) across three model families (Mistral-7B, Gemma-4B, Llama-7B/13B), multiple sizes, and three initialization types (General, Instruct, Medical). Evaluation covered multiple-choice QA (MCQA) and open-ended QA (OEQA) using automatic metrics and LLM-as-a-Judge. For MCQA, CPT+SFT often achieved the best scores, but gains over SFT were small and frequently not statistically significant, making SFT a cost-effective default. For OEQA, CPT consistently improved overlap-based metrics, while SFT often degraded quality; LLM-based evaluation preferred instruction tuning and CPT+SFT. Cross-lingual experiments showed effective transfer from French adaptation to English benchmarks. The study also found that translated benchmarks inflate accuracy and confidence.
Key takeaway
Machine Learning Engineers adapting LLMs for French medical QA should prioritize Supervised Fine-Tuning (SFT) on labeled data for multiple-choice tasks. SFT offers the best performance-efficiency trade-off, often matching CPT+SFT at significantly lower computational cost (e.g., $360 vs. $1,500 for 7B models). If open-ended QA is critical, consider Continual Pretraining (CPT) for better lexical overlap, but interpret results cautiously due to verbosity bias. Be aware that translated benchmarks can inflate accuracy.
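To make the cost trade-off concrete, the short sketch below works through the arithmetic on the two quoted figures. It reads $360 as the SFT cost and $1,500 as the CPT+SFT cost for a 7B model, which is an interpretation of the summary rather than a confirmed breakdown. The helper names and the 1.0-point accuracy gain are hypothetical and are not results from the study.

```python
# Dollar figures quoted in the takeaway for 7B models (read as SFT vs. CPT+SFT).
SFT_COST_USD = 360
CPT_SFT_COST_USD = 1500


def cost_ratio(cheap: float, expensive: float) -> float:
    """Return how many times more the expensive pipeline costs."""
    return expensive / cheap


def extra_cost_per_point(cheap: float, expensive: float, gain_points: float) -> float:
    """Extra dollars spent per accuracy point gained.

    gain_points is an illustrative input, not a number reported by the study.
    """
    if gain_points <= 0:
        return float("inf")  # no gain: the extra spend buys nothing
    return (expensive - cheap) / gain_points


ratio = cost_ratio(SFT_COST_USD, CPT_SFT_COST_USD)
print(f"CPT+SFT costs about {ratio:.1f}x as much as SFT")

# Hypothetical: suppose CPT+SFT adds 1.0 accuracy point over SFT.
print(extra_cost_per_point(SFT_COST_USD, CPT_SFT_COST_USD, gain_points=1.0))
```

Under that hypothetical one-point gain, each extra point would cost $1,140, which is one way to frame why a small, non-significant improvement may not justify the added pretraining stage.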
Key insights
SFT offers the best performance-efficiency trade-off for medical MCQA, while CPT benefits OEQA but with caveats.
Principles
- CPT+SFT often yields the highest MCQA scores, but gains over SFT are usually marginal and frequently not statistically significant.
- Instruction-tuned models are strong baselines for French medical MCQA.
- Translated benchmarks inflate accuracy and alter confidence calibration.
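One way to check whether a small MCQA gain is more than noise is a paired bootstrap over per-question correctness. The sketch below is a generic illustration; the summary does not say which significance test the study used, and the function name, toy data, and defaults are assumptions.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy difference (B minus A).

    correct_a and correct_b are 0/1 lists aligned on the same questions.
    Returns (observed_difference, ci_low, ci_high).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equally long 0/1 lists")
    n = len(correct_a)
    rng = random.Random(seed)
    per_item = [b - a for a, b in zip(correct_a, correct_b)]
    observed = sum(per_item) / n

    resampled = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        total = 0
        for _ in range(n):
            total += per_item[rng.randrange(n)]
        resampled.append(total / n)
    resampled.sort()

    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return observed, resampled[lo_idx], resampled[hi_idx]


# Toy data (not from the study): 300 questions, the systems differ on three.
sft = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1] * 30
cpt_sft = list(sft)
cpt_sft[2] = 1   # CPT+SFT fixes two errors...
cpt_sft[4] = 1
cpt_sft[0] = 0   # ...and introduces one new error.

diff, low, high = paired_bootstrap_ci(sft, cpt_sft)
print(f"accuracy difference = {diff:+.4f}, 95% CI = [{low:+.4f}, {high:+.4f}]")
```

If the interval contains zero, the gain would usually be reported as not statistically significant at the chosen level.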
Method
The study systematically compared CPT, SFT, and CPT+SFT on French medical QA using Mistral, Gemma, and Llama models, varying initialization and evaluating MCQA/OEQA with constrained/greedy decoding and LLM-as-a-Judge.
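The difference between greedy and constrained decoding for MCQA can be shown without a model. The sketch below assumes single-answer questions, option labels A to E, and a dictionary of next-token scores; all of these, including the function names and toy logits, are illustrative and not the study's implementation.

```python
import math


def greedy_pick(logits: dict) -> str:
    """Greedy decoding: take the highest-scoring token in the whole vocabulary."""
    return max(logits, key=logits.get)


def constrained_pick(logits: dict, options=("A", "B", "C", "D", "E")):
    """Constrained decoding: only option labels may be emitted.

    Returns the chosen label and its probability after renormalizing
    over the allowed labels (a softmax restricted to the options).
    """
    allowed = {o: logits[o] for o in options if o in logits}
    if not allowed:
        raise ValueError("none of the option labels appear in the logits")
    top = max(allowed.values())
    exps = {o: math.exp(score - top) for o, score in allowed.items()}  # stable softmax
    z = sum(exps.values())
    probs = {o: e / z for o, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]


# Toy next-token scores (hypothetical): the model "wants" to start a sentence.
toy_logits = {"La": 3.1, "A": 1.2, "B": 2.7, "C": 0.3, "D": -0.5, "E": -1.0}

print(greedy_pick(toy_logits))        # picks the free-text token "La"
label, confidence = constrained_pick(toy_logits)
print(label, round(confidence, 3))    # picks among A-E only
```

Constrained decoding guarantees a parseable answer, and the renormalized probability doubles as a confidence score for calibration analysis.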
In practice
- Prioritize SFT for medical MCQA due to cost-effectiveness.
- Use CPT for OEQA to improve lexical overlap metrics.
- Be cautious with translated benchmarks for evaluation.
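To see whether a translated benchmark shifts calibration as well as accuracy, one common measure is expected calibration error (ECE). The sketch below is a minimal equal-width-bin version; whether the study used ECE specifically is not stated in this summary, and the function name and toy numbers are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted probability of the chosen answer, in [0, 1].
    correct: 0/1 flags aligned with confidences.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty, equally long lists")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy numbers (not from the study): same model, two versions of a benchmark.
native_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
translated_ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0])
print(native_ece, translated_ece)
```

Reporting ECE next to accuracy for both the native and the translated version of a benchmark makes it easier to spot cases where translation changes how confident the model appears.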
Topics
- Medical LLMs
- Domain Adaptation
- Supervised Fine-Tuning
- Continual Pretraining
- French Medical QA
- Cross-lingual Transfer
Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.