Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA
Summary
An empirical study investigates trade-offs in adapting medical Large Language Models (LLMs), focusing on French medical question answering (QA). The researchers compared continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types. For multiple-choice QA (MCQA), CPT+SFT generally achieved the highest scores, but the additional gains from CPT were small and often non-significant, making SFT alone a strong, cost-effective default. For open-ended QA (OEQA), CPT consistently improved overlap-based metrics, while SFT frequently degraded generation quality; LLM-based evaluation preferred instruction tuning and CPT+SFT. The study also demonstrated effective cross-lingual transfer from French adaptation to English benchmarks, and it provides practical guidelines for selecting adaptation strategies under computational constraints.
Key takeaway
If you are a Machine Learning Engineer adapting LLMs for medical or other specialized question answering, match your adaptation strategy to the QA task type and your computational budget. For multiple-choice QA, supervised fine-tuning (SFT) alone is a cost-effective default: adding continual pretraining (CPT) yields only small, often non-significant gains. For open-ended QA, prefer CPT or CPT+SFT: CPT improved overlap-based metrics and CPT+SFT was among the options preferred by LLM-as-a-Judge evaluation, whereas SFT alone frequently degraded generation quality. The study also found that French adaptation transferred to English benchmarks, so domain adaptation in one language may carry over to another.
Key insights
For medical LLM adaptation, SFT is a cost-effective choice for MCQA, while CPT or CPT+SFT is preferred for OEQA; French adaptation also transfers to English benchmarks.
Principles
- Adaptation effects should be disentangled from base model choice.
- CPT consistently improves overlap-based metrics in OEQA.
- SFT can degrade open-ended generation quality.
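To make "overlap-based metrics" concrete, here is a minimal sketch of one such metric, token-level F1 between a generated answer and a reference. The summary does not say which overlap metrics the study used, so treat `token_f1` and its whitespace tokenization as illustrative assumptions rather than the study's scoring code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a generated answer and a reference answer.

    Illustrative only: lowercased whitespace tokenization is an assumption,
    not the tokenization used in the study.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A metric like this rewards lexical match with the reference and ignores paraphrase, which may be one reason overlap scores and LLM-based judgments do not always favor the same adaptation strategy.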
Method
Empirically compare CPT, SFT, and CPT+SFT across model families, sizes, and initialization types. Evaluate MCQA and OEQA using automatic metrics and LLM-as-a-Judge under greedy/constrained decoding.
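The comparison described above can be pictured as a grid of runs in which every adaptation strategy is applied to the same base configuration. The sketch below enumerates such a grid; the family, size, and initialization names are hypothetical placeholders (the summary gives only the counts, and "multiple sizes" is represented here by two), and `build_grid` and `group_by_base` are illustrative helpers.

```python
from itertools import product

# Placeholder labels: the summary states three families, multiple sizes,
# and three initialization types, but does not name them.
FAMILIES = ["family_a", "family_b", "family_c"]
SIZES = ["small", "large"]  # assumption: two sizes stand in for "multiple"
INITS = ["init_1", "init_2", "init_3"]
STRATEGIES = ["CPT", "SFT", "CPT+SFT"]
TASKS = ["MCQA", "OEQA"]


def build_grid():
    """Enumerate every (family, size, init, strategy, task) combination."""
    return [
        {"family": f, "size": s, "init": i, "strategy": a, "task": t}
        for f, s, i, a, t in product(FAMILIES, SIZES, INITS, STRATEGIES, TASKS)
    ]


def group_by_base(grid):
    """Group runs so strategies are compared within one base setup.

    Holding family, size, initialization, and task fixed is what separates
    the effect of the adaptation strategy from the choice of base model.
    """
    groups = {}
    for run in grid:
        key = (run["family"], run["size"], run["init"], run["task"])
        groups.setdefault(key, []).append(run["strategy"])
    return groups
```

Comparing strategies only within a group keeps the base model constant, so a score difference can be attributed to the adaptation method rather than to the model it started from.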
In practice
- Use SFT for cost-effective MCQA adaptation.
- Prefer CPT or CPT+SFT for OEQA generation quality.
- Expect domain adaptation in French to transfer to English benchmarks.
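Before paying for CPT on an MCQA system, it can help to check whether its gain over SFT alone is distinguishable from noise on your own test set. The sketch below uses a paired bootstrap over per-question correctness; the summary does not state which significance test the study used, so this is one common choice, and `paired_bootstrap_p` is a hypothetical helper.

```python
import random


def paired_bootstrap_p(correct_a, correct_b, n_resamples=2000, seed=0):
    """Fraction of bootstrap resamples in which system B does NOT beat A.

    correct_a, correct_b: per-question 0/1 correctness for the same
    questions in the same order (e.g. A = SFT, B = CPT+SFT).
    A small return value suggests B's gain is unlikely to be noise.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")

    rng = random.Random(seed)  # fixed seed for reproducible resampling
    n = len(correct_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_b[i] - correct_a[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples
```

If the returned fraction stays well above a conventional threshold such as 0.05, the SFT-only model is a reasonable default for that test set.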
Topics
- Medical LLM Adaptation
- French QA
- Continual Pretraining
- Supervised Fine-tuning
- Question Answering
- Cross-lingual Transfer
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.