Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Health & Medical Research · Depth: Advanced, extended

Summary

The study investigates medical domain adaptation strategies for Large Language Models (LLMs) in French medical Question-Answering (QA). Researchers compared continual pretraining (CPT), supervised fine-tuning (SFT), and their combination (CPT+SFT) across three model families (Mistral-7B, Gemma-4B, Llama-7B/13B), multiple sizes, and three initialization types (General, Instruct, Medical). Evaluation covered multiple-choice QA (MCQA) and open-ended QA (OEQA) using automatic metrics and LLM-as-a-Judge. For MCQA, CPT+SFT often achieved the best scores, but gains over SFT were small and frequently not statistically significant, making SFT a cost-effective default. For OEQA, CPT consistently improved overlap-based metrics, while SFT often degraded quality; LLM-based evaluation preferred instruction tuning and CPT+SFT. Cross-lingual experiments showed effective transfer from French adaptation to English benchmarks. The study also found that translated benchmarks inflate accuracy and confidence.
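
The point that CPT+SFT gains over SFT are "frequently not statistically significant" can be made concrete with a paired test on per-question correctness. The sketch below uses a paired bootstrap, which is an assumption here: the summary does not say which significance test the study used, and the function name, resample count, and toy scores are all hypothetical.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Compare two systems scored on the same questions (1 = correct, 0 = wrong).

    Returns (observed accuracy difference B - A,
             fraction of resamples in which B does NOT beat A).
    The second value acts as a one-sided p-value for "B is better than A".
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("both systems must be scored on the same non-empty question set")
    n = len(correct_a)
    rng = random.Random(seed)
    observed = (sum(correct_b) - sum(correct_a)) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_b[i] - correct_a[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return observed, not_better / n_resamples

# Hypothetical scores on 100 questions: SFT gets 70 right, CPT+SFT gets 72 right.
sft_scores = [1] * 70 + [0] * 30
cpt_sft_scores = [1] * 72 + [0] * 28
gain, p_value = paired_bootstrap(sft_scores, cpt_sft_scores)
print(f"gain = {gain:.3f}, one-sided p = {p_value:.3f}")
```

On this toy data the two-point gain rests on only two questions, so the p-value lands well above 0.05: a small accuracy difference of this kind would not justify the extra cost of CPT on its own.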

Key takeaway

Machine learning engineers adapting LLMs for French medical QA should prioritize supervised fine-tuning (SFT) on labeled data for multiple-choice tasks. SFT offers the best performance-efficiency trade-off, often matching CPT+SFT at a much lower compute cost (e.g., $360 vs. $1,500 for 7B models). If open-ended QA is critical, consider continual pretraining (CPT) for better lexical overlap, but interpret the results cautiously because of verbosity bias. Be aware that translated benchmarks can inflate accuracy.

Key insights

SFT offers the best performance-efficiency trade-off for medical MCQA, while CPT improves overlap-based OEQA metrics, though those gains should be read cautiously because of verbosity bias.

Principles

CPT+SFT yields highest MCQA scores, but gains over SFT are often marginal.
Instruction-tuned models are strong baselines for French medical MCQA.
Translated benchmarks inflate accuracy and alter confidence calibration.
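
To check whether a benchmark shifts confidence calibration, one common measure is expected calibration error (ECE). The summary does not state which calibration metric the study reports, so ECE, the ten-bin default, and the function name below are assumptions; this is a minimal sketch rather than the authors' procedure.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin.

    confidences: model probability for its chosen answer, each in [0, 1]
    correct:     1 if the chosen answer was right, else 0
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece

# Hypothetical model that is 90% confident on every question but right only half the time.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
print(f"ECE = {overconfident:.2f}")
```

Computing this separately on a natively French benchmark and on its translated counterpart, with the same model and prompts, gives a direct view of how much the translation changes calibration.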

Method

The study systematically compared CPT, SFT, and CPT+SFT on French medical QA using Mistral, Gemma, and Llama models, varying initialization and evaluating MCQA/OEQA with constrained/greedy decoding and LLM-as-a-Judge.
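
The difference between greedy and constrained decoding for MCQA can be sketched as follows. The summary does not describe how the study implemented either mode, so this is an illustration under stated assumptions: the input is a hypothetical dictionary of next-token log-probabilities, and the four-letter option set and function name are invented for the example.

```python
import math

def constrained_choice(token_logprobs, options=("A", "B", "C", "D")):
    """Pick the most likely option letter, ignoring every non-option token.

    Returns (best_option, confidence), where confidence is the probability of
    the chosen option after renormalising over the option letters only.
    """
    scores = {opt: token_logprobs.get(opt, float("-inf")) for opt in options}
    best = max(scores, key=scores.get)
    top = scores[best]
    if top == float("-inf"):
        raise ValueError("none of the option letters has a log-probability")
    # Softmax restricted to the option letters, shifted by the max for stability.
    normaliser = sum(math.exp(s - top) for s in scores.values())
    return best, 1.0 / normaliser

# Hypothetical next-token log-probs: greedy decoding would emit "The", not a letter.
logprobs = {"The": -0.1, "A": -3.0, "B": -2.0}
greedy_token = max(logprobs, key=logprobs.get)
choice, confidence = constrained_choice(logprobs)
print(greedy_token, choice, round(confidence, 3))
```

Restricting the choice to option letters guarantees a scorable answer on every question, whereas greedy decoding can produce free text that then has to be parsed.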

In practice

Prioritize SFT for medical MCQA due to cost-effectiveness.
Use CPT for OEQA to improve lexical overlap metrics.
Treat scores on translated benchmarks with caution: they inflate accuracy and alter confidence calibration.
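
One way verbosity can distort lexical overlap scores is through recall: a longer answer covers more reference words simply by containing more words. The sketch below uses a plain whitespace-tokenised unigram overlap, which is a simplification and not necessarily any metric used in the study; the function name and example sentences are hypothetical.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Return (precision, recall, f1) of unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "amoxicillin is the first line treatment"
concise = "amoxicillin is first line"
verbose = ("the first line treatment is usually amoxicillin but many other "
           "options exist depending on context")

for name, answer in [("concise", concise), ("verbose", verbose)]:
    p, r, f = unigram_overlap(answer, reference)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The verbose answer reaches full recall while its precision drops, so reporting precision or answer length alongside recall-oriented scores helps separate real content gains from longer outputs.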

Topics

Medical LLMs
Domain Adaptation
Supervised Fine-Tuning
Continual Pretraining
French Medical QA
Cross-lingual Transfer

Code references

ikram28/MedAdapt

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.