Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

2026-06-17 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

An empirical study investigates trade-offs in medical large language model (LLM) adaptation, focusing on French medical question answering (QA). The researchers compared continual pretraining (CPT), supervised fine-tuning (SFT), and their combination (CPT+SFT) across three model families, multiple model sizes, and three initialization types. For multiple-choice QA (MCQA), CPT+SFT generally achieved the highest scores. However, SFT alone proved a strong, cost-effective default, because adding CPT yielded only small, often non-significant gains. In open-ended QA (OEQA), CPT consistently improved overlap-based metrics, while SFT frequently degraded generation quality. LLM-as-a-Judge evaluation favored instruction-tuned and CPT+SFT models. The study also showed effective cross-lingual transfer from French adaptation to English benchmarks, and it distills practical guidelines for choosing an adaptation strategy under computational constraints.
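Whether a CPT gain on MCQA counts as "non-significant" depends on a significance test run over the same set of questions. The summary does not say which test the study used. Below is a minimal sketch that assumes a one-sided paired bootstrap over per-question correctness. This is a common choice but not necessarily the authors' method. The function name, inputs, and toy data are illustrative.

```python
import random


def paired_bootstrap_pvalue(correct_a, correct_b, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: how often does system B fail to beat system A?

    correct_a / correct_b: per-question 0/1 correctness on the SAME MCQA items
    (e.g. a hypothetical SFT model vs. a hypothetical CPT+SFT model).
    Returns (observed_gain, p_value); p_value is the fraction of resampled
    question sets in which B's accuracy gain over A is <= 0.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equally long correctness lists")
    n = len(correct_a)
    # Per-question difference: +1 if only B is right, -1 if only A is right.
    diffs = [b - a for a, b in zip(correct_a, correct_b)]
    observed_gain = sum(diffs) / n
    rng = random.Random(seed)  # fixed seed for reproducible resampling
    not_better = 0
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return observed_gain, not_better / n_resamples


# Toy, made-up correctness vectors for 10 questions (not data from the study).
sft = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
cpt_sft = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
gain, p = paired_bootstrap_pvalue(sft, cpt_sft)
print(f"accuracy gain={gain:.2f}, bootstrap p={p:.3f}")
```

In this toy case only one question flips, so roughly a third of the resamples show no gain. A 10-point accuracy gain on 10 questions is therefore far from significant. The same logic applies on a real benchmark, with many more questions.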

Key takeaway

For machine learning engineers adapting LLMs to medical or other specialized question answering, the adaptation strategy should match the QA task type and the computational budget. For multiple-choice QA systems, supervised fine-tuning (SFT) is a cost-effective default, because adding continual pretraining (CPT) yields only small, often non-significant gains. For open-ended QA, prioritize one of two options. CPT consistently improves overlap-based metrics. CPT+SFT is the option LLM-as-a-Judge evaluation favored. Avoid relying on SFT alone, which frequently degraded generation quality. Adaptation in French also transferred to English benchmarks, which broadens where an adapted model can be used.

Key insights

For medical LLM adaptation, SFT is cost-effective for MCQA, while CPT or CPT+SFT is preferred for OEQA. Gains from French adaptation also transfer to English benchmarks.

Principles

Adaptation effects should be disentangled from base model choice.
CPT consistently improves overlap-based metrics in OEQA.
SFT can degrade open-ended generation quality.
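One way to "disentangle adaptation effects from base model choice" is to score each adapted model against its own starting checkpoint before averaging across checkpoints. The sketch below assumes a hypothetical score table keyed by (checkpoint, strategy), where "none" marks the unadapted starting model. The checkpoint names and numbers are placeholders, not results from the study.

```python
from collections import defaultdict
from statistics import mean


def adaptation_gains(scores):
    """Average each strategy's gain over the SAME starting checkpoint.

    scores: dict mapping (checkpoint, strategy) -> metric value, where
    strategy is "none" (hypothetical label for the unadapted starting model)
    or an adaptation such as "SFT", "CPT", "CPT+SFT".
    Comparing each adapted model only with its own starting point keeps a
    strong base model from inflating a strategy's apparent effect.
    """
    base = {ckpt: s for (ckpt, strat), s in scores.items() if strat == "none"}
    gains = defaultdict(list)
    for (ckpt, strat), s in scores.items():
        # Skip the baselines themselves and adapted models with no baseline.
        if strat != "none" and ckpt in base:
            gains[strat].append(s - base[ckpt])
    return {strat: mean(deltas) for strat, deltas in gains.items()}


# Placeholder MCQA accuracies (%), not results from the study.
scores = {
    ("model-a", "none"): 40.0, ("model-a", "SFT"): 50.0,
    ("model-b", "none"): 60.0, ("model-b", "SFT"): 64.0,
}
print(adaptation_gains(scores))  # {'SFT': 7.0}
```

Here model-b's SFT score is higher in absolute terms, but SFT added less to it than to model-a. Averaging raw scores would blur those two effects together.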

Method

Empirically compare CPT, SFT, and CPT+SFT across three model families, multiple sizes, and three initialization types. Evaluate MCQA and OEQA with automatic metrics (including overlap-based metrics for OEQA) and LLM-as-a-Judge, using greedy or constrained decoding.
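The two evaluation modes can be sketched as follows. The summary names neither the study's overlap metrics nor its constraint mechanism, so this sketch makes two assumptions. For MCQA, the answer is chosen by restricting the prediction to valid option letters, using per-option log-probabilities from some model scoring step. For OEQA, ROUGE-L F1 on lowercase whitespace tokens serves as a representative overlap metric. The function names, the single-answer setting, and the A to E option set are all illustrative.

```python
def constrained_choice(option_logprobs, allowed=("A", "B", "C", "D", "E")):
    """Single-answer MCQA: pick the highest-scoring label among allowed options.

    option_logprobs: hypothetical dict of candidate string -> log-probability
    from a model scoring step; labels outside `allowed` are ignored, which
    mimics decoding constrained to valid option labels.
    """
    candidates = {k: v for k, v in option_logprobs.items() if k in allowed}
    if not candidates:
        raise ValueError("no allowed option was scored")
    return max(candidates, key=candidates.get)


def lcs_length(a, b):
    # Dynamic-programming longest common subsequence length over token lists.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 on lowercase whitespace tokens (no stemming, no stopwords)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up scores: "Z" is not an allowed label and is ignored.
print(constrained_choice({"A": -2.1, "B": -0.3, "Z": 0.0}))  # B
print(round(rouge_l_f1("le patient a de la fièvre",
                       "le patient présente de la fièvre"), 3))  # 0.833
```

Library implementations such as the `rouge-score` package apply their own tokenization and optional stemming. Scores from this sketch may therefore not match them exactly, especially on accented French text.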

In practice

Use SFT for cost-effective MCQA adaptation.
Prefer CPT or CPT+SFT for OEQA generation quality.
Expect cross-lingual transfer from French domain adaptation to English benchmarks.
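The practical guidelines above can be encoded as a small lookup. This is a hypothetical helper that restates the summary's recommendations, not a tool from the study. The parameter names and the yes/no budget flag are simplifications of a real cost analysis.

```python
def suggest_strategy(task, primary_eval="llm_judge", cpt_budget=False):
    """Hypothetical helper restating the summary's guidelines.

    task: "MCQA" or "OEQA".
    primary_eval: for OEQA, "overlap" (overlap-based metrics) or "llm_judge".
    cpt_budget: simplified flag for whether continual pretraining is affordable.
    """
    if task == "MCQA":
        # CPT+SFT generally scored highest, but CPT's extra gains were small
        # and often non-significant, so SFT alone is the cost-effective default.
        return "CPT+SFT" if cpt_budget else "SFT"
    if task == "OEQA":
        # CPT consistently improved overlap-based metrics; LLM-as-a-Judge
        # favored CPT+SFT (and instruction-tuned models). SFT alone is avoided
        # because it frequently degraded generation quality.
        return "CPT" if primary_eval == "overlap" else "CPT+SFT"
    raise ValueError(f"unknown task type: {task!r}")


print(suggest_strategy("MCQA"))                          # SFT
print(suggest_strategy("OEQA", primary_eval="overlap"))  # CPT
```

The summary offers no cheaper OEQA route beyond noting that LLM-as-a-Judge also favored instruction-tuned models. For that reason, the helper does not branch on budget for OEQA.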

Topics

Medical LLM Adaptation
French QA
Continual Pretraining
Supervised Fine-tuning
Question Answering
Cross-lingual Transfer

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.