Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports
Summary
A two-stage approach improves both disease classification accuracy and reasoning in lightweight Large Language Models (LLMs) that analyze radiology reports. The method first applies Supervised Fine-Tuning (SFT) using only disease labels, then applies Group Relative Policy Optimization (GRPO) to refine predictions and encourage explicit reasoning without direct reasoning supervision. The framework was evaluated on three radiologist-annotated datasets (MIMIC-CXR, NIH-CXR, and MIDRC) using lightweight LLMs such as LLaMA 3.1-8B-Instruct, Qwen 2.5-3B-Instruct, and Phi-3 Mini-128K-Instruct. SFT consistently outperformed baselines, and GRPO further raised micro-F1 scores in eight of nine cohorts, with gains of up to 13.2%. GRPO also restored and enhanced reasoning recall and comprehensiveness, which SFT alone often degraded due to catastrophic forgetting.
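Because the headline results are reported as micro-F1, a short worked example may help readers interpret them. The sketch below assumes a multi-label setup in which each report maps to a set of disease labels; the label names and sample data are illustrative and do not come from the paper.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives
    across all reports and labels, then compute a single F1 score."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present
        fp += len(p - g)   # labels predicted but absent
        fn += len(g - p)   # labels present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Illustrative data only (not from the paper).
gold = [{"pneumonia", "edema"}, {"atelectasis"}, set()]
pred = [{"pneumonia"}, {"atelectasis", "edema"}, set()]
score = micro_f1(gold, pred)  # tp=2, fp=1, fn=1
```

Micro-averaging weights every label decision equally, so frequent findings influence the score more than rare ones.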
Key takeaway
AI engineers building clinical NLP solutions should consider adding a reinforcement learning stage such as GRPO after initial supervised fine-tuning. In this study, that step improved both the classification accuracy and the explainability of lightweight LLMs on radiology reports, and it counteracted the catastrophic forgetting of reasoning that SFT alone often caused. Models trained this way can offer more reliable and transparent diagnostic support, which may help build trust in AI-assisted clinical decision-making.
Key insights
Reinforcement learning with GRPO enhances LLM accuracy and reasoning in medical text classification without explicit reasoning supervision.
Principles
- SFT can degrade LLM reasoning capabilities.
- Rule-based reward functions can guide LLM behavior.
- Ensembling multiple inferences improves robustness.
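To make the rule-based reward principle concrete, here is a minimal sketch of a reward that scores output format and classification accuracy. The `<think>`/`<answer>` template, the Jaccard-overlap accuracy term, and the weights are all assumptions chosen for illustration; the summary does not specify the paper's exact reward design.

```python
import re
from typing import Set

# Hypothetical output template: reasoning in <think>, labels in <answer>.
PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def parse_labels(answer: str) -> Set[str]:
    """Split a comma-separated answer into a normalized label set."""
    return {x.strip().lower() for x in answer.split(",") if x.strip()}


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, gold: Set[str]) -> float:
    """Jaccard overlap between predicted and gold labels (an assumed metric)."""
    m = PATTERN.match(completion.strip())
    if m is None:
        return 0.0
    pred = parse_labels(m.group(2))
    if not pred and not gold:
        return 1.0  # correctly predicted "no findings"
    return len(pred & gold) / len(pred | gold)


def total_reward(completion: str, gold: Set[str],
                 w_format: float = 0.5, w_acc: float = 1.0) -> float:
    """Weighted sum of the two rule-based terms (weights are illustrative)."""
    return (w_format * format_reward(completion)
            + w_acc * accuracy_reward(completion, gold))


good = "<think>Right lower lobe opacity suggests pneumonia.</think><answer>pneumonia</answer>"
partial = "<think>Opacity noted.</think><answer>pneumonia, edema</answer>"
bare = "pneumonia"
```

Because the format term only fires when a reasoning span is present, the policy is nudged to explain itself even though no reference reasoning text is ever supplied.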
Method
A two-stage pipeline: first, SFT on disease labels; second, GRPO with a reward function that scores classification accuracy and adherence to an output format that includes explicit reasoning. At inference time, majority voting and summarization consolidate multiple inferences.
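The core of GRPO is that each sampled completion is scored relative to the other completions drawn for the same input, so no separate value model is needed. The sketch below shows only that group-relative normalization step, assuming scalar rewards are already available; the function name is hypothetical, the reward values are made up, and the clipped policy-gradient update and KL penalty of a full GRPO trainer are omitted.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of completions for the same report.

    Advantage_i = (r_i - mean(r)) / (std(r) + eps). Population std is used
    here; implementations differ on population vs. sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four completions sampled from one report.
rewards = [1.5, 1.0, 0.0, 1.5]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average receive positive advantages and are reinforced; those below average are discouraged. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.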
In practice
- Implement GRPO after SFT for improved medical text classification.
- Use rule-based rewards to encourage specific output formats.
- Employ majority voting for robust disease classification.
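The majority-voting step can be sketched as follows, assuming each inference run yields a set of predicted disease labels for the same report. The strict more-than-half threshold and the sample runs are assumptions for illustration, and the summarization of the reasoning text mentioned in the method is not covered here.

```python
from collections import Counter
from typing import List, Set


def majority_vote(predictions: List[Set[str]]) -> Set[str]:
    """Keep each label predicted by strictly more than half of the runs."""
    counts = Counter(label for pred in predictions for label in pred)
    threshold = len(predictions) / 2
    return {label for label, c in counts.items() if c > threshold}


# Five hypothetical inference runs on the same report.
runs = [
    {"pneumonia", "edema"},
    {"pneumonia"},
    {"pneumonia", "atelectasis"},
    {"edema"},
    {"pneumonia", "edema"},
]
final = majority_vote(runs)  # pneumonia: 4 votes, edema: 3, atelectasis: 1
```

Voting per label rather than per full label set lets a finding survive even when runs disagree on the other findings in the same report.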
Topics
- Reinforcement Learning
- LLM Fine-tuning
- Radiology Report Analysis
- Disease Classification
- Group Relative Policy Optimization
Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.