Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports

2024-05-13 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Medical Natural Language Processing · Depth: Expert, extended

Summary

A new two-stage approach significantly improves disease classification accuracy and reasoning in lightweight Large Language Models (LLMs) that analyze radiology reports. The method first applies Supervised Fine-Tuning (SFT) using only disease labels, then uses Group Relative Policy Optimization (GRPO) to refine predictions and encourage explicit reasoning without any direct reasoning supervision. The framework was evaluated on three radiologist-annotated datasets (MIMIC-CXR, NIH-CXR, and MIDRC) using lightweight LLMs such as LLaMA 3.1-8B-Instruct, Qwen2.5-3B-Instruct, and Phi-3-mini-128K-Instruct. SFT consistently outperformed the baselines, and GRPO further raised micro-F1 scores in eight of nine cohorts, with gains of up to 13.2%. GRPO also restored and enhanced reasoning recall and comprehensiveness, which SFT alone often degraded through catastrophic forgetting.
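Micro-F1 pools true positives, false positives, and false negatives across every report and every disease label before computing precision and recall, so frequent findings weigh more than rare ones. A minimal sketch, assuming each report's gold and predicted findings are represented as sets of label strings (the label names below are illustrative, not the datasets' actual label schemas):

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 for multi-label predictions.

    TP/FP/FN counts are summed over all reports and labels first,
    then precision and recall are computed once from the totals.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels both predicted and present
        fp += len(p - g)  # labels predicted but absent
        fn += len(g - p)  # labels present but missed
    if tp == 0:
        return 0.0  # also avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels for three reports (hypothetical data).
gold = [{"pneumonia", "effusion"}, {"cardiomegaly"}, set()]
pred = [{"pneumonia"}, {"cardiomegaly", "edema"}, set()]
print(round(micro_f1(gold, pred), 3))  # → 0.667
```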

Key takeaway

For AI engineers building clinical NLP systems, consider adding a reinforcement learning stage such as GRPO after initial supervised fine-tuning. This approach can substantially improve both the classification accuracy and the explainability of lightweight LLMs in radiology, counteracting the "catastrophic forgetting" of reasoning that SFT alone often causes. The result can be more reliable and transparent diagnostic support, which may help build trust in AI-assisted clinical decision-making.

Key insights

Reinforcement learning with GRPO enhances LLM accuracy and reasoning in medical text classification without explicit reasoning supervision.

Principles

SFT can degrade LLM reasoning capabilities.
Rule-based reward functions can guide LLM behavior.
Ensembling multiple inferences improves robustness.
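Ensembling can be as simple as voting per label across several sampled outputs for the same report. The sketch below is hypothetical: the per-label voting rule, the `threshold` parameter, and the function name are assumptions for illustration, and it omits the summarization step that the method also uses to consolidate inferences.

```python
from collections import Counter
from typing import List, Set


def majority_vote(samples: List[Set[str]], threshold: float = 0.5) -> Set[str]:
    """Keep a label if it appears in more than `threshold` of the samples.

    `samples` holds the label sets from k independent inferences
    on the same report (hypothetical representation).
    """
    counts = Counter(label for s in samples for label in s)
    k = len(samples)
    return {label for label, c in counts.items() if c / k > threshold}


# Three sampled outputs for one report (illustrative data).
samples = [{"pneumonia", "effusion"}, {"pneumonia"}, {"pneumonia", "edema"}]
print(sorted(majority_vote(samples)))  # → ['pneumonia']
```

A strict "more than half" rule drops labels that only a minority of samples produce, which is one way repeated inference can smooth out a single noisy generation.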

Method

A two-stage pipeline: first, SFT on disease labels; second, GRPO with a rule-based reward that scores both classification accuracy and output format, where the required format includes explicit reasoning. Multiple inferences are then consolidated through majority voting and summarization.
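In GRPO as introduced in DeepSeekMath, the model samples a group of completions for each prompt and scores them; each completion's advantage is its reward normalized by the group's mean and standard deviation, so no learned value function is needed. A minimal sketch of that normalization step, assuming one scalar reward per completion; the `eps` term is an added guard against division by zero, and the paper's group size and other hyperparameters are not reproduced here:

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each reward within its group.

    All rewards come from completions sampled for the same prompt,
    so the group mean serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four completions of one report (illustrative values).
rewards = [2.0, 1.0, 0.0, 1.0]
print([round(a, 3) for a in group_relative_advantages(rewards)])
# → [1.414, 0.0, -1.414, 0.0]
```

When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no policy-gradient signal. In full GRPO these advantages then weight a clipped, KL-regularized policy update, which this sketch omits.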

In practice

Implement GRPO after SFT for improved medical text classification.
Use rule-based rewards to encourage specific output formats.
Employ majority voting for robust disease classification.
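A rule-based reward of this kind can be written as plain string checks. The sketch below is illustrative only: the `<think>`/`<answer>` tags, the comma-separated label list, the per-report F1 accuracy term, and the equal weighting of format and accuracy are all assumptions, not taken from the paper.

```python
import re
from typing import Set

# Hypothetical output format: reasoning inside <think> tags, followed by
# a comma-separated list of disease labels inside <answer> tags.
FORMAT_RE = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def parse_labels(answer: str) -> Set[str]:
    """Split a comma-separated answer into normalized label strings."""
    return {t.strip().lower() for t in answer.split(",") if t.strip()}


def rule_based_reward(completion: str, gold: Set[str]) -> float:
    """Format reward (0 or 1) plus per-report F1 accuracy reward (0 to 1)."""
    m = FORMAT_RE.match(completion)
    if m is None:
        return 0.0  # malformed output earns neither format nor accuracy credit
    format_reward = 1.0
    pred = parse_labels(m.group(2))
    gold_norm = {g.lower() for g in gold}
    tp = len(pred & gold_norm)
    if not pred and not gold_norm:
        accuracy_reward = 1.0  # correctly reported no findings
    elif tp == 0:
        accuracy_reward = 0.0
    else:
        precision = tp / len(pred)
        recall = tp / len(gold_norm)
        accuracy_reward = 2 * precision * recall / (precision + recall)
    return format_reward + accuracy_reward


completion = (
    "<think>Right lower lobe opacity suggests pneumonia.</think>\n"
    "<answer>Pneumonia, Effusion</answer>"
)
print(round(rule_based_reward(completion, {"pneumonia"}), 3))  # → 1.667
```

Giving zero total reward to malformed outputs is one design choice; a softer variant could still award partial accuracy credit whenever the answer can be parsed.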

Topics

Reinforcement Learning
LLM Fine-tuning
Radiology Report Analysis
Disease Classification
Group Relative Policy Optimization

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.