Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO
Summary
A new variance-aware reward framework improves heart-focused medical question answering in Large Language Models (LLMs) by strengthening the reward signal used in Group Relative Policy Optimization (GRPO). Developed by Arash Ahmadi et al., the framework extends existing rubric-based supervision by replacing weighted binary criterion aggregation and single Likert-style scoring with continuous analytical reward functions, which provide richer, more stable optimization signals when feedback is sparse and multi-criteria. Starting from a Qwen3-14B base model, the best GRPO variant reached an accuracy of 0.502 and an F1 score of 0.668 on a held-out heart-related HealthBench subset, up from the base model's 0.362 accuracy and 0.532 F1 (absolute gains of 0.140 and 0.136). This is competitive with the much larger GPT-OSS-120B (0.508 accuracy, 0.674 F1), while the optimized 14B-parameter model remains deployable on a single workstation GPU such as the NVIDIA RTX 6000 PRO.
Key takeaway
If you are a Machine Learning Engineer building medical LLMs and reinforcement learning stalls on sparse rewards in multi-criteria clinical tasks, consider variance-aware rubric rewards. In this work the approach raised accuracy on heart-focused QA from 0.362 to 0.502, letting a smaller model, Qwen3-14B, perform comparably to much larger frontier models while remaining deployable on a single workstation GPU. Design continuous reward functions that account for partial credit and rubric complexity.
Key insights
Continuous, variance-aware rubric rewards enable stable reinforcement learning for LLMs in complex, multi-criteria medical QA tasks.
Principles
- Sparse, binary rewards hinder RL optimization in multi-criteria tasks.
- Partial credit and complexity awareness improve reward signals.
- Domain-specific post-training can match larger general models.
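To make the first two principles concrete, here is a minimal sketch of the standard GRPO group-relative advantage (each reward standardized against the other responses sampled for the same prompt). The group of four responses, the five-criterion rubric, and the helper name `group_relative_advantages` are assumptions for illustration and are not taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: 4 responses to one prompt, each judged on 5 criteria.
criteria_met = [3, 4, 2, 4]
num_criteria = 5

# Sparse reward: 1 only when every criterion is satisfied.
sparse = [1.0 if met == num_criteria else 0.0 for met in criteria_met]

# Partial-credit reward: fraction of criteria satisfied.
partial = [met / num_criteria for met in criteria_met]

sparse_adv = group_relative_advantages(sparse)
partial_adv = group_relative_advantages(partial)

print("sparse advantages: ", sparse_adv)
print("partial advantages:", partial_adv)
```

In this toy group no response meets every criterion, so the sparse rewards are identical, the group has zero variance, and every advantage is zero: the policy gets no learning signal. The partial-credit rewards differ across responses, so the same group yields non-zero advantages that rank the responses.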
Method
The method first applies supervised fine-tuning (SFT) so that the model produces structured output, then trains with GRPO. An LLM judge makes a binary decision for each rubric criterion, and these decisions are transformed into continuous, variance-aware rewards (the Complexity-aware or Hybrid variants) for policy optimization.
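The step from binary judge decisions to a continuous reward can be sketched as follows. This summary does not give the paper's reward formulas, so everything here is an assumption for illustration: `partial_credit` is a plain weighted fraction of rubric points, and `complexity_aware_reward` is a hypothetical shaping (with a made-up `alpha` parameter) that softens the penalty on rubrics with many criteria. Neither is the authors' Complexity-aware or Hybrid reward.

```python
import math


def partial_credit(decisions, weights):
    """Weighted fraction of positive rubric points earned, clipped to [0, 1].

    decisions: list of bools from the judge, one per criterion.
    weights:   list of point values; negative values penalize bad behavior.
    """
    earned = sum(w for met, w in zip(decisions, weights) if met)
    possible = sum(w for w in weights if w > 0)
    if possible == 0:
        return 0.0
    return min(1.0, max(0.0, earned / possible))


def complexity_aware_reward(decisions, weights, alpha=0.5):
    """Hypothetical shaping, NOT the paper's formula.

    Raises the partial-credit score to an exponent that shrinks as the
    number of criteria grows, so partial success on a long rubric is
    rewarded more than the same fraction on a short one.
    """
    base = partial_credit(decisions, weights)
    n = max(1, len(decisions))
    exponent = 1.0 / (1.0 + alpha * math.log(n))
    return base ** exponent


# Hypothetical judge output for one response on a 3-criterion rubric.
decisions = [True, False, True]
weights = [5.0, 3.0, -2.0]  # the last criterion is a penalty that was triggered

base = partial_credit(decisions, weights)  # (5 - 2) / (5 + 3) = 0.375
shaped = complexity_aware_reward(decisions, weights)
print(base, shaped)
```

Either score could then be standardized within each sampled group to form GRPO advantages. The point of the sketch is only that the reward varies smoothly with the number and weight of criteria met, instead of collapsing to a single pass/fail bit.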
In practice
- Implement continuous reward functions for multi-criteria RL.
- Use SFT as a warm-start for structured LLM outputs.
- Filter and augment domain-specific datasets for relevance.
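As one way to approach the filtering item above, the sketch below selects heart-related examples with a keyword pattern. The keyword list, the `prompt` field name, and both helper functions are hypothetical; this summary does not describe how the authors built their heart-related subset, and a real pipeline would likely need clinical review or a classifier on top.

```python
import re

# Hypothetical keyword stems; extend or replace after clinical review.
HEART_TERMS = ["heart", "cardiac", "cardio", "arrhythmia",
               "myocardial", "angina", "atrial"]
HEART_PATTERN = re.compile(
    r"\b(" + "|".join(HEART_TERMS) + r")\w*", re.IGNORECASE
)


def is_heart_related(text):
    """Return True if the text contains any heart-related keyword stem."""
    return HEART_PATTERN.search(text) is not None


def filter_heart_examples(examples, field="prompt"):
    """Keep only examples whose text field matches the keyword pattern."""
    return [ex for ex in examples if is_heart_related(ex[field])]


# Toy examples standing in for a medical QA dataset.
examples = [
    {"prompt": "What are early signs of a myocardial infarction?"},
    {"prompt": "How should I treat a sprained ankle?"},
    {"prompt": "Is cardiomyopathy hereditary?"},
]

heart_only = filter_heart_examples(examples)
print(len(heart_only), "of", len(examples), "examples kept")
```

Stem matching is deliberately loose, so it also admits false positives such as "heartburn". Treat it as a first-pass filter to be followed by manual or model-based review.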
Topics
- Large Language Models
- Medical Question Answering
- Reinforcement Learning
- Group Relative Policy Optimization
- Rubric-based Rewards
- Healthcare AI
- Qwen3-14B
Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.