Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research, Medical Devices & Health Technology · Depth: Expert, extended

Summary

A new Variance-Aware Reward Framework improves heart-focused medical question answering in Large Language Models (LLMs) by enhancing Group Relative Policy Optimization (GRPO). The framework, developed by Arash Ahmadi et al., extends existing rubric-based supervision by replacing weighted binary criterion aggregation and single Likert-style scoring with continuous analytical reward functions, which give richer, more stable optimization signals when feedback is sparse and multi-criteria. Applied to a Qwen3-14B base model, the best GRPO variant reached an accuracy of 0.502 and an F1 score of 0.668 on a held-out heart-related HealthBench subset, up from the base model's 0.362 accuracy and 0.532 F1. This is competitive with the much larger GPT-OSS-120B (0.508 accuracy, 0.674 F1), while the optimized 14B-parameter model remains deployable on a single workstation GPU such as the NVIDIA RTX 6000 PRO.

Key takeaway

If you are a machine learning engineer building medical LLMs and are struggling with sparse rewards in reinforcement learning for multi-criteria clinical tasks, consider variance-aware rubric rewards. On heart-focused QA, this approach raised Qwen3-14B accuracy from 0.362 to 0.502, close to the much larger GPT-OSS-120B (0.508), while the model remains deployable on a single workstation GPU. Design continuous reward functions that account for partial credit and rubric complexity.

Key insights

Continuous, variance-aware rubric rewards enable stable reinforcement learning for LLMs in complex, multi-criteria medical QA tasks.

Principles

Sparse, binary rewards hinder RL optimization in multi-criteria tasks.
Partial credit and complexity awareness improve reward signals.
Domain-specific post-training can match larger general models.
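
The first two principles can be illustrated with the group-relative advantage used in the standard GRPO formulation, where each sampled answer's reward is standardized against the other answers for the same prompt. The sketch below is not taken from the paper's code; the function name, the `eps` term, and the reward values are assumptions chosen for illustration. It shows that when every answer in a group receives the same all-or-nothing reward, the advantages collapse to zero and the policy gets no learning signal, whereas partial credit restores a usable spread.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of answers sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps (an assumed stabilizer) avoids division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical all-or-nothing rewards: every sampled answer misses the pass threshold.
binary_rewards = [0.0, 0.0, 0.0, 0.0]

# Hypothetical partial-credit rewards for the same four answers.
partial_rewards = [0.2, 0.4, 0.6, 0.4]

print(group_relative_advantages(binary_rewards))   # every advantage is 0.0
print(group_relative_advantages(partial_rewards))  # first answer negative, third positive
```

With identical rewards the group variance is zero, so no answer is preferred over another; this is the failure mode that continuous rewards are meant to reduce.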

Method

The method first applies supervised fine-tuning (SFT) so that the model produces structured output, then trains it with GRPO. An LLM judge makes a binary decision for each rubric criterion, and these criterion-level decisions are transformed into continuous, variance-aware rewards (the Complexity-aware or Hybrid variants) for policy optimization.
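
As a rough sketch of the judge-to-reward step, the code below turns per-criterion binary verdicts into a continuous score. The paper's actual Complexity-aware and Hybrid formulas are not reproduced in this summary, so everything here is an assumption: the `Criterion` type, the partial-credit rule (earned points over achievable positive points, clipped to [0, 1]), and especially the log-based complexity scaling with its `alpha` parameter, which is a placeholder and not the authors' function.

```python
import math
from dataclasses import dataclass

@dataclass
class Criterion:
    points: float  # rubric weight; negative values penalize undesirable behavior
    met: bool      # binary verdict from the LLM judge

def partial_credit(criteria):
    """Share of achievable positive points earned, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))

def complexity_scaled_reward(criteria, alpha=0.1):
    """Placeholder shaping: scale partial credit up for rubrics with more criteria.

    The log1p form and alpha are illustrative assumptions, not the paper's formula.
    """
    scale = 1.0 + alpha * math.log1p(len(criteria))
    return partial_credit(criteria) * scale

# Hypothetical judge verdicts for one answer against a four-criterion rubric.
verdicts = [
    Criterion(points=5, met=True),
    Criterion(points=3, met=False),
    Criterion(points=2, met=True),
    Criterion(points=-4, met=False),
]
print(partial_credit(verdicts))            # 7 of 10 positive points earned
print(complexity_scaled_reward(verdicts))  # same score, scaled by rubric size
```

The point of the sketch is the shape of the pipeline: binary verdicts go in, a graded scalar comes out, and that scalar is what the policy optimizer sees.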

In practice

Implement continuous reward functions for multi-criteria RL.
Use SFT as a warm-start for structured LLM outputs.
Filter and augment domain-specific datasets for relevance.
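
For the dataset-filtering step, one simple starting point is a keyword pass over the prompts. The summary does not say how the heart-related subset was selected, so the term list, the record layout (`{"prompt": ...}`), and the function name below are all hypothetical. Keyword matching is crude and would normally be followed by manual or model-assisted review.

```python
import re

# Hypothetical term list; word boundaries keep "heart" from matching "heartburn".
HEART_PATTERN = re.compile(
    r"\b(?:heart|cardi\w*|arrhythmi\w*|angina|myocardial)\b",
    re.IGNORECASE,
)

def is_heart_related(text):
    """Return True if the text mentions any of the assumed heart-related terms."""
    return HEART_PATTERN.search(text) is not None

# Hypothetical records standing in for a medical QA dataset.
records = [
    {"prompt": "Is chest pain after exercise a sign of angina?"},
    {"prompt": "I get heartburn after spicy meals. What helps?"},
    {"prompt": "What does a cardiologist check during a stress test?"},
]

heart_subset = [r for r in records if is_heart_related(r["prompt"])]
print(len(heart_subset))  # 2
```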

Topics

Large Language Models
Medical Question Answering
Reinforcement Learning
Group Relative Policy Optimization
Rubric-based Rewards
Healthcare AI
Qwen3-14B

Code references

INQUIRELAB/variance-aware-rubric-rewards-grpo

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.