Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Health & Medical Research, Medical Devices & Health Technology) · Depth: Expert, extended

Summary

A variance-aware reward framework from Arash Ahmadi et al. improves heart-focused medical question answering in large language models (LLMs) trained with Group Relative Policy Optimization (GRPO). It extends rubric-based supervision by replacing weighted binary criterion aggregation and single Likert-style scoring with continuous analytical reward functions, which give richer and more stable optimization signals when feedback is sparse and spans multiple criteria. Starting from a Qwen3-14B base model, the best GRPO variant reached 0.502 accuracy and 0.668 F1 on a held-out heart-related HealthBench subset, up from the base model's 0.362 accuracy and 0.532 F1 (absolute gains of 0.140 and 0.136). That is competitive with the much larger GPT-OSS-120B (0.508 accuracy, 0.674 F1), and the optimized 14B-parameter model remains deployable on a single workstation GPU such as the NVIDIA RTX 6000 PRO.

Key takeaway

For machine learning engineers building medical LLMs: if sparse rewards are stalling reinforcement learning on multi-criteria clinical tasks, consider variance-aware rubric rewards. On heart-focused QA the approach raised accuracy from 0.362 to 0.502, letting a smaller model such as Qwen3-14B perform comparably to much larger frontier models while remaining deployable on a single workstation GPU. Design continuous reward functions that account for partial credit and rubric complexity.

Key insights

Continuous, variance-aware rubric rewards enable stable reinforcement learning for LLMs in complex, multi-criteria medical QA tasks.
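
Why reward variance matters for GRPO can be illustrated with a minimal sketch. GRPO scores each sampled response relative to the other responses drawn for the same prompt, so a group whose rewards are all identical yields no learning signal. The function name, the `eps` constant, and the example reward values below are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    This is the usual group-normalized advantage used with GRPO:
    (reward - group mean) / (group std + eps). `eps` is an assumed
    small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# All-or-nothing scoring: every response misses the bar, so the group has
# zero variance and every advantage is zero (no gradient signal).
binary_rewards = [0.0, 0.0, 0.0, 0.0]

# Continuous scoring of the same four responses (made-up values): partial
# credit separates them, so the advantages are non-zero.
continuous_rewards = [0.25, 0.40, 0.10, 0.55]

print(group_relative_advantages(binary_rewards))
print(group_relative_advantages(continuous_rewards))
```

With the continuous rewards, the response that satisfies the most criteria receives the largest positive advantage even though none of the four would pass a strict all-or-nothing threshold.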

Method

The method has two stages: supervised fine-tuning to teach structured output, then GRPO. An LLM judge makes a binary decision for each rubric criterion, and those decisions are transformed into a continuous, variance-aware reward (Complexity-aware or Hybrid) that drives policy optimization.
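
The step from per-criterion binary decisions to a single continuous reward can be sketched as follows. This summary does not give the paper's Complexity-aware or Hybrid formulas, so the code is only an illustration under stated assumptions: `weighted_fraction` is a plain weighted aggregation of the judge's decisions, and `complexity_scaled_reward` is a hypothetical variant that softens the score on longer rubrics. The `Criterion` type, both function names, and the exponent rule are invented for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Criterion:
    points: float  # positive for desired behavior, negative for a penalty
    met: bool      # binary decision returned by the LLM judge


def weighted_fraction(criteria):
    """Baseline aggregation: earned points over achievable positive points."""
    total_positive = sum(c.points for c in criteria if c.points > 0)
    if total_positive == 0:
        return 0.0
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / total_positive))


def complexity_scaled_reward(criteria):
    """Hypothetical complexity-aware reward (NOT the paper's formula).

    Raises the weighted fraction to the power 1 / (1 + ln(n)), where n is
    the number of criteria, so partial credit on long rubrics is not
    compressed toward zero. With a single criterion it equals the baseline.
    """
    if not criteria:
        return 0.0
    frac = weighted_fraction(criteria)
    exponent = 1.0 / (1.0 + math.log(len(criteria)))
    return frac ** exponent


# Made-up rubric: two criteria met, one missed, one penalty avoided.
rubric = [
    Criterion(points=5, met=True),
    Criterion(points=5, met=False),
    Criterion(points=-2, met=False),
    Criterion(points=3, met=True),
]
print(weighted_fraction(rubric), complexity_scaled_reward(rubric))
```

Either function returns a value in [0, 1] that can be used as the per-response reward in a group-relative update; which shaping works best is an empirical question that the source article addresses with its own reward definitions.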

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.