GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning

2026-06-01 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

GRZO (Group-Relative Zeroth-Order) is a memory-efficient alternative to backpropagation for fine-tuning large language models that targets the high variance typical of zeroth-order (ZO) optimization. It draws one pseudo-independent perturbation per mini-batch example and aggregates the per-example losses through group-relative normalization. This raises the number of gradient directions to the batch size at no additional forward-pass cost while preserving inference-level memory. GRZO is proven to be directionally unbiased, with variance that shrinks in proportion to the batch size, which yields a tighter nonconvex convergence bound than MeZO. In evaluations on RoBERTa-large, Llama3-8B, and OPT-13B, GRZO improved average accuracy on Llama3-8B by +3.0 over MeZO while using 23% less peak GPU memory. As a drop-in replacement for the MeZO core, it also improved sparse, low-rank, and quantized ZO variants by +6.0 on average.

Key takeaway

For Machine Learning Engineers fine-tuning large language models under memory constraints, GRZO is a backpropagation-free option worth evaluating. Relative to MeZO, it reports +3.0 average accuracy on Llama3-8B and 23% lower peak GPU memory, which makes fine-tuning and experimentation more practical on resource-limited hardware.

Key insights

GRZO reduces gradient-estimation variance in zeroth-order optimization as the batch size grows, while keeping LLM fine-tuning at inference-level memory.

Principles

Zeroth-order optimization offers memory efficiency for LLMs.
Group-relative normalization boosts effective gradient-direction count.
Variance reduction in ZO improves nonconvex convergence bounds.
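
As a point of reference for the first principle, a MeZO-style two-point step can be sketched in plain Python. This is a minimal illustration, not the authors' code: the helper names (perturb, zo_step, quadratic), the toy objective, and the values of eps and lr are assumptions chosen for the example. The memory saving comes from regenerating the random direction from its seed instead of storing it, so a step needs only forward evaluations and one scalar.

```python
import random


def perturb(theta, seed, scale):
    """Add scale * z to theta in place, regenerating z from the seed."""
    rng = random.Random(seed)
    for i in range(len(theta)):
        theta[i] += scale * rng.gauss(0.0, 1.0)


def zo_step(theta, loss_fn, seed, eps=1e-3, lr=1e-2):
    """One two-point zeroth-order step using only forward evaluations."""
    perturb(theta, seed, +eps)           # theta + eps * z
    loss_plus = loss_fn(theta)
    perturb(theta, seed, -2.0 * eps)     # theta - eps * z
    loss_minus = loss_fn(theta)
    perturb(theta, seed, +eps)           # back to the original theta
    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    # Descent update theta -= lr * projected_grad * z, with z regenerated.
    perturb(theta, seed, -lr * projected_grad)
    return projected_grad


def quadratic(theta):
    """Toy objective (hypothetical), minimized at the origin."""
    return sum(x * x for x in theta)


theta = [1.0, -2.0, 0.5]
start = quadratic(theta)
for step in range(500):
    zo_step(theta, quadratic, seed=step)
final = quadratic(theta)
print(f"loss before: {start:.4f}, loss after: {final:.6f}")
```

Each step here uses a single random direction, which is the source of the high variance that GRZO is described as addressing.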

Method

GRZO draws one pseudo-independent perturbation per mini-batch example, then aggregates per-example losses using group-relative normalization.
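
One possible reading of this step is sketched below in plain Python. It is an illustration under stated assumptions, not the published algorithm: the summary does not give the normalization formula, so the sketch assumes a mean/standard-deviation normalization of the per-example losses within the batch, and it assumes directions are regenerated from seeds to avoid storing them. The names grzo_like_step, group_relative, direction, and squared_error, the toy data, and the values of eps and lr are all hypothetical.

```python
import random
import statistics


def direction(seed, dim):
    """Regenerate the Gaussian direction z that belongs to a seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def group_relative(losses, tiny=1e-8):
    """Center and scale losses within the group (assumed mean/std form)."""
    mean = statistics.fmean(losses)
    std = statistics.pstdev(losses)
    return [(loss - mean) / (std + tiny) for loss in losses]


def grzo_like_step(theta, batch, loss_fn, seeds, eps=0.1, lr=0.05):
    """One update with one perturbed forward evaluation per example."""
    dim = len(theta)
    losses = []
    for example, seed in zip(batch, seeds):
        z = direction(seed, dim)
        perturbed = [t + eps * zi for t, zi in zip(theta, z)]
        losses.append(loss_fn(perturbed, example))
    weights = group_relative(losses)
    # Below-average loss gives a negative weight, so theta moves along
    # that example's direction; above-average loss moves it away.
    for weight, seed in zip(weights, seeds):
        z = direction(seed, dim)  # regenerated, not stored
        for j in range(dim):
            theta[j] -= lr * weight * z[j] / len(batch)
    return losses


def squared_error(theta, target):
    """Toy per-example loss (hypothetical): squared distance to a target."""
    return sum((t - x) ** 2 for t, x in zip(theta, target))


def mean_loss(theta, batch):
    return statistics.fmean(squared_error(theta, x) for x in batch)


batch = [[1.0 + 0.01 * k, -1.0, 0.5] for k in range(8)]
theta = [0.0, 0.0, 0.0]
start_loss = mean_loss(theta, batch)
for step in range(200):
    seeds = [step * len(batch) + i for i in range(len(batch))]
    grzo_like_step(theta, batch, squared_error, seeds)
final_loss = mean_loss(theta, batch)
print(f"mean loss before: {start_loss:.4f}, after: {final_loss:.4f}")
```

In this sketch a batch of eight examples contributes eight directions per step while each example is still evaluated once, which is the sense in which the direction count scales with batch size. The toy value of eps is deliberately large so that the perturbation signal is visible above the differences between examples; it is not a recommended setting.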

In practice

Apply GRZO to fine-tune Llama3-8B with 23% lower peak GPU memory than MeZO.
Integrate GRZO as a drop-in replacement for the MeZO core in sparse, low-rank, and quantized ZO variants.

Topics

GRZO
Zeroth-Order Optimization
LLM Fine-tuning
Gradient Estimation
GPU Memory Efficiency
MeZO

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.