LLM Zeroth-Order Fine-Tuning is an Inference Workload

2026-05-27 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, medium

Summary

The paper "LLM Zeroth-Order Fine-Tuning is an Inference Workload" by Zelin Li and Caiwen Ding argues that zeroth-order (ZO) fine-tuning of large language models (LLMs) is an inference-dominated workload: most of its work is repeated scoring of the model under nearby parameter states. The authors therefore execute this repeated scoring phase through a serving runtime, specifically vLLM. On OPT-13B SST-2, a 20k-step LoZO run completed in an estimated 0.51 training hours versus 4.15 hours for the official LoZO baseline, an 8.13x speedup, while maintaining 0.922 final evaluation accuracy and 0.931 final full-validation accuracy. Core-step scaling experiments from OPT-1.3B to OPT-13B showed 2.34x–7.72x speedups, and the same runtime reorganization accelerated a MeZO-style experiment by up to 2.55x. The authors suggest this is a practical path toward inference-time training.
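To make "repeated scoring under nearby parameter states" concrete, here is a minimal sketch of a two-point ZO step in the MeZO style. It is not the authors' implementation: the quadratic `loss` is a toy stand-in for a forward-only scoring call, and the names `zo_step` and `run` and all hyperparameter values are assumptions chosen for illustration.

```python
import random


def loss(theta):
    # Toy stand-in (assumption) for one forward-only scoring call on a model.
    return sum((t - 1.0) ** 2 for t in theta)


def zo_step(theta, lr=0.05, eps=1e-3, seed=0):
    # Regenerate the random direction z from a seed, so only the seed
    # (not z itself) would need to be stored or communicated.
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in theta]

    # Two nearby parameter states: theta + eps*z and theta - eps*z.
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]

    # Two scoring calls give a scalar estimate of the directional derivative.
    g = (loss(plus) - loss(minus)) / (2 * eps)

    # Move along z, scaled by the estimate. No backward pass is involved.
    return [t - lr * g * zi for t, zi in zip(theta, z)]


def run(steps=500, dim=4):
    theta = [0.0] * dim
    for step in range(steps):
        theta = zo_step(theta, seed=step)
    return theta
```

Every step consists of two scoring calls and a cheap parameter update, which is why the workload resembles inference more than backpropagation-based training.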

Key takeaway

Machine learning engineers optimizing LLM fine-tuning should re-evaluate zeroth-order methods as inference workloads. Running ZO fine-tuning through a serving runtime such as vLLM yielded speedups of up to 8.13x on OPT-13B while maintaining accuracy. This approach enables faster experimentation and may make lightweight, inference-time model adaptation feasible, reducing the need for separate, resource-intensive training jobs. Dynamic adapter states are worth exploring for future adaptation strategies.

Key insights

LLM zeroth-order fine-tuning can be re-architected as an inference workload for substantial speedups.

Principles

ZO fine-tuning is inference-dominated.
Workload-runtime mismatch hinders ZO efficiency.
Dynamic adapter states enable inference-time training.

Method

Reorganize LLM zeroth-order fine-tuning by executing its repeated scoring phase through a serving runtime like vLLM, treating ZO updates as dynamic adapter states.
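The split this method describes can be sketched as follows: the optimizer only produces small adapter states, and a runtime with a frozen base only scores them. This is an illustration under stated assumptions, not vLLM's API and not the authors' code. `AdapterState`, `ScoringRuntime`, `score_batch`, and `zo_adapter_step` are hypothetical names, and the squared-error score is a toy objective.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterState:
    # Hypothetical container: a small delta applied on top of frozen base weights.
    delta: tuple


class ScoringRuntime:
    """Hypothetical stand-in for a serving runtime: frozen base, scoring only."""

    def __init__(self, base, target):
        self.base = tuple(base)      # never modified after construction
        self.target = tuple(target)  # toy objective (assumption)
        self.calls = 0

    def score_batch(self, adapters):
        # Score several adapter states against the same frozen base in one call.
        self.calls += 1
        return [
            sum((b + d - t) ** 2 for b, d, t in zip(self.base, a.delta, self.target))
            for a in adapters
        ]


def zo_adapter_step(runtime, adapter, lr=0.05, eps=1e-3, seed=0):
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in adapter.delta]

    # The two perturbed states are ordinary adapters from the runtime's view.
    plus = AdapterState(tuple(d + eps * zi for d, zi in zip(adapter.delta, z)))
    minus = AdapterState(tuple(d - eps * zi for d, zi in zip(adapter.delta, z)))
    loss_plus, loss_minus = runtime.score_batch([plus, minus])

    # Two-point ZO estimate, then an update that yields the next adapter state.
    g = (loss_plus - loss_minus) / (2 * eps)
    return AdapterState(tuple(d - lr * g * zi for d, zi in zip(adapter.delta, z)))
```

In this sketch the base weights are never written to. Each update just swaps in a new adapter state, which is the property that would let a serving runtime keep its usual batching and scheduling.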

In practice

Use vLLM for ZO fine-tuning acceleration.
Expect speedups of up to 8.13x, as reported for LoZO on OPT-13B SST-2.
Consider inference-time training for lightweight adaptation.
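As a quick sanity check on the headline number, the speedup can be recomputed from the two wall-clock figures quoted in the summary (4.15 baseline hours and an estimated 0.51 hours with vLLM). The helper names below are illustrative, not from the paper.

```python
def speedup(baseline_hours, accelerated_hours):
    # Ratio of baseline wall-clock time to accelerated wall-clock time.
    return baseline_hours / accelerated_hours


def hours_saved(baseline_hours, accelerated_hours):
    # Absolute time saved per run.
    return baseline_hours - accelerated_hours


ratio = speedup(4.15, 0.51)
saved = hours_saved(4.15, 0.51)
print(f"speedup: {ratio:.2f}x, saved: {saved:.2f} h")
```

The ratio of the rounded hours comes out slightly above the reported 8.13x. The gap is small enough that it is plausibly explained by the paper computing the speedup from unrounded timings, though that is an assumption.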

Topics

Large Language Models
Zeroth-Order Optimization
LLM Fine-Tuning
Inference Optimization
vLLM
LoRA

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.