LLM Zeroth-Order Fine-Tuning is an Inference Workload
Summary
The paper "LLM Zeroth-Order Fine-Tuning is an Inference Workload" by Zelin Li and Caiwen Ding argues that zeroth-order (ZO) fine-tuning of large language models (LLMs) is dominated by repeated scoring of the model under nearby parameter states, which makes it an inference workload rather than a conventional training workload. The authors therefore execute this repeated scoring phase through a serving runtime, specifically vLLM. On OPT-13B SST-2, a 20k-step LoZO run completed in an estimated 0.51 training hours, compared with 4.15 hours for the official LoZO baseline, an 8.13x speedup, while maintaining 0.922 final evaluation accuracy and 0.931 final full-validation accuracy. Core-step scaling experiments from OPT-1.3B to OPT-13B showed speedups of 2.34x to 7.72x. The same runtime reorganization accelerated a MeZO-style experiment by up to 2.55x, which suggests a practical path toward inference-time training.
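As a quick sanity check on the headline figure, the speedup can be recomputed from the two reported wall-clock numbers. This is only arithmetic on the rounded values quoted in the summary; the variable names are illustrative, and the small gap to the reported 8.13x is assumed to come from rounding of the published hours.

```python
# Reported figures for the 20k-step LoZO run on OPT-13B SST-2 (rounded, from the summary).
baseline_hours = 4.15   # official LoZO baseline
served_hours = 0.51     # estimated training hours through the serving runtime

speedup = baseline_hours / served_hours
hours_saved = baseline_hours - served_hours

print(f"speedup ~ {speedup:.2f}x, hours saved ~ {hours_saved:.2f}")
```

With the rounded inputs the ratio comes out near 8.14x, in line with the reported 8.13x.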
Key takeaway
Machine learning engineers optimizing LLM fine-tuning should re-evaluate zeroth-order methods as inference workloads. In the paper, running ZO fine-tuning through a serving runtime such as vLLM yielded speedups of up to 8.13x on OPT-13B while maintaining accuracy. This approach makes experimentation more efficient and could enable lightweight, inference-time model adaptation, reducing the need for separate, resource-intensive training jobs. Dynamic adapter states are worth exploring as a basis for future adaptation strategies.
Key insights
LLM zeroth-order fine-tuning can be re-architected as an inference workload for substantial speedups.
Principles
- ZO fine-tuning is inference-dominated.
- Workload-runtime mismatch hinders ZO efficiency.
- Dynamic adapter states enable inference-time training.
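To make the first principle concrete, here is a minimal sketch of a two-point zeroth-order update in the style of SPSA, the kind of estimator MeZO-style methods build on. It is a toy, not the paper's code: the quadratic `loss` is a hypothetical stand-in for an LLM forward pass, and `zo_step`, `lr`, and `eps` are illustrative names and values. Each step needs only two loss evaluations at nearby parameter states and no backward pass.

```python
import random

def loss(theta):
    # Toy stand-in for a forward-pass loss (assumption): quadratic with minimum at `target`.
    target = [1.0, -2.0, 3.0]
    return sum((t - g) ** 2 for t, g in zip(theta, target))

def zo_step(theta, lr=0.05, eps=1e-3, seed=0):
    """One two-point zeroth-order update using only loss evaluations."""
    rng = random.Random(seed)
    # Random direction; it can be regenerated from the seed instead of being stored.
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    # Two "scoring" calls at nearby parameter states: this is the inference-like work.
    projected_grad = (loss(plus) - loss(minus)) / (2 * eps)
    # Move along the sampled direction, scaled by the scalar estimate.
    return [t - lr * projected_grad * zi for t, zi in zip(theta, z)]

theta = [0.0, 0.0, 0.0]
initial = loss(theta)
for step in range(500):
    theta = zo_step(theta, seed=step)
final = loss(theta)

print(f"loss went from {initial:.3f} to {final:.3e}")
```

All the compute in `zo_step` sits in the two `loss` calls, which is why a runtime built for fast scoring is a natural fit for this loop.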
Method
Reorganize LLM zeroth-order fine-tuning by executing its repeated scoring phase through a serving runtime like vLLM, treating ZO updates as dynamic adapter states.
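One way to picture "ZO updates as dynamic adapter states" is to assume the perturbation is a low-rank product, as in LoRA-style adapters. The sketch below rests on that assumption and is not the authors' implementation or any vLLM API. All names (`forward_dense`, `forward_adapter`, `U`, `V`, `eps`) are hypothetical. It checks that scoring with perturbed weights `W + eps * U V^T` matches scoring with the frozen base weights plus a small adapter term, so the base weights never need to be rewritten between scoring calls.

```python
import random

def matvec(M, x):
    # Plain matrix-vector product on nested lists.
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def make_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def forward_dense(W, U, V, eps, x):
    # Materialize the perturbed weights W + eps * U V^T, then apply them.
    rank = len(U[0])
    W_pert = [
        [W[i][j] + eps * sum(U[i][k] * V[j][k] for k in range(rank))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]
    return matvec(W_pert, x)

def forward_adapter(W, U, V, eps, x):
    # Keep W frozen; add the low-rank term as a separate adapter path: W x + eps * U (V^T x).
    rank = len(U[0])
    vt_x = [sum(V[j][k] * x[j] for j in range(len(x))) for k in range(rank)]
    delta = matvec(U, vt_x)
    return [b + eps * d for b, d in zip(matvec(W, x), delta)]

rng = random.Random(0)
d_out, d_in, rank, eps = 4, 5, 2, 1e-3
W = make_matrix(d_out, d_in, rng)   # frozen base weights
U = make_matrix(d_out, rank, rng)   # low-rank perturbation factors (assumed form)
V = make_matrix(d_in, rank, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

# The +eps and -eps scoring states differ only in the adapter scale.
max_diff = max(
    abs(a - b)
    for sign in (+1, -1)
    for a, b in zip(forward_dense(W, U, V, sign * eps, x),
                    forward_adapter(W, U, V, sign * eps, x))
)
print(f"max difference between dense and adapter paths: {max_diff:.2e}")
```

Under this assumption, switching between the two perturbed states of a ZO step amounts to swapping a small adapter, which serving runtimes with adapter support are designed to do cheaply.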
In practice
- Use vLLM for ZO fine-tuning acceleration.
- Use the reported 8.13x speedup on OPT-13B SST-2 (20k-step LoZO run) as a reference point.
- Consider inference-time training for lightweight adaptation.
Topics
- Large Language Models
- Zeroth-Order Optimization
- LLM Fine-Tuning
- Inference Optimization
- vLLM
- LoRA
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.