Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

Source: Machine Learning · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by separating weight magnitude from direction. Its forward pass needs the row-wise norm of W + sBA, and standard implementations such as Hugging Face PEFT compute that norm by materializing the dense product BA. At d_in = 8192 and rank r = 384 in bf16, this takes approximately 512 MB of transient memory for a single module, which makes high-rank DoRA impractical on common single-GPU setups. The authors introduce two systems contributions. The first is a factored norm that computes the squared norm through O(d_out r + r^2) intermediates, eliminating the dense product. The second is a set of fused Triton kernels that collapse the four-kernel DoRA composition into a single pass. Across various 8-32B vision-language models on NVIDIA GPUs, the fused implementation achieves 1.5-2.0x faster inference and 1.5-1.9x faster gradient computation, with up to 7 GB lower peak VRAM.
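To make the O(d_out r + r^2) claim concrete, the squared norm of one row can be expanded as follows. This is a sketch under assumed LoRA-style shapes that the summary does not spell out: W is [d_out, d_in], B is [d_out, r], A is [r, d_in], s is a scalar, and w_i and b_i denote row i of W and B.

```latex
\begin{aligned}
\lVert w_i + s\, b_i A \rVert_2^2
  &= \underbrace{\lVert w_i \rVert_2^2}_{\text{base}}
   + \underbrace{2s\, \langle (W A^\top)_i,\; b_i \rangle}_{\text{cross}}
   + \underbrace{s^2\, b_i\, (A A^\top)\, b_i^\top}_{\text{Gram}}
\end{aligned}
```

Under these assumptions the only intermediates are W A^T and B (A A^T), each of shape [d_out, r], and the Gram matrix A A^T of shape [r, r], so the [d_out, d_in] product BA never has to be formed.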

Key takeaway

For AI engineers deploying high-rank DoRA models, factored norms and fused Triton kernels are the optimizations to adopt first. Together they lower peak VRAM by up to 7 GB and deliver 1.5-2.0x faster inference and 1.5-1.9x faster gradient computation, making high-rank DoRA configurations that were previously impractical viable on single-GPU setups.

Key insights

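The factored norm removes the dense BA product from the norm computation, and the fused Triton kernels replace a four-kernel composition with a single pass. Together they cut peak VRAM by up to 7 GB and speed up inference by 1.5-2.0x for high-rank DoRA.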

Principles

Method

Decompose the squared row-wise norm of W + sBA into base, cross, and Gram terms, then collapse the four-kernel DoRA composition into a single, numerically stable Triton kernel.
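The decomposition step can be sketched in a few lines of NumPy. This is an illustrative reference under stated assumptions, not the authors' Triton implementation: the function names are hypothetical, the shapes follow the usual LoRA convention (W is [d_out, d_in], B is [d_out, r], A is [r, d_in]), and `dora_forward` is a plain unfused composition that shows only what a fused kernel would have to compute.

```python
import numpy as np


def dense_row_norm(W, A, B, s):
    """Reference path: materializes the full [d_out, d_in] product B @ A."""
    return np.linalg.norm(W + s * (B @ A), axis=1)


def factored_row_norm(W, A, B, s):
    """Row norms of W + s*B@A from base, cross, and Gram terms.

    Intermediates are at most [d_out, r] or [r, r]; B @ A is never formed.
    """
    base = np.einsum("ij,ij->i", W, W)        # ||w_i||^2, shape [d_out]
    U = W @ A.T                               # [d_out, r]
    cross = np.einsum("ik,ik->i", U, B)       # <w_i, b_i A>
    G = A @ A.T                               # Gram matrix, [r, r]
    gram = np.einsum("ik,ik->i", B @ G, B)    # ||b_i A||^2
    sq = base + 2.0 * s * cross + (s * s) * gram
    # Clamp small negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(sq, 0.0))


def dora_forward(x, W, A, B, s, m):
    """Unfused DoRA forward (assumed form):
    y = (m / row_norm) * (x W^T + s (x A^T) B^T), with m of shape [d_out].
    """
    scale = m / factored_row_norm(W, A, B, s)         # [d_out]
    return (x @ W.T + s * (x @ A.T) @ B.T) * scale    # broadcast over d_out
```

Calling `factored_row_norm(W, A, B, s)` on small random float64 matrices should agree with `dense_row_norm` up to round-off. In low precision such as bf16 the two paths can differ more, which is one reason the numerically stable formulation mentioned above matters.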

Best for: AI Scientist, Research Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.