Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

2026-03-23 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by separating weight magnitude from direction. Its standard implementation in frameworks such as Hugging Face PEFT materializes the dense product BA to compute the row-wise norm of W + sBA. That step is memory-intensive: at d_in = 8192 and rank r = 384 in bf16, a single module needs approximately 512 MB of transient memory, which makes high-rank DoRA impractical on common single-GPU setups. The authors introduce two systems contributions: a factored norm that decomposes the squared norm into terms needing only O(d_out r + r^2) intermediates, which eliminates the dense product, and fused Triton kernels that collapse the four-kernel DoRA composition into a single pass. The fused implementation achieves 1.5-2.0x faster inference and 1.5-1.9x faster gradient computation, with up to 7 GB lower peak VRAM, across 8-32B vision-language models on NVIDIA GPUs.

Key takeaway

For AI engineers deploying high-rank DoRA models, factored norms and fused Triton kernels are the key optimizations. Together they cut peak VRAM by up to 7 GB and speed up inference by 1.5-2.0x and gradient computation by 1.5-1.9x, which makes high-rank DoRA configurations that were previously impractical viable on single-GPU setups.

Key insights

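The factored norm removes the dense BA product from DoRA's norm computation, and the fused Triton kernel replaces four kernel launches with one pass. Together they lower peak VRAM and speed up both inference and gradient computation for high-rank DoRA.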

Principles

Decouple weight magnitude from direction.
Avoid dense product materialization.
Fuse kernels to reduce memory traffic.

Method

Decompose the squared norm into base, cross, and Gram terms, then collapse the four-kernel DoRA composition into a single, numerically stable Triton kernel.
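The three terms come from expanding the squared row norm. The following is a sketch under assumed notation that is not taken from the original article: W is d_out x d_in, B is d_out x r, A is r x d_in, and w_i and b_i denote the i-th rows of W and B, written as row vectors.

```latex
\begin{aligned}
\lVert w_i + s\, b_i A \rVert_2^2
  &= \underbrace{\lVert w_i \rVert_2^2}_{\text{base}}
   + \underbrace{2s\, b_i \,\bigl[(W A^\top)_i\bigr]^\top}_{\text{cross}}
   + \underbrace{s^2\, b_i \,(A A^\top)\, b_i^\top}_{\text{Gram}}
\end{aligned}
```

Under these shapes, W A^T is d_out x r and A A^T is r x r, which is consistent with the O(d_out r + r^2) intermediates stated in the summary. The d_out x d_in product BA never has to be formed.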

In practice

Use factored norms for DoRA.
Implement fused Triton kernels.
Apply to vision-language models.
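As an illustration of the first item, here is a minimal NumPy sketch of a factored row norm, checked against the dense reference. It is not the authors' Triton implementation and does no kernel fusion. The function names, the shape convention (W is d_out x d_in, B is d_out x r, A is r x d_in), and the clamp before the square root are assumptions made for this example.

```python
import numpy as np


def dense_row_norm(W, A, B, s):
    """Reference path: materializes the full d_out x d_in product B @ A."""
    return np.linalg.norm(W + s * (B @ A), axis=1)


def factored_row_norm(W, A, B, s):
    """Row-wise norm of W + s * B @ A without forming B @ A.

    Hypothetical helper for illustration; shapes are assumed to be
    W: (d_out, d_in), B: (d_out, r), A: (r, d_in).
    """
    # Base term: squared norm of each row of W, shape (d_out,).
    base = np.einsum("ij,ij->i", W, W)

    # Cross term: 2s * <b_i, (W A^T)_i>, using a (d_out, r) intermediate.
    U = W @ A.T
    cross = 2.0 * s * np.einsum("ij,ij->i", B, U)

    # Gram term: s^2 * b_i (A A^T) b_i^T, using an (r, r) intermediate.
    G = A @ A.T
    gram = (s ** 2) * np.einsum("ij,ij->i", B @ G, B)

    # Assumption: clamp at zero so rounding error cannot yield sqrt of a negative.
    return np.sqrt(np.maximum(base + cross + gram, 0.0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_out, d_in, r, s = 64, 48, 8, 0.5
    W = rng.standard_normal((d_out, d_in))
    A = rng.standard_normal((r, d_in))
    B = rng.standard_normal((d_out, r))
    diff = np.abs(dense_row_norm(W, A, B, s) - factored_row_norm(W, A, B, s))
    print("max abs difference:", diff.max())
```

The largest temporaries in the factored path are the d_out x r arrays W A^T and B G, plus the r x r Gram matrix. The sketch runs in float64; at lower precision such as bf16, cancellation between the three terms would need more care than the simple clamp used here.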

Topics

DoRA
Low-Rank Adaptation
Memory Optimization
Fused Kernels
Vision-Language Models

Best for: AI Scientist, Research Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.