S2LC – 100 LoRA adapters in 3.59ms by reconstructing weights in GPU registers, never writing to HBM

Source: Machine Learning ML & Generative AI News · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

S2LC (Shared Spectral Low-Rank Compression) is a method for serving many LoRA adapters efficiently by exploiting spectral structure shared across neural network modules. It computes a shared basis matrix V_common once per layer and encodes each module's unique contribution U_k into compact codebooks at approximately 3 bits per element. During inference, a fused Triton kernel reconstructs U_k values directly in GPU registers during tiled GEMM, so no intermediate results are written to HBM. Overall, the method achieves 10.1x memory compression over standard LoRA and a 3.59 ms forward-pass latency for 100 concurrent adapters; the absence of intermediate HBM writes was verified with NVIDIA Nsight Compute. Proposed future extensions cover MoE expert compression, KV cache compression, and variable-depth serving.
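The reconstruct-on-the-fly idea can be illustrated on the CPU with a small pure-Python sketch. It is not the S2LC Triton kernel. It assumes the adapter update factors as U_k times the transpose of V_common, and it uses a uniform 8-level codebook as a stand-in, since the summary does not say how the codebooks are built. The names quantize_3bit, fused_matvec, and TILE are hypothetical. Only the overall pattern is meant to carry over: 3-bit codes are decoded one tile at a time inside the multiply loop, and the full U_k is never stored.

```python
# Illustrative sketch only: a uniform 3-bit codebook and tile-wise decoding.
# quantize_3bit, fused_matvec and TILE are hypothetical names, not S2LC APIs.

TILE = 4  # elements of U_k decoded at a time (stand-in for a GEMM tile)


def quantize_3bit(values):
    """Map floats to 3-bit codes (0..7) plus an 8-entry uniform codebook."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 7 or 1.0  # fall back to 1.0 when all values are equal
    codebook = [lo + i * step for i in range(8)]
    codes = [min(7, max(0, round((v - lo) / step))) for v in values]
    return codes, codebook


def fused_matvec(codes, codebook, n_cols, z):
    """Compute y = U_k @ z while decoding U_k tile by tile.

    codes is the row-major 3-bit encoding of U_k (n_rows x n_cols).
    z is assumed to be the input already projected onto the shared basis.
    """
    n_rows = len(codes) // n_cols
    y = [0.0] * n_rows
    for row in range(n_rows):
        base = row * n_cols
        for start in range(0, n_cols, TILE):
            stop = min(start + TILE, n_cols)
            # Decoded values exist only in this short-lived local list,
            # the CPU analogue of keeping them in registers.
            tile = [codebook[c] for c in codes[base + start:base + stop]]
            y[row] += sum(u * zj for u, zj in zip(tile, z[start:stop]))
    return y


if __name__ == "__main__":
    u_k = [[0.1 * i - 0.3 * j for j in range(8)] for i in range(2)]
    flat = [v for row in u_k for v in row]
    codes, codebook = quantize_3bit(flat)
    z = [1.0, -0.5, 0.25, 0.0, 2.0, -1.0, 0.5, 0.75]
    print(fused_matvec(codes, codebook, 8, z))
```

In a GPU kernel the decoded tile would feed the matrix-multiply accumulation directly; here the local list plays that role, so the only persistent per-adapter data are the codes and the eight codebook entries.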

Key takeaway

S2LC (Shared Spectral Low-Rank Compression) serves 100 concurrent LoRA adapters with a 3.59 ms forward-pass latency by reconstructing weights directly in GPU registers. It achieves 10.1x memory compression over standard LoRA and eliminates intermediate HBM writes using a fused Triton kernel and CUDA Graph capture. For multi-adapter inference this reduces both memory footprint and memory traffic, with proposed extensions to MoE expert compression and KV cache compression.
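To see where a compression ratio of this size could come from, the following back-of-the-envelope calculation compares per-layer storage for standard LoRA with a shared-basis layout. Every number in it is an assumption made for illustration: the layer dimensions, the rank, fp16 storage for the uncompressed parts, one 8-entry codebook per adapter, and the shapes assigned to V_common and U_k. The summary does not state the configuration behind the reported 10.1x figure, and the function names are hypothetical.

```python
# Back-of-the-envelope storage comparison for one layer.
# All sizes below are assumed for illustration, not taken from the source.

def lora_bytes(n_adapters, d_in, d_out, rank, bytes_per_elem=2):
    """Standard LoRA: every adapter stores A (rank x d_in) and B (d_out x rank)."""
    return n_adapters * rank * (d_in + d_out) * bytes_per_elem


def shared_basis_bytes(n_adapters, d_in, d_out, rank,
                       bits_per_elem=3, bytes_per_elem=2, codebook_entries=8):
    """Shared-basis layout: one V_common per layer plus coded U_k per adapter."""
    shared = d_in * rank * bytes_per_elem                    # V_common, stored once
    codes = n_adapters * d_out * rank * bits_per_elem / 8    # about 3 bits per element
    codebooks = n_adapters * codebook_entries * bytes_per_elem
    return shared + codes + codebooks


if __name__ == "__main__":
    cfg = dict(n_adapters=100, d_in=4096, d_out=4096, rank=16)  # assumed sizes
    ratio = lora_bytes(**cfg) / shared_basis_bytes(**cfg)
    print(f"compression ratio: {ratio:.2f}x")
```

With these assumed sizes the ratio comes out close to 10x, but that should not be read as confirming the source's setup. The point of the sketch is the structure of the saving: the shared basis is paid for once per layer, so the per-adapter cost is dominated by the roughly 3-bit codes, and the ratio grows toward its limit as more adapters share the same basis.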

Best for: AI Engineer, NLP Engineer, MLOps Engineer, Machine Learning Engineer, Deep Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.