From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression
Summary
SubFit (Submodule-level Fitted residual replacement) is a replacement-based compression method for Large Language Models (LLMs) that drops two constraints of earlier approaches: full-layer granularity and contiguous selection. Instead of removing whole blocks of adjacent layers, it targets individual Attention and FeedForward submodules, which need not be adjacent, and gives each one a lightweight fitted residual bypass. The method operates post-training and needs only calibration data. It was evaluated on ten LLMs (five base and five instruction-tuned) at five sparsity levels from 12.5% to 37.5%, against four replacement-based baselines. Across these settings SubFit achieved the best aggregate perplexity-accuracy trade-off, with the largest gains under aggressive compression. At 25% sparsity it retained 84.6% of dense downstream accuracy with 2.42x perplexity degradation, while the strongest baselines reached 81.6% accuracy and 4.34x degradation. The method also delivers measurable inference speedup and KV-cache savings.
Key takeaway
For Machine Learning Engineers optimizing LLM deployment, SubFit is an alternative to full-layer replacement methods. If full-layer compression costs you too much accuracy or perplexity, consider compressing at the submodule level: in the reported results it gives a better perplexity-accuracy trade-off, especially at higher sparsity. It also reduces inference latency and KV-cache footprint, which helps when deploying on resource-constrained hardware.
Key insights
LLM redundancy is submodule-level and non-contiguous, enabling better compression via fitted residual bypasses.
Principles
- Redundancy in transformers is not contiguous.
- Redundancy varies between Attention and FeedForward.
- Different strategies suit different submodule types.
Method
SubFit selects individual Attention and FeedForward submodules for replacement, without requiring them to be adjacent, and substitutes each with its own lightweight fitted residual bypass. It operates post-training using only calibration data.
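The idea of a fitted residual bypass can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not specify the form of the bypass or how it is fitted, so the sketch assumes a single linear map with bias, fitted by ridge-regularised least squares on calibration activations. The function names `fit_linear_bypass` and `block_with_bypass`, the `ridge` parameter, and the synthetic "submodule" are all hypothetical.

```python
import numpy as np

def fit_linear_bypass(x, y, ridge=1e-4):
    """Fit W, b so that x @ W + b approximates a submodule's output y.

    x: (n_tokens, d) inputs collected from calibration data
    y: (n_tokens, d) what the original submodule adds to the residual stream
    Assumption: a linear bypass fitted by ridge regression; the actual
    SubFit bypass may differ.
    """
    n, d = x.shape
    x_aug = np.hstack([x, np.ones((n, 1))])          # extra column for the bias
    # Regularised normal equations: (X^T X + ridge * I) theta = X^T Y
    # (the bias row is regularised too, which is harmless for a tiny ridge)
    gram = x_aug.T @ x_aug + ridge * np.eye(d + 1)
    theta = np.linalg.solve(gram, x_aug.T @ y)
    return theta[:-1], theta[-1]                      # W: (d, d), b: (d,)

def block_with_bypass(h, w, b):
    """Residual update with the submodule replaced by the bypass."""
    return h + h @ w + b

# Synthetic calibration data standing in for real activations.
rng = np.random.default_rng(0)
d = 16
h = rng.normal(size=(512, d))

# Stand-in for a redundant submodule: mostly linear, slightly nonlinear.
a = rng.normal(size=(d, d)) * 0.1
sub_out = h @ a + 0.01 * np.tanh(h)

w, b = fit_linear_bypass(h, sub_out)
dense_out = h + sub_out                               # original residual update
compressed_out = block_with_bypass(h, w, b)           # submodule replaced
rel_err = np.linalg.norm(dense_out - compressed_out) / np.linalg.norm(dense_out)
print(f"relative error after replacement: {rel_err:.4f}")
```

In a real pre-norm transformer the submodule input would be the normalised hidden state, and the fit would run once per selected submodule; the sketch skips both details to keep the core fitting step visible.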
In practice
- Achieve a better perplexity-accuracy trade-off.
- Enable more aggressive LLM compression.
- Reduce inference latency and KV-cache footprint.
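To make the KV-cache point concrete, here is a rough back-of-the-envelope sketch. It uses the standard cache-size estimate (keys plus values, per attention submodule, per KV head, per token) and assumes that a replaced Attention submodule no longer needs any cached keys or values. The model shape and the number of replaced submodules are hypothetical and are not figures from the article.

```python
def kv_cache_bytes(n_attn, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Approximate KV-cache size: keys and values for each retained Attention submodule."""
    return 2 * n_attn * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 32-layer model with 8 KV heads of dimension 128, fp16 cache.
dense = kv_cache_bytes(n_attn=32, n_kv_heads=8, head_dim=128, seq_len=4096)

# Suppose 10 Attention submodules (not necessarily adjacent) are replaced.
compressed = kv_cache_bytes(n_attn=22, n_kv_heads=8, head_dim=128, seq_len=4096)

saving = 1 - compressed / dense
print(f"dense: {dense / 2**20:.0f} MiB, compressed: {compressed / 2**20:.0f} MiB")
print(f"fraction of KV cache saved: {saving:.4f}")
```

Under these assumptions the saving scales with the share of Attention submodules removed, so a selection that replaces only FeedForward submodules would leave the cache unchanged.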
Topics
- LLM Compression
- Submodule Sparsity
- Transformer Architecture
- Perplexity-Accuracy
- Inference Optimization
- SubFit
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.