From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

SubFit (Submodule-level Fitted residual replacement) is a replacement-based method for compressing Large Language Models (LLMs). Existing replacement-based methods operate on full layers and select them contiguously. SubFit instead targets individual Attention and FeedForward submodules, selected non-contiguously, and gives each a lightweight fitted residual bypass. It operates post-training and needs only calibration data. The evaluation covers ten LLMs (five base and five instruction-tuned) at five sparsity levels from 12.5% to 37.5%, against four replacement-based baselines. SubFit consistently achieved the best aggregate perplexity-accuracy trade-off, with significant gains under aggressive compression. At 25% sparsity, it retained 84.6% of dense downstream accuracy with 2.42x perplexity degradation, compared with 81.6% accuracy and 4.34x degradation for the strongest baselines. The method also delivers measurable inference speedup and KV-cache savings.
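
The KV-cache savings can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes that a replaced Attention submodule no longer stores keys and values, so cache size scales with the number of Attention submodules that remain. The model configuration (32 Attention submodules, 8 KV heads, head dimension 128, 16-bit cache) and the count of six replaced submodules are hypothetical values chosen for illustration, not figures from the article.

```python
def kv_cache_bytes(n_attn, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Approximate KV-cache size: keys and values for every remaining Attention submodule."""
    return 2 * n_attn * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical configuration, for illustration only.
dense = kv_cache_bytes(n_attn=32, n_kv_heads=8, head_dim=128, seq_len=4096)
# Assumption: six Attention submodules are replaced and keep no KV cache.
compressed = kv_cache_bytes(n_attn=32 - 6, n_kv_heads=8, head_dim=128, seq_len=4096)
saving = 1 - compressed / dense

print(f"dense: {dense / 2**20:.0f} MiB, compressed: {compressed / 2**20:.0f} MiB, saving: {saving:.2%}")
```

Under these assumptions, only replaced Attention submodules shrink the cache; replacing FeedForward submodules reduces parameters and compute but leaves the cache unchanged.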

Key takeaway

For Machine Learning Engineers optimizing LLM deployment, SubFit is a compression option worth evaluating. If full-layer replacement methods degrade quality too much at your target sparsity, submodule-level compression offers a better perplexity-accuracy trade-off, especially under aggressive sparsity: at 25%, the reported figures are 84.6% of dense accuracy at 2.42x perplexity degradation, versus 81.6% and 4.34x for the strongest baselines. It also reduces inference latency and KV-cache footprint, which helps when deploying on resource-constrained hardware.

Key insights

LLM redundancy is submodule-level and non-contiguous, enabling better compression via fitted residual bypasses.

Principles

Method

SubFit selects Attention and FeedForward submodules non-contiguously, fitting each with its own lightweight residual bypass. It operates post-training using only calibration data.
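
The idea of a fitted bypass plus non-contiguous selection can be sketched as follows. This is a minimal illustration, not the article's implementation: it assumes the bypass is a linear map fitted by ridge regression on calibration activations, and that submodules are chosen by lowest fitting error. The function names, the ridge value, the selection rule, and the toy calibration data are all hypothetical.

```python
import numpy as np

def fit_linear_bypass(x, f_out, ridge=1e-3):
    """Fit f_out ~ x @ W + b by ridge regression (assumed bypass form).

    x:     (n_tokens, d) submodule inputs from calibration data
    f_out: (n_tokens, d) submodule outputs before the residual add
    """
    x_mean, y_mean = x.mean(axis=0), f_out.mean(axis=0)
    xc, yc = x - x_mean, f_out - y_mean
    d = x.shape[1]
    W = np.linalg.solve(xc.T @ xc + ridge * np.eye(d), xc.T @ yc)
    b = y_mean - x_mean @ W
    return W, b

def relative_error(x, f_out, W, b):
    """Relative Frobenius error of the bypass on the calibration set."""
    pred = x @ W + b
    return float(np.linalg.norm(pred - f_out) / (np.linalg.norm(f_out) + 1e-12))

def select_submodules(calib, budget):
    """Return the `budget` submodules whose bypass fits best (hypothetical rule)."""
    fits = {}
    for name, (x, f_out) in calib.items():
        W, b = fit_linear_bypass(x, f_out)
        fits[name] = (relative_error(x, f_out, W, b), W, b)
    chosen = sorted(fits, key=lambda n: fits[n][0])[:budget]
    return {n: fits[n] for n in chosen}

def bypass_forward(x, W, b):
    """Residual stream update with the submodule replaced by its bypass."""
    return x + x @ W + b

# Toy calibration data: two nearly linear submodules, two strongly nonlinear ones.
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 16))
A = 0.1 * rng.normal(size=(16, 16))
calib = {
    "layer0.attn": (x, x @ A + 0.01 * rng.normal(size=x.shape)),
    "layer0.ffn": (x, np.abs(x)),
    "layer1.attn": (x, np.sin(5 * x)),
    "layer1.ffn": (x, x @ A.T + 0.01 * rng.normal(size=x.shape)),
}
selected = select_submodules(calib, budget=2)
for name, (err, W, b) in selected.items():
    print(name, f"relative error = {err:.3f}")
```

In this toy setup the two selected submodules sit in different layers and are of different types, which is the kind of choice that full-layer, contiguous selection cannot express.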

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.