From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

2026-06-01 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

SubFit (Submodule-level Fitted residual replacement) is a replacement-based method for compressing Large Language Models (LLMs). Existing replacement-based methods work at full-layer granularity and select contiguous blocks. SubFit instead selects individual Attention and FeedForward submodules non-contiguously and gives each one a lightweight fitted residual bypass. It operates post-training and needs only calibration data. The method was evaluated on ten LLMs (five base, five instruction-tuned) at five sparsity levels from 12.5% to 37.5%, against four replacement-based baselines. It consistently achieved the best aggregate perplexity-accuracy trade-off, with notable gains under aggressive compression. At 25% sparsity, SubFit retained 84.6% of dense downstream accuracy with 2.42x perplexity degradation, versus 81.6% accuracy and 4.34x degradation for the strongest baselines. The method also delivers measurable inference speedup and KV-cache savings.

Key takeaway

For machine learning engineers optimizing LLM deployment, SubFit is an alternative to full-layer replacement. If full-layer methods cost too much quality at your target sparsity, consider submodule-level compression: the reported perplexity-accuracy trade-off is better, particularly under aggressive sparsity. The method also reduces inference latency and KV-cache footprint, which helps when deploying on resource-constrained hardware.

Key insights

LLM redundancy is submodule-level and non-contiguous, enabling better compression via fitted residual bypasses.

Principles

Redundancy in transformers is not contiguous.
Redundancy varies between Attention and FeedForward.
Different strategies suit different submodule types.
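
To make the non-contiguous, per-type idea concrete, here is a minimal sketch of how submodules might be ranked and picked independently of their position. The scoring rule (cosine similarity between the residual stream before and after a submodule) is a common redundancy heuristic used here as an assumption, not SubFit's confirmed selection criterion. The names `redundancy_score` and `select_submodules` are hypothetical.

```python
import numpy as np

def redundancy_score(x, out):
    """Mean cosine similarity between the residual stream before (x)
    and after (x + out) a submodule. Higher means the submodule changes
    the stream less. ASSUMPTION: illustrative heuristic only."""
    after = x + out
    num = np.sum(x * after, axis=-1)
    den = np.linalg.norm(x, axis=-1) * np.linalg.norm(after, axis=-1) + 1e-12
    return float(np.mean(num / den))

def select_submodules(scores, k):
    """scores maps (layer_index, kind) -> score, with kind in {"attn", "ffn"}.
    Returns the k highest-scoring submodules, sorted by position.
    Nothing forces the picks to be adjacent or of the same kind."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return sorted(ranked)

# Toy scores for a 3-layer model (made-up numbers for illustration).
toy_scores = {
    (0, "attn"): 0.50, (0, "ffn"): 0.90,
    (1, "attn"): 0.95, (1, "ffn"): 0.20,
    (2, "attn"): 0.10, (2, "ffn"): 0.97,
}
picked = select_submodules(toy_scores, k=3)
```

With these toy scores the selection mixes Attention and FeedForward submodules from different layers, which a full-layer, contiguous scheme cannot express.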

Method

SubFit selects Attention and FeedForward submodules non-contiguously, fitting each with its own lightweight residual bypass. It operates post-training using only calibration data.
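
The fitting step can be sketched as follows. The summary does not say how the bypass is parameterized or fitted, so this example assumes a dense linear map fitted by ridge regression on calibration activations. `fit_linear_bypass`, `apply_bypass`, and the `ridge` parameter are hypothetical names, not taken from the SubFit code.

```python
import numpy as np

def fit_linear_bypass(x, target, ridge=1e-3):
    """Fit W minimizing ||x @ W - target||^2 + ridge * ||W||^2.

    x:      (tokens, d) calibration inputs to the submodule
    target: (tokens, d) the submodule's original outputs on those inputs
    ASSUMPTION: a linear bypass; the real method may differ."""
    d = x.shape[1]
    gram = x.T @ x + ridge * np.eye(d)
    return np.linalg.solve(gram, x.T @ target)

def apply_bypass(x, w):
    """Residual form: the stream keeps x and adds the cheap fitted term
    in place of the removed submodule's output."""
    return x + x @ w

# Synthetic calibration data standing in for real activations.
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 16))
true_w = 0.1 * rng.standard_normal((16, 16))
target = x @ true_w  # pretend the submodule is exactly linear

w = fit_linear_bypass(x, target, ridge=1e-8)
approx = apply_bypass(x, w)
```

On this synthetic data the fit recovers the map almost exactly. A real submodule is nonlinear, so the bypass would only approximate it, and the residual error is what shows up as perplexity degradation.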

In practice

Achieve a better perplexity-accuracy trade-off.
Enable aggressive LLM compression.
Reduce inference latency and KV-cache footprint.
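
As a rough illustration of where KV-cache savings could come from, the sketch below uses the standard KV-cache size formula and assumes that a replaced Attention submodule no longer stores keys and values. That assumption, the function name `kv_cache_bytes`, and the model dimensions are all illustrative and not taken from the source.

```python
def kv_cache_bytes(n_attn, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """KV-cache size: keys and values (factor 2) for every attention
    submodule that is still present. bytes_per_elem=2 means fp16/bf16."""
    return 2 * n_attn * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical model: 32 attention submodules, 8 KV heads, head_dim 128.
dense = kv_cache_bytes(n_attn=32, n_kv_heads=8, head_dim=128, seq_len=4096)

# ASSUMPTION: 8 of the 32 attention submodules were replaced by bypasses.
compressed = kv_cache_bytes(n_attn=24, n_kv_heads=8, head_dim=128, seq_len=4096)

saving_fraction = 1 - compressed / dense
```

Under these assumptions the cache shrinks in proportion to the number of Attention submodules removed. Replacing FeedForward submodules would not change it, which is one reason the Attention/FeedForward split matters for deployment.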

Topics

LLM Compression
Submodule Sparsity
Transformer Architecture
Perplexity-Accuracy
Inference Optimization
SubFit

Code references

eliacunegatti/SubFit

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.