LLM Compression by Block Removal with Constrained Binary Optimization

2026-01-14 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

The paper introduces Constrained Binary Optimization (CBO), an LLM compression method that removes entire transformer blocks. It formulates block removal as a combinatorial problem and maps it to an Ising model whose low-energy solutions correlate with high downstream performance. CBO efficiently ranks a large number of candidate configurations and finds high-quality, non-trivial solutions that are not limited to runs of consecutive blocks. It outperforms existing block-removal techniques on Llama-3.1-8B-Instruct and Qwen3-14B, with MMLU improvements of up to 6 points after short retraining. It also generalizes to heterogeneous architectures such as NVIDIA-Nemotron-3-Nano-30B-A3B-FP8: without any retraining, removing 2 of the 23 MoE blocks retains 94% accuracy on GPQA and 88% on AIME25, while removing 3 retains 91% and 74%, respectively.

Key takeaway

Machine learning engineers optimizing LLM deployment on resource-constrained devices should consider CBO for structural pruning. Evaluate not only the lowest-energy block-removal configuration but also the low-lying excited states, since these often perform better on benchmarks such as MMLU for models like Llama-3.1-8B-Instruct and Qwen3-14B. The approach retains more general knowledge than existing block-removal techniques at the same reduction in depth, and it extends to complex MoE architectures.

Key insights

Block removal in LLMs can be optimized by mapping it to a constrained binary problem, revealing diverse high-performing configurations.

Principles

Block interactions are crucial for optimal pruning.
Low-energy states in CBO yield high-quality solutions.
The exact ground state is not always optimal.

Method

Formulate block removal as a CBO problem using a second-order Taylor expansion of the loss with auxiliary block-selection variables. Approximate the Hessian and solve the resulting Ising model to find low-energy block configurations.
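One way to write this down is sketched below. The parameterization is an assumption made for illustration and may differ from the paper's: each block i is scaled by a gate α_i (1 keeps the block, 0 removes it), z_i = 1 − α_i marks removal, N is the number of blocks, and k is the number of blocks to remove. The symbols g, H, J, and h are introduced here and are not taken from the source.

```latex
% Assumed setup (illustrative): x_{i+1} = x_i + \alpha_i f_i(x_i), with z_i = 1 - \alpha_i \in \{0,1\}.
% H is taken to be symmetric.
\begin{align}
\mathcal{L}(\mathbf{1} - z) &\approx \mathcal{L}(\mathbf{1}) - g^\top z + \tfrac{1}{2}\, z^\top H z,
  \qquad g_i = \left.\frac{\partial \mathcal{L}}{\partial \alpha_i}\right|_{\alpha=\mathbf{1}},\quad
  H_{ij} = \left.\frac{\partial^2 \mathcal{L}}{\partial \alpha_i\, \partial \alpha_j}\right|_{\alpha=\mathbf{1}} \\
z^\star &= \arg\min_{z \in \{0,1\}^N} \Big( -g^\top z + \tfrac{1}{2}\, z^\top H z \Big)
  \quad \text{s.t.} \quad \sum_{i=1}^{N} z_i = k \\
E(s) &= \sum_{i<j} J_{ij}\, s_i s_j + \sum_{i} h_i\, s_i + \text{const},
  \qquad z_i = \tfrac{1 - s_i}{2},\; s_i \in \{-1,+1\} \\
J_{ij} &= \tfrac{1}{4} H_{ij}, \qquad h_i = \tfrac{1}{2}\, g_i - \tfrac{1}{4} \sum_{j} H_{ij},
  \qquad \sum_{i} s_i = N - 2k
\end{align}
```

The off-diagonal entries of H are what couple blocks to each other. Dropping them reduces the problem to ranking blocks one at a time, which is the independent-scoring baseline that the interaction terms are meant to improve on.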

In practice

Apply CBO to Llama-3.1-8B-Instruct or Qwen3-14B.
Explore low-lying excited states for diverse pruning.
Use for heterogeneous MoE architectures like Nemotron-3-Nano.
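A minimal sketch of the "explore low-lying excited states" step, assuming the linear and pairwise coefficients of the binary energy have already been estimated. The function names, the coefficient values, and the exhaustive enumeration are all illustrative assumptions; the source does not say which solver the paper uses, and brute force is only feasible for small block counts.

```python
import heapq
from itertools import combinations

import numpy as np


def energy(removed, linear, pairwise):
    """Energy of removing the given blocks: sum_i a_i + sum_{i<j} B_ij over removed blocks.

    `pairwise` is assumed symmetric with a zero diagonal, hence the factor 0.5.
    """
    idx = np.fromiter(removed, dtype=int)
    return float(linear[idx].sum() + 0.5 * pairwise[np.ix_(idx, idx)].sum())


def lowest_energy_configs(linear, pairwise, k, top_m=5):
    """Rank every way of removing exactly k blocks and return the top_m lowest energies.

    The first entry is the ground state; the rest are low-lying excited states
    that are worth evaluating on downstream benchmarks as well.
    """
    n = len(linear)
    scored = ((energy(c, linear, pairwise), c) for c in combinations(range(n), k))
    return heapq.nsmallest(top_m, scored)


# Toy coefficients chosen by hand for illustration (NOT values from the paper).
linear = np.array([4.0, 1.0, 2.0, 8.0])   # hypothetical per-block removal cost
pairwise = np.zeros((4, 4))
pairwise[1, 2] = pairwise[2, 1] = 10.0    # hypothetical strong coupling between blocks 1 and 2

for e, blocks in lowest_energy_configs(linear, pairwise, k=2, top_m=3):
    print(f"remove blocks {blocks}: energy {e:.1f}")
```

In this toy instance blocks 1 and 2 are individually the cheapest to remove, yet removing them together ranks last because of the coupling term. The candidates returned after the first entry are the excited states; in practice each would be evaluated on a held-out benchmark before picking one.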

Topics

LLM Compression
Block Pruning
Constrained Binary Optimization
Ising Model
Transformer Architectures
Mixture-of-Experts

Code references

NVIDIA-NeMo/Skills

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.