LLM Compression by Block Removal with Constrained Binary Optimization
Summary
The paper introduces Constrained Binary Optimization (CBO), an LLM compression method that removes entire transformer blocks. It formulates block removal as a combinatorial problem mapped to an Ising model, in which low-energy solutions correlate with high downstream performance. CBO ranks many candidate configurations efficiently and finds high-quality, non-trivial solutions that are not limited to runs of consecutive blocks. On Llama-3.1-8B-Instruct and Qwen3-14B, the method outperforms existing block-removal techniques, with MMLU scores up to 6 points higher after short retraining. It also generalizes to heterogeneous architectures such as NVIDIA-Nemotron-3-Nano-30B-A3B-FP8. With 2 of its 23 MoE blocks removed and no retraining, that model retains 94% and 88% accuracy on GPQA and AIME25, respectively; with 3 blocks removed, it retains 91% and 74%.
Key takeaway
If you are a machine learning engineer optimizing LLM deployment on resource-constrained devices, consider CBO for structural pruning. Evaluate not only the lowest-energy block-removal configuration but also the low-lying excited states, since these often perform better on benchmarks such as MMLU, particularly for models like Llama-3.1-8B-Instruct or Qwen3-14B. The approach improves general-knowledge retention for a given number of removed blocks, and it extends to complex MoE architectures.
Key insights
Block removal in LLMs can be optimized by mapping it to a constrained binary problem, revealing diverse high-performing configurations.
Principles
- Block interactions are crucial for optimal pruning.
- Low-energy states in CBO yield high-quality solutions.
- The exact ground state is not always optimal.
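To make the role of block interactions concrete, a generic quadratic energy over binary removal variables can be written as follows. This is a sketch under assumptions: the symbols $x_i$, $h_i$, $J_{ij}$, $s_i$, $N$, and $k$ are illustrative choices, not notation taken from the paper, which may parameterize the problem differently.

```latex
% x_i = 1 means block i is removed; exactly k of the N blocks are removed.
E(x) = \sum_{i=1}^{N} h_i x_i + \sum_{i<j} J_{ij}\, x_i x_j,
\qquad x_i \in \{0,1\}, \qquad \sum_{i=1}^{N} x_i = k .

% Standard change of variables to Ising spins, x_i = (1 - s_i)/2 with s_i = \pm 1:
E(s) = \text{const}
  - \sum_{i=1}^{N} \Big( \tfrac{1}{2} h_i + \tfrac{1}{4} \sum_{j \neq i} J_{ij} \Big) s_i
  + \tfrac{1}{4} \sum_{i<j} J_{ij}\, s_i s_j,
\qquad \sum_{i=1}^{N} s_i = N - 2k .
```

If every $J_{ij}$ were zero, the minimum would simply be the $k$ blocks with the smallest $h_i$. Nonzero couplings make the best set depend on which blocks are removed together. Because such an energy only approximates the true loss, its exact minimum need not be the best-performing configuration.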
Method
Formulate block removal as a CBO problem using a second-order Taylor expansion of the loss with auxiliary block-selection variables. Approximate the Hessian and solve the resulting Ising model to find low-energy block configurations.
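A minimal sketch of the final step, assuming the linear terms `h` and pairwise couplings `J` have already been estimated. The function name `rank_block_removals`, the toy numbers, and the brute-force enumeration are illustrative assumptions, not the paper's solver or data.

```python
import itertools
import numpy as np

def rank_block_removals(h, J, k):
    """Enumerate every way to remove exactly k blocks and sort by energy.

    h: (n,) cost of removing each block alone (hypothetical input)
    J: (n, n) symmetric pairwise couplings, zero diagonal (hypothetical input)
    Returns a list of (energy, removed_blocks), lowest energy first.
    """
    h = np.asarray(h, dtype=float)
    J = np.asarray(J, dtype=float)
    n = h.shape[0]
    results = []
    for removed in itertools.combinations(range(n), k):
        idx = list(removed)
        # Linear term plus each pair counted once (J is symmetric, so halve the sum).
        energy = h[idx].sum() + 0.5 * J[np.ix_(idx, idx)].sum()
        results.append((float(energy), removed))
    results.sort(key=lambda t: t[0])
    return results

# Toy problem with 5 blocks; all values are made up for illustration.
h = [0.9, 0.2, 0.3, 0.25, 0.8]
J = np.zeros((5, 5))
J[1, 2] = J[2, 1] = 0.6    # removing blocks 1 and 2 together is costly
J[1, 3] = J[3, 1] = -0.1   # removing blocks 1 and 3 together is slightly cheaper
ranked = rank_block_removals(h, J, k=2)
print(ranked[:3])
```

In this toy case the lowest-energy pair is the non-consecutive set (1, 3), although blocks 1 and 2 have the smallest individual costs. Exhaustive enumeration grows combinatorially with the number of blocks, so a realistic setup would likely need a dedicated solver or heuristic instead.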
In practice
- Apply CBO to Llama-3.1-8B-Instruct or Qwen3-14B.
- Explore low-lying excited states, not only the ground state, to obtain diverse pruning candidates.
- Use it for heterogeneous MoE architectures like Nemotron-3-Nano.
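The advice to look beyond the ground state can be sketched as a re-ranking step: take the few lowest-energy configurations and keep whichever scores best on a real evaluation. The helper `select_best_candidate`, the `evaluate` callable, and all numbers below are hypothetical stand-ins, not results or code from the paper.

```python
def select_best_candidate(ranked, evaluate, top_m=5):
    """Score the top_m lowest-energy configurations and return the best one.

    ranked: list of (energy, removed_blocks), sorted by ascending energy
    evaluate: callable mapping removed_blocks -> score, higher is better;
              a hypothetical stand-in for running a benchmark on the pruned model
    """
    candidates = ranked[:top_m]
    scored = [(evaluate(removed), energy, removed) for energy, removed in candidates]
    best_score, best_energy, best_removed = max(scored, key=lambda t: t[0])
    return best_removed, best_score, best_energy

# Made-up energies and benchmark scores, for illustration only.
ranked = [(0.35, (1, 3)), (0.55, (2, 3)), (1.05, (3, 4))]
toy_scores = {(1, 3): 0.61, (2, 3): 0.64, (3, 4): 0.58}

best_removed, best_score, best_energy = select_best_candidate(
    ranked, toy_scores.get, top_m=3
)
print(best_removed, best_score, best_energy)
```

Here the first excited state (2, 3) beats the ground state (1, 3) on the toy score, which is the situation the third principle describes. The value of `top_m` trades evaluation cost against the chance of finding a better configuration.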
Topics
- LLM Compression
- Block Pruning
- Constrained Binary Optimization
- Ising Model
- Transformer Architectures
- Mixture-of-Experts
Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.