SimDiff: Depth Pruning via Similarity and Difference
Summary
SimDiff is a novel depth pruning criterion for large language models (LLMs) that enhances deployment efficiency by identifying and removing redundant layers. Unlike traditional methods that rely solely on cosine distance for layer similarity, SimDiff evaluates layers using two orthogonal perspectives: representational similarity and transformation difference. The transformation difference is quantified by two metrics: MSSD, which detects layers making decisive corrections and is sensitive to outliers, and MASD, which robustly measures a layer's average contribution. Experiments on models ranging from 0.5B to 13B parameters show SimDiff significantly outperforms existing baselines across various pruning ratios. For instance, it retains over 91% of LLaMA2-7B's performance at a 25% pruning ratio and achieves up to a 1.49x inference speedup when pruning 12 layers on LLaMA3.1-8B, with pruned models effectively recoverable via minimal fine-tuning.
Key takeaway
For AI engineers optimizing LLM deployment, SimDiff offers an alternative to depth pruning methods that rely on cosine distance alone. Combining representational similarity with transformation difference improves both speed and accuracy retention: up to 1.49x faster inference on LLaMA3.1-8B when pruning 12 layers, and over 91% of LLaMA2-7B's performance retained at a 25% pruning ratio. Consider adding SimDiff to your pruning workflow, since the pruned models can be recovered with minimal fine-tuning.
Key insights
SimDiff improves LLM depth pruning by combining representational similarity with two transformation difference metrics.
Principles
- Relying solely on cosine distance to measure layer similarity can lead to unpredictable pruning performance.
- Evaluating layers from two orthogonal perspectives, representational similarity and transformation difference, improves layer importance assessment.
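One reason a cosine-only criterion can mislead is that cosine similarity ignores magnitude. The sketch below is an illustration, not the paper's code: the vectors and the helper names `cosine_similarity` and `mean_abs_diff` are made up for this example. It shows two hypothetical layers that look identical under cosine similarity even though one of them changes the hidden state substantially.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def mean_abs_diff(x, y):
    """Average absolute element-wise change from x to y."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical hidden state entering a layer (toy values).
h_in = [1.0, -2.0, 0.5, 3.0]

# Layer A passes its input through unchanged; layer B doubles it.
h_out_a = list(h_in)
h_out_b = [2.0 * v for v in h_in]

cos_a = cosine_similarity(h_in, h_out_a)
cos_b = cosine_similarity(h_in, h_out_b)
diff_a = mean_abs_diff(h_in, h_out_a)
diff_b = mean_abs_diff(h_in, h_out_b)

print(f"layer A: cosine={cos_a:.3f}, mean abs diff={diff_a:.3f}")
print(f"layer B: cosine={cos_b:.3f}, mean abs diff={diff_b:.3f}")
```

Both layers score the same cosine similarity, so a cosine-only criterion would treat them as equally redundant. A difference-based measure separates them.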
Method
SimDiff jointly evaluates layers using representational similarity and transformation difference, quantified by MSSD (outlier-sensitive) and MASD (average contribution) metrics.
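The summary does not give the exact formulas for MSSD and MASD, so the following is only a rough sketch under stated assumptions. It uses a mean squared difference as a stand-in for an outlier-sensitive measure and a mean absolute difference as a stand-in for a robust average measure. The function `layer_scores` and the toy hidden states are hypothetical and are not taken from the paper.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def layer_scores(h_in, h_out):
    """Score one layer from its input and output hidden states.

    h_in and h_out are lists of token vectors (lists of floats).
    The two difference scores are assumed stand-ins, not the paper's
    definitions of MSSD and MASD.
    """
    sims = [cosine_similarity(a, b) for a, b in zip(h_in, h_out)]
    diffs = [o - i for a, b in zip(h_in, h_out) for i, o in zip(a, b)]
    return {
        "similarity": sum(sims) / len(sims),
        # Squaring amplifies large individual changes (outlier-sensitive).
        "squared_diff": sum(d * d for d in diffs) / len(diffs),
        # Absolute values weight every change linearly (more robust).
        "abs_diff": sum(abs(d) for d in diffs) / len(diffs),
    }

# Toy input: two tokens with four hidden dimensions each.
h_in = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]

# "Spread" layer: every element shifts by 0.5.
h_spread = [[1.5, 1.5, 1.5, 1.5], [2.5, 2.5, 2.5, 2.5]]
# "Spike" layer: a single element shifts by 4.0, the rest are unchanged.
h_spike = [[5.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]

spread = layer_scores(h_in, h_spread)
spike = layer_scores(h_in, h_spike)
print("spread:", spread)
print("spike: ", spike)
```

In this toy case both layers have the same average absolute change, but the squared measure is much larger for the layer that makes one large correction. That is the kind of distinction an outlier-sensitive metric is meant to capture.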
In practice
- Achieves up to a 1.49x inference speedup on LLaMA3.1-8B when pruning 12 layers.
- Retains over 91% of LLaMA2-7B's performance at a 25% pruning ratio.
- Pruned models are recoverable with minimal fine-tuning.
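To make the pruning ratios concrete, here is a minimal sketch of turning per-layer importance scores into a list of layers to drop. The scores below are invented, and `layers_to_prune` is a hypothetical helper. How SimDiff combines its metrics into a single ranking is not described in this summary, so the sketch simply assumes that a lower score means a more redundant layer.

```python
def layers_to_prune(importance, ratio):
    """Return sorted indices of the lowest-importance layers.

    importance: one score per layer (assumed: lower = more redundant).
    ratio: fraction of layers to remove, e.g. 0.25.
    """
    k = round(len(importance) * ratio)
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    return sorted(ranked[:k])

# Made-up importance scores for a toy 8-layer model.
importance = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3, 0.6, 0.95]

pruned = layers_to_prune(importance, 0.25)
kept = [i for i in range(len(importance)) if i not in pruned]
print("prune:", pruned)
print("keep: ", kept)

# For a 32-layer model, a 25% ratio corresponds to 8 layers,
# and removing 12 layers corresponds to a 37.5% ratio.
layers_at_25_percent = round(32 * 0.25)
ratio_for_12_layers = 12 / 32
```

The last two lines assume a 32-layer decoder stack, which matches the published configurations of LLaMA2-7B and LLaMA3.1-8B.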
Topics
- Depth Pruning
- Large Language Models
- SimDiff
- Representational Similarity
- Transformation Difference
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.