Two-dimensional early exit optimisation of LLM inference
Summary
A novel two-dimensional (2D) early exit strategy optimizes large language model (LLM) inference for classification tasks. It coordinates layer-wise and sentence-wise exiting: the input is processed incrementally, sentence by sentence, while progressively deeper layers are activated. This yields multiplicative computational savings, outperforming independent optimization of either dimension. Experiments on four LLMs (Llama 3.1, Llama 3.2, Gemma, Qwen; 3B–8B parameters) across three sentiment classification datasets showed additional speed-ups of 1.4–2.3× over optimal layer-wise early exit on simpler tasks with vanilla models. Fine-tuning reduced this advantage but did not eliminate it. The 2D early exit strategy is model-agnostic, requires only lightweight classification adapters, and is orthogonal to other efficiency methods such as quantization and pruning.
Key takeaway
For AI engineers optimizing LLM inference for high-throughput classification tasks, a 2D early exit strategy can yield significant speed-ups, particularly on simpler problems with vanilla models. The method is worth evaluating because it offers multiplicative computational savings beyond traditional layer-wise early exiting while requiring only lightweight adapters. Fine-tuning may diminish the 2D advantage, so consider adapter-only training or specialized fine-tuning strategies that preserve early-layer accuracy gradients to maximize efficiency gains.
Key insights
Coordinating layer-wise and sentence-wise early exits multiplicatively boosts LLM inference speed for classification.
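To make "multiplicative" concrete, here is a back-of-envelope sketch. It assumes, as a simplification not taken from the article, that compute scales linearly with both the fraction of layers executed and the fraction of the input processed. The function name and the example fractions are hypothetical and are not results from the paper.

```python
def relative_cost(layer_fraction: float, input_fraction: float) -> float:
    """Fraction of full-inference compute, ASSUMING cost is linear in
    both the layers executed and the input processed (a simplification)."""
    if not (0 < layer_fraction <= 1 and 0 < input_fraction <= 1):
        raise ValueError("fractions must be in (0, 1]")
    return layer_fraction * input_fraction


# Hypothetical example: exit after half the layers.
layer_only = relative_cost(layer_fraction=0.5, input_fraction=1.0)

# Same layer budget, but also stop after half of the sentences.
two_d = relative_cost(layer_fraction=0.5, input_fraction=0.5)

# Extra speed-up of the 2D exit over the layer-wise exit alone.
extra_speedup = layer_only / two_d
print(f"layer-only cost: {layer_only}, 2D cost: {two_d}, extra speed-up: {extra_speedup}x")
```

Under this simplified model, any reduction in the input fraction multiplies directly onto the layer-wise saving, which is the sense in which the two dimensions compound.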
Principles
- Semantic information accumulates predictably across input structure.
- Fine-tuning can flatten accuracy gradients, reducing early exit benefits.
Method
Process the input sentence by sentence while progressively activating deeper layers, and halt inference when the accumulated confidence (the difference between the top two softmax outputs) exceeds a threshold.
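The loop described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `classify` is a hypothetical callable standing in for "run the model on the sentence prefix up to a given layer and apply that layer's adapter", the linear sentence-to-layer schedule is an assumption, and the sketch checks the margin at each step against a single threshold rather than reproducing the paper's exact accumulation rule.

```python
import math
from typing import Callable, NamedTuple, Sequence


class ExitResult(NamedTuple):
    label: int            # index of the predicted class
    sentences_used: int   # how many sentences were processed
    layer: int            # deepest layer that was activated


def confidence_margin(probs: Sequence[float]) -> float:
    """Difference between the two largest class probabilities."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def two_d_early_exit(
    sentences: Sequence[str],
    num_layers: int,
    classify: Callable[[Sequence[str], int], Sequence[float]],
    threshold: float,
) -> ExitResult:
    """Grow the input prefix and the layer depth together; stop once
    the confidence margin exceeds `threshold`."""
    n = len(sentences)
    if n == 0:
        raise ValueError("need at least one sentence")
    for k in range(1, n + 1):
        # ASSUMPTION: depth grows linearly with the number of sentences
        # read, reaching the final layer at the last sentence.
        layer = min(num_layers, math.ceil(k * num_layers / n))
        probs = classify(sentences[:k], layer)
        # Exit early on a confident prediction, or fall back to the
        # full-input, full-depth prediction at the last step.
        if confidence_margin(probs) > threshold or k == n:
            label = max(range(len(probs)), key=probs.__getitem__)
            return ExitResult(label, k, layer)
    raise AssertionError("unreachable")


def toy_classify(prefix: Sequence[str], layer: int) -> list:
    """Toy stand-in for the model: confidence grows with prefix and depth."""
    p = min(0.95, 0.5 + 0.02 * len(prefix) * layer)
    return [p, 1.0 - p]


result = two_d_early_exit(["s1", "s2", "s3", "s4"], num_layers=8,
                          classify=toy_classify, threshold=0.4)
print(result)
```

In a real system `classify` would reuse cached activations from earlier steps rather than recomputing the prefix; the sketch leaves that out to keep the control flow visible.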
In practice
- Implement lightweight classification adapters for each LLM layer.
- Tune the thresholds τ_ignore (0.3–0.5) and τ_acc based on task complexity.
- Consider this approach for tasks where the semantic signal builds incrementally.
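One way to approach the threshold tuning mentioned above is a sweep over held-out data. The sketch below is an assumption-laden illustration rather than the paper's procedure: it tunes a single generic exit threshold, not τ_ignore and τ_acc specifically, and the trace format (a list of `(margin, is_correct)` pairs per validation example, one pair per exit opportunity) and both function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Trace = List[Tuple[float, bool]]  # (confidence margin, prediction correct?) per step


def evaluate(traces: Sequence[Trace], threshold: float) -> Tuple[float, float]:
    """Return (accuracy, mean fraction of steps used) when exiting at the
    first step whose margin exceeds `threshold`."""
    correct = 0
    cost = 0.0
    for trace in traces:
        exit_idx = len(trace) - 1  # default: run to the final step
        for i, (margin, _) in enumerate(trace):
            if margin > threshold:
                exit_idx = i
                break
        correct += int(trace[exit_idx][1])
        cost += (exit_idx + 1) / len(trace)
    n = len(traces)
    return correct / n, cost / n


def pick_threshold(
    traces: Sequence[Trace],
    candidates: Sequence[float],
    min_accuracy: float,
) -> Optional[Tuple[float, float, float]]:
    """Cheapest candidate threshold that still meets `min_accuracy`.
    Returns (threshold, accuracy, cost), or None if none qualifies."""
    best = None
    for t in candidates:
        acc, cost = evaluate(traces, t)
        if acc >= min_accuracy and (best is None or cost < best[2]):
            best = (t, acc, cost)
    return best


# Made-up validation traces, purely for illustration.
traces = [
    [(0.2, False), (0.6, True), (0.9, True)],
    [(0.5, True), (0.7, True), (0.9, True)],
    [(0.35, False), (0.4, False), (0.8, True)],
]
best = pick_threshold(traces, candidates=[0.3, 0.4, 0.5], min_accuracy=1.0)
print(best)
```

Harder tasks would typically push the chosen threshold upward, since early margins are less reliable; the same sweep can be repeated per model and per dataset.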
Topics
- Two-dimensional Early Exit
- LLM Inference Optimization
- Layer-wise Early Exit
- Sentence-wise Input Trimming
- Sentiment Classification
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.