Two-dimensional early exit optimisation of LLM inference

Source: cs.CL updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

A novel two-dimensional (2D) early exit strategy has been introduced to optimize large language model (LLM) inference for classification tasks. The method coordinates layer-wise and sentence-wise exiting: it processes the input sentence by sentence while progressively activating deeper layers. Because the two dimensions are coordinated, the computational savings are multiplicative and exceed what optimizing either dimension independently achieves. Experiments on four LLMs (Llama 3.1, Llama 3.2, Gemma, Qwen; 3B-8B parameters) across three sentiment classification datasets showed additional speed-ups of 1.4–2.3× over optimal layer-wise early exit on simpler tasks with vanilla models. Fine-tuning reduced this advantage but did not eliminate it. The 2D early exit strategy is model-agnostic, requires only lightweight classification adapters, and is orthogonal to other efficiency methods such as quantization and pruning.

Key takeaway

For AI engineers optimizing LLM inference for high-throughput classification tasks, a 2D early exit strategy can yield significant speed-ups, particularly for simpler problems with vanilla models. The method is worth evaluating because it offers multiplicative computational savings beyond traditional layer-wise early exiting while requiring only lightweight adapters. Fine-tuning may diminish the 2D advantage, so consider adapter-only training or specialized fine-tuning strategies that preserve early-layer accuracy gradients to maximize efficiency gains.

Key insights

Coordinating layer-wise and sentence-wise early exits multiplicatively boosts LLM inference speed for classification.
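
A back-of-the-envelope sketch of why the savings multiply. It assumes a deliberately simplified cost model in which compute scales linearly with both the fraction of layers executed and the fraction of sentences read; real attention cost grows faster than linearly with sequence length, so this is only an illustration. The function names and the 0.5 fractions are hypothetical and are not taken from the paper.

```python
def relative_cost(layer_fraction: float, sentence_fraction: float) -> float:
    """Cost relative to a full forward pass under a linear-in-both-dimensions model."""
    for fraction in (layer_fraction, sentence_fraction):
        if not 0.0 < fraction <= 1.0:
            raise ValueError("fractions must be in (0, 1]")
    return layer_fraction * sentence_fraction


def speedup(layer_fraction: float, sentence_fraction: float) -> float:
    """Speed-up over running every layer on every sentence."""
    return 1.0 / relative_cost(layer_fraction, sentence_fraction)


# Illustrative fractions only (assumed, not measured).
layer_only = speedup(0.5, 1.0)     # exit halfway up the stack, read all sentences
sentence_only = speedup(1.0, 0.5)  # run every layer, read half the sentences
combined = speedup(0.5, 0.5)       # exit early in both dimensions

print(layer_only, sentence_only, combined)
```

Under this toy model the combined speed-up is the product of the two individual speed-ups rather than their sum, which is the sense in which the savings are multiplicative.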

Method

Process the input sentence by sentence while progressively activating deeper layers, and halt inference as soon as the accumulated confidence (the difference between the top two softmax outputs) exceeds a threshold.
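
The exit rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `logits_fn` is a hypothetical callback standing in for "run the first `depth` layers on this sentence prefix and apply the classification adapter", the proportional depth schedule is assumed, and the confidence is simply recomputed on each longer prefix because the summary does not specify how confidence is accumulated.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of class logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def margin_confidence(logits):
    """Difference between the two largest softmax probabilities (needs >= 2 classes)."""
    probs = np.sort(softmax(logits))
    return float(probs[-1] - probs[-2])


def two_d_early_exit(sentences, logits_fn, num_layers, threshold):
    """Return (label, sentences_read, layers_used, confidence) at the first confident step.

    logits_fn(prefix, depth) is a hypothetical callback returning class logits
    for the given sentence prefix after `depth` layers.
    """
    n = len(sentences)
    if n == 0:
        raise ValueError("need at least one sentence")
    for k in range(1, n + 1):
        # Assumed schedule: depth grows in proportion to the sentences read.
        depth = max(1, round(num_layers * k / n))
        logits = logits_fn(sentences[:k], depth)
        conf = margin_confidence(logits)
        # Exit once the margin exceeds the threshold; the last step always
        # answers, since by then the full input and all layers have been used.
        if conf > threshold or k == n:
            return int(np.argmax(logits)), k, depth, conf
```

In a real system `logits_fn` would reuse cached activations from earlier steps rather than recompute them, and the threshold would be tuned on held-out data to trade accuracy against compute.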

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.