Lost in Quantization: Activation Outliers Explain Language-Specific FP8 Sensitivity in Llama-3
Summary
A study compared INT8 and FP8 (E4M3) quantization of Meta-Llama-3-8B on English and Brazilian Portuguese (PT-BR) text to test whether the effects of quantization depend on language. INT8 quantization with outlier handling preserved perplexity in both languages. Naive FP8 casting, by contrast, raised English perplexity by 18% but PT-BR perplexity by only 3.9%. Activation analysis showed that English text produced rarer but larger activation spikes, with values exceeding 35, which are more prone to saturation under unscaled E4M3. PT-BR activations were more concentrated, which explains the lower sensitivity to naive FP8.
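To make the number format concrete, here is a minimal sketch of an unscaled cast to E4M3, assuming the common E4M3 layout (3 mantissa bits, largest finite magnitude 448, smallest normal 2**-6). The function name `round_to_e4m3` is hypothetical and the code is a pure-Python simulation, not the paper's pipeline; the summary does not describe how the paper configured its cast. It shows two properties of the format: the gap between representable values widens as magnitude grows, and anything beyond the range limit is clipped.

```python
import math

E4M3_MAX = 448.0      # largest finite E4M3 magnitude (assumed layout)
MANTISSA_BITS = 3
MIN_NORMAL_EXP = -6   # exponent of the smallest normal value, 2**-6


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, clipping at +/-448."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)          # saturating cast: clip instead of overflowing
    _, e = math.frexp(mag)               # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)     # subnormals share the smallest normal's spacing
    step = 2.0 ** (exp - MANTISSA_BITS)  # gap between neighbouring values at this magnitude
    return sign * min(round(mag / step) * step, E4M3_MAX)


# Illustrative inputs only: small values, a spike near 35, and an out-of-range value.
for value in (0.3, 3.0, 35.0, 37.0, 500.0):
    q = round_to_e4m3(value)
    print(f"{value:>6} -> {q:>8}  abs error {abs(q - value):.4f}")
```

In this simulation the representable values between 32 and 64 are 4 apart, so a spike near 35 can be off by up to 2 after the cast, while values below 1 are off by at most a few hundredths.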
Key takeaway
Quantization sensitivity can differ by language, so AI engineers optimizing LLM inference should evaluate quantized models on the languages they actually serve. If your models mainly process English, naive FP8 casting can substantially degrade quality because of activation outliers. Prefer robust techniques such as INT8 with outlier handling, or evaluate calibrated or scaled FP8 recipes, to limit these language-specific losses and maintain model accuracy.
Key insights
Activation outliers explain language-specific FP8 quantization sensitivity in LLMs like Llama-3.
Principles
- Naive FP8 casting degrades Llama-3 perplexity more on English than on Brazilian Portuguese.
- Rarer, larger activation spikes increase FP8 saturation risk.
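One way to check whether a set of activations has "rarer, larger spikes" is to compare the peak magnitude with the share of values above a threshold. The sketch below is an illustration under stated assumptions: the function name `spike_stats` is hypothetical, the default threshold of 35 is borrowed from the figure quoted in the summary, and the two sample lists are made-up numbers, not data from the study.

```python
def spike_stats(activations, threshold=35.0):
    """Summarize how heavy-tailed a list of activation values is."""
    mags = [abs(a) for a in activations]
    n = len(mags)
    return {
        "max_abs": max(mags),                                      # size of the largest spike
        "mean_abs": sum(mags) / n,                                 # typical magnitude
        "spike_fraction": sum(m > threshold for m in mags) / n,    # how rare the spikes are
    }


# Made-up samples: one concentrated distribution, one with a single large spike.
concentrated = [0.5, -1.2, 2.0, -0.8, 1.5, -2.2, 0.9, 1.1]
spiky = [0.4, -0.6, 0.3, 41.0, -0.5, 0.2, 0.7, -0.1]

print(spike_stats(concentrated))
print(spike_stats(spiky))
```

A high `max_abs` paired with a low `spike_fraction` is the pattern the summary describes for English text.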
Method
The study compared INT8 with outlier handling and naive FP8 (E4M3) casting on Meta-Llama-3-8B for English and Brazilian Portuguese, analyzing perplexity and activation distributions.
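The perplexity comparison rests on simple arithmetic: perplexity is the exponential of the mean per-token negative log-likelihood, and the reported percentages are relative increases over the unquantized baseline. A minimal sketch follows, with hypothetical function names and made-up per-token losses; none of the numbers come from the paper.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_increase(baseline_ppl, quantized_ppl):
    """Fractional change in perplexity relative to the baseline model."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


# Made-up per-token losses for the same text under two model variants.
baseline_nlls = [2.10, 1.95, 2.30, 2.05]
quantized_nlls = [2.25, 2.10, 2.50, 2.15]

base = perplexity(baseline_nlls)
quant = perplexity(quantized_nlls)
print(f"baseline {base:.2f}, quantized {quant:.2f}, "
      f"increase {100 * relative_increase(base, quant):.1f}%")
```

Running the same comparison separately on each language's evaluation set yields one percentage per language, which is the form of the 18% and 3.9% figures above.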
In practice
- Consider outlier handling for INT8 quantization.
- Evaluate FP8 quantization with language-specific datasets.
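The summary does not say which outlier-handling scheme the study used for INT8. As one possible illustration, the sketch below keeps values above a magnitude threshold in full precision and quantizes the rest with absmax INT8 scaling, then compares the error against plain absmax INT8. The function names, the threshold of 6.0, and the sample values are all assumptions made for the example.

```python
def int8_plain(values):
    """Absmax INT8: one scale for the whole list, set by the largest magnitude."""
    scale = max(abs(v) for v in values) / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def int8_with_outliers(values, threshold=6.0):
    """Keep |v| > threshold in full precision; absmax INT8 for everything else."""
    inlier_max = max((abs(v) for v in values if abs(v) <= threshold), default=0.0)
    scale = inlier_max / 127.0 if inlier_max > 0 else 1.0
    out = []
    for v in values:
        if abs(v) > threshold:
            out.append(v)  # outlier bypasses INT8 entirely
        else:
            out.append(max(-127, min(127, round(v / scale))) * scale)
    return out


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Made-up activations: four small values and one large spike.
acts = [0.1, -0.25, 0.4, 0.05, 40.0]
print("plain INT8 max error:   ", max_abs_error(acts, int8_plain(acts)))
print("with outlier handling:  ", max_abs_error(acts, int8_with_outliers(acts)))
```

With a single scale, the spike stretches the INT8 grid so far that the small values collapse onto a few levels; separating the spike lets the remaining values use the full 8-bit range.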
Topics
- Quantization
- LLM Inference
- FP8 Quantization
- Activation Outliers
- Llama-3
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.