Lost in Quantization: Activation Outliers Explain Language-Specific FP8 Sensitivity in Llama-3

· Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study compared INT8 and FP8 (E4M3) quantization of Meta-Llama-3-8B on English and Brazilian Portuguese (PT-BR) text to see whether low-precision inference affects the two languages differently. INT8 quantization combined with outlier handling maintained perplexity for both languages. Naive FP8 casting, however, degraded English far more than PT-BR: perplexity rose by 18% for English versus 3.9% for PT-BR. Activation analysis showed that English text produced rarer but larger activation spikes, exceeding 35, which are more susceptible to saturation under unscaled E4M3 quantization. PT-BR activations were more concentrated, which explains the language's lower sensitivity to naive FP8.
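The difference between a naive and a scaled FP8 cast can be illustrated with a small simulation. This is a minimal sketch, not the study's code: it assumes the E4M3 variant with a largest finite magnitude of 448 (the summary does not say which variant was used), the helper names `quantize_e4m3`, `cast_naive`, and `cast_scaled` are hypothetical, and the activation values are made up, with 600.0 chosen only to trigger saturation under the assumed limit.

```python
import math

E4M3_MAX = 448.0        # largest finite magnitude (assumed E4M3 variant)
MIN_NORMAL_EXP = -6     # exponent of the smallest normal binade
MANTISSA_BITS = 3


def quantize_e4m3(x: float) -> float:
    """Round x to the nearest representable E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)          # saturate instead of overflowing
    _, e = math.frexp(mag)               # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)     # clamp so subnormals share one step
    step = 2.0 ** (exp - MANTISSA_BITS)  # spacing of representable values
    return sign * round(mag / step) * step  # round() resolves ties to even


def cast_naive(values):
    """Cast each value directly, with no scaling."""
    return [quantize_e4m3(v) for v in values]


def cast_scaled(values):
    """Per-tensor scaling: map the largest magnitude onto E4M3_MAX first."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [quantize_e4m3(v * scale) / scale for v in values]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Made-up activations: mostly small values plus two spikes.
activations = [0.3, -1.2, 35.0, 600.0]
naive = cast_naive(activations)
scaled = cast_scaled(activations)
print("naive :", naive, "max error", max_abs_error(activations, naive))
print("scaled:", scaled, "max error", max_abs_error(activations, scaled))
```

Under these assumptions the 600.0 entry saturates to 448.0 in the naive cast, while per-tensor scaling keeps it in range. The 35.0 entry rounds to 36.0 in the naive cast because representable values in [32, 64) are spaced 4 apart, so larger activations also carry larger absolute rounding error.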

Key takeaway

Engineers optimizing LLM inference should not assume that a quantization recipe behaves the same across languages. In this study, naive FP8 casting degraded English substantially more than PT-BR because of activation outliers. If your models primarily process English, prefer robust techniques such as INT8 with outlier handling, or investigate calibrated/scaled FP8 recipes, to mitigate this sensitivity and maintain model accuracy.
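One common way to handle outliers in INT8 is to keep the few large-magnitude entries in high precision and quantize only the rest. The sketch below illustrates that idea; the summary does not describe which outlier-handling scheme the study used, so the function names, the threshold of 6.0, and the sample values are all hypothetical.

```python
def quantize_int8_with_outliers(values, threshold=6.0):
    """Absmax INT8 for inliers; entries with |v| >= threshold stay in float.

    The threshold is an illustrative choice, not a value from the study.
    """
    inliers = [v for v in values if abs(v) < threshold]
    amax = max((abs(v) for v in inliers), default=0.0)
    scale = amax / 127.0 if amax > 0 else 1.0
    codes, outliers = [], {}
    for i, v in enumerate(values):
        if abs(v) >= threshold:
            outliers[i] = v   # kept in high precision
            codes.append(0)   # placeholder, overridden when dequantizing
        else:
            codes.append(max(-127, min(127, round(v / scale))))
    return codes, scale, outliers


def dequantize(codes, scale, outliers):
    restored = [c * scale for c in codes]
    for i, v in outliers.items():
        restored[i] = v
    return restored


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Made-up activations with one spike.
activations = [0.3, -1.2, 0.05, 35.0]

# Plain absmax INT8: the spike sets the scale for every entry.
plain = dequantize(*quantize_int8_with_outliers(activations, threshold=float("inf")))
# Outlier-aware: the spike is stored separately, so the scale fits the inliers.
aware = dequantize(*quantize_int8_with_outliers(activations))

print("plain max error:", max_abs_error(activations, plain))
print("aware max error:", max_abs_error(activations, aware))
```

With the spike removed from the scale computation, the 127 integer levels cover only the range of the small values, so the inliers are reconstructed more precisely than in the plain variant.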

Key insights

Activation outliers explain language-specific FP8 quantization sensitivity in LLMs like Llama-3.

Method

The study compared INT8 with outlier handling and naive FP8 (E4M3) casting on Meta-Llama-3-8B for English and Brazilian Portuguese, analyzing perplexity and activation distributions.
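The perplexity comparison can be sketched as follows. Perplexity is the exponential of the mean negative log-likelihood per token, and the reported figures are relative increases over the unquantized baseline. The function names and the log-probability values here are hypothetical placeholders, not data from the study.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


def relative_increase(baseline_ppl, quantized_ppl):
    """Fractional change in perplexity relative to the baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


# Made-up per-token log-probabilities for the same text under two models.
baseline_logprobs = [-1.9, -2.3, -0.7, -3.1]
quantized_logprobs = [-2.0, -2.6, -0.8, -3.2]

base = perplexity(baseline_logprobs)
quant = perplexity(quantized_logprobs)
print(f"baseline {base:.3f}, quantized {quant:.3f}, "
      f"increase {relative_increase(base, quant):.1%}")
```

Running the same evaluation text through the baseline and quantized models, once per language, yields one relative increase per language, which is the quantity the summary reports.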

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.