Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

This systematic empirical study examines subword tokenizers for multilingual large language models (LLMs). It addresses tokenizer biases that favor high-resource languages and Latin scripts, which inflate inference costs and widen capability gaps for underrepresented languages. Using controlled training of 1.5B-parameter language models, the study compares several equitable tokenizers on a unified benchmark covering 11 Southeast Asian languages. Parity-aware BPE offers strong compression parity at competitive cost, placing it on the Pareto frontier of efficiency and equity. Morphology-Driven Byte Encoding yields better semantic reasoning through richer morphological representations, but at higher computational expense. Byte Latent Transformer underperforms on downstream tasks. The study concludes that cross-lingual fairness and tokenization efficiency are not inherently conflicting goals.

Key takeaway

If you are designing multilingual LLMs, your tokenizer choice strongly affects fairness and efficiency for underrepresented languages. Prefer Parity-aware BPE for a strong balance of compression and equity, or choose Morphology-Driven Byte Encoding when semantic reasoning matters more than computational cost, especially when targeting Southeast Asian languages. Avoid Byte Latent Transformer in low-resource training scenarios, as it underperforms on downstream tasks.

Key insights

Cross-lingual fairness and tokenization efficiency are not fundamentally at odds in multilingual LLMs.

Method

Systematic comparison of equitable tokenizers on a unified benchmark across 11 Southeast Asian languages, assessing compression, equity, and downstream task performance with 1.5B-parameter models.
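
The summary does not state which compression or equity metrics the study uses. As an illustration only, one common way to quantify tokenizer disparity is the token "premium": how many more tokens a language needs than a reference language for parallel text. The sketch below is a minimal version under stated assumptions. The function names (`byte_tokenize`, `token_counts`, `premiums`), the choice of English as the reference, the byte-level stand-in tokenizer, and the tiny parallel sample are all hypothetical and not taken from the paper.

```python
import unicodedata
from typing import Callable, Dict, List


def byte_tokenize(text: str) -> List[int]:
    """Toy stand-in tokenizer (assumption): one token per UTF-8 byte.

    NFC normalization makes the byte count independent of how
    accented characters happen to be encoded in the input.
    """
    return list(unicodedata.normalize("NFC", text).encode("utf-8"))


def token_counts(
    parallel: Dict[str, List[str]],
    tokenize: Callable[[str], list],
) -> Dict[str, int]:
    """Total tokens per language over a set of parallel sentences."""
    return {
        lang: sum(len(tokenize(sentence)) for sentence in sentences)
        for lang, sentences in parallel.items()
    }


def premiums(counts: Dict[str, int], reference: str = "en") -> Dict[str, float]:
    """Token count of each language divided by the reference language's count."""
    ref = counts[reference]
    return {lang: n / ref for lang, n in counts.items()}


# Tiny illustrative parallel sample ("thank you"); not data from the study.
parallel = {
    "en": ["Thank you"],
    "id": ["Terima kasih"],
    "vi": ["Cảm ơn"],
    "th": ["ขอบคุณ"],
}

counts = token_counts(parallel, byte_tokenize)
prem = premiums(counts, reference="en")

for lang, p in sorted(prem.items(), key=lambda kv: kv[1]):
    print(f"{lang}: {counts[lang]} tokens, {p:.2f}x relative to en")

# A single-number disparity summary: the worst-case premium across languages.
print("worst-case premium:", max(prem.values()))
```

To compare real tokenizers, pass any callable that maps a string to a token list in place of `byte_tokenize`. A tokenizer with better compression parity would push every premium closer to 1.0.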

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.