BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base
Summary
BrahmicTokenizer-131K is a 131,072-vocabulary byte-level BPE tokenizer designed as a drop-in replacement for OpenAI's o200k_base. It aims to close the compression gap on Brahmic-script languages while preserving strong performance on English, EU languages, and code. The tokenizer was built with a two-stage retrofit: first, a script-prune crop reduced the vocabulary from 200,019 to 131,072 tokens by removing nine out-of-scope writing systems; second, 2,372 corpus-dead vocabulary slots were reassigned to Brahmic Unicode blocks using linear programming. On 27 million Indic documents (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m, with Odia compression improving by 76.79%. It matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and outperforms Tekken/Sarvam-m by 4.0-14.2% on the HumanEval, MBPP, and GSM8K benchmarks. At its 131K budget it is reported to be the only tokenizer that stays competitive across Brahmic, English, EU, code, and math, whereas specialist Indic tokenizers sacrifice non-Indic performance. It is released under Apache 2.0.
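The figures in the summary can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted above; the helper name `relative_change` is illustrative and not taken from the original article.

```python
# Figures quoted in the summary above.
O200K_VOCAB = 200_019     # o200k_base vocabulary size
TARGET_VOCAB = 131_072    # BrahmicTokenizer-131K vocabulary size
REPLACED_SLOTS = 2_372    # corpus-dead slots reassigned in stage 2


def relative_change(new: float, baseline: float) -> float:
    """Signed relative change of `new` against `baseline` (0.05 means +5%)."""
    return (new - baseline) / baseline


# Stage 1: how many tokens the script-prune crop removes.
removed_by_crop = O200K_VOCAB - TARGET_VOCAB
crop_share = removed_by_crop / O200K_VOCAB

# Stage 2: share of the final vocabulary that was reassigned.
replaced_share = REPLACED_SLOTS / TARGET_VOCAB

# English fertility gap between the new tokenizer and o200k_base.
english_gap = relative_change(1.235, 1.232)

print(f"tokens removed by crop: {removed_by_crop} ({crop_share:.2%})")
print(f"slots reassigned:       {REPLACED_SLOTS} ({replaced_share:.2%})")
print(f"English fertility gap:  {english_gap:+.2%}")
```

Read this way, the crop removes roughly a third of the original vocabulary, the reassigned slots are under 2% of the final one, and the English fertility gap is about a quarter of a percent.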
Key takeaway
Machine learning engineers building multilingual large language models that must handle Indic languages, English, EU languages, and code can integrate this Apache 2.0 licensed tokenizer as a replacement for o200k_base. It compresses Brahmic-script text better than Tekken / Sarvam-m without sacrificing performance in the other domains, unlike specialist Indic tokenizers. This can lead to more compact models and improved inference efficiency for linguistically diverse applications.
Key insights
BrahmicTokenizer-131K offers balanced, high-performance tokenization across Indic languages, English, EU languages, and code at a 131K vocabulary budget.
Principles
- Tokenizer optimization can target specific language groups.
- Retrofitting vocabulary slots improves multilingual compression.
- Generalist tokenizers can outperform specialists across domains.
Method
A two-stage retrofit process: (1) a script-prune crop that reduces the vocabulary from 200,019 to 131,072 tokens by removing nine out-of-scope writing systems, then (2) replacement of 2,372 corpus-dead vocabulary slots with Brahmic Unicode blocks using linear programming.
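The two stages can be illustrated on a toy vocabulary. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not name the nine removed scripts, so `OUT_OF_SCOPE_PREFIXES` is a hypothetical stand-in, and the linear-programming allocation itself is not reproduced here. The sketch only shows how tokens might be filtered by script and how corpus-dead slots might be identified.

```python
import unicodedata
from collections import Counter

# Hypothetical stand-ins: the summary does not list the nine removed scripts.
OUT_OF_SCOPE_PREFIXES = ("CJK", "HANGUL", "HIRAGANA", "KATAKANA", "THAI")


def uses_out_of_scope_script(token: bytes) -> bool:
    """True if any decodable character belongs to an out-of-scope script."""
    # Byte-level BPE tokens may be partial UTF-8; undecodable bytes are skipped.
    text = token.decode("utf-8", errors="ignore")
    return any(
        unicodedata.name(ch, "").startswith(OUT_OF_SCOPE_PREFIXES) for ch in text
    )


def script_prune(vocab: dict[bytes, int]) -> dict[bytes, int]:
    """Stage 1 (toy): drop out-of-scope tokens and renumber, keeping rank order."""
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    kept = [tok for tok, _ in ordered if not uses_out_of_scope_script(tok)]
    return {tok: new_id for new_id, tok in enumerate(kept)}


def corpus_dead_ids(vocab: dict[bytes, int], encoded_corpus: list[list[int]]) -> list[int]:
    """Stage 2 (toy): ids that never occur in the encoded corpus."""
    counts = Counter(tid for doc in encoded_corpus for tid in doc)
    return sorted(tid for tid in vocab.values() if counts[tid] == 0)


toy_vocab = {
    b"the": 0,
    "中".encode("utf-8"): 1,   # CJK, pruned
    b" code": 2,
    "क".encode("utf-8"): 3,    # Devanagari, kept
    "한".encode("utf-8"): 4,   # Hangul, pruned
}
pruned = script_prune(toy_vocab)
dead = corpus_dead_ids(pruned, encoded_corpus=[[0, 1], [0]])
print(pruned)
print(dead)
```

In this toy run the dead slot is the one that would be a candidate for reassignment. A real retrofit would also have to keep the BPE merge table consistent after renumbering, which this sketch ignores.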
In practice
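- Use BrahmicTokenizer-131K for multilingual LLMs that need Indic, English, EU, and code coverage.
- Evaluate tokenizers on benchmarks that span diverse languages and domains, not a single one.
- Consider reallocating unused (corpus-dead) vocabulary slots to under-served scripts.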
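One way to act on the evaluation advice above is a small per-domain fertility report. The sketch below assumes fertility is measured as tokens per whitespace-separated word, matching the tokens/word unit in the summary. The two toy tokenizers and the sample texts are placeholders; in practice they would be replaced by real tokenizer `encode` functions and real evaluation corpora.

```python
from typing import Callable

Tokenizer = Callable[[str], list]


def fertility(tokenize: Tokenizer, texts: list[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    tokens = sum(len(tokenize(text)) for text in texts)
    words = sum(len(text.split()) for text in texts)
    if words == 0:
        raise ValueError("no words in sample")
    return tokens / words


def compare(
    baseline: Tokenizer, candidate: Tokenizer, samples: dict[str, list[str]]
) -> dict[str, dict[str, float]]:
    """Per-domain fertility for both tokenizers plus the candidate's token reduction."""
    report = {}
    for domain, texts in samples.items():
        base = fertility(baseline, texts)
        cand = fertility(candidate, texts)
        report[domain] = {
            "baseline": base,
            "candidate": cand,
            "token_reduction": 1 - cand / base,
        }
    return report


# Placeholder tokenizers: one token per non-space character vs one per word.
def char_tokenizer(text: str) -> list:
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text: str) -> list:
    return text.split()


samples = {
    "english": ["the cat sat"],
    "hindi": ["नमस्ते दुनिया"],
    "code": ["def f(x): return x"],
}
report = compare(char_tokenizer, word_tokenizer, samples)
for domain, row in report.items():
    print(domain, row)
```

Reporting each domain separately makes regressions visible: a tokenizer that improves the Indic row while worsening the English or code row is a specialist, not a balanced replacement.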
Topics
- Brahmic languages
- Tokenization
- Large Language Models
- Byte-level BPE
- o200k_base
- Multilingual NLP
- Hugging Face
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.