BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, quick

Summary

BrahmicTokenizer-131K is a byte-level BPE tokenizer with a 131,072-entry vocabulary, designed as a drop-in replacement for OpenAI's o200k_base. It aims to close the compression gap for Brahmic-script languages while preserving strong performance on English, EU languages, and code. The tokenizer was constructed through a two-stage retrofit. First, a script-prune crop reduced the vocabulary from 200,019 to 131,072 tokens by removing nine out-of-scope writing systems. Second, 2,372 corpus-dead vocabulary slots were replaced with Brahmic Unicode blocks using linear programming.

On 27 million Indic documents (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m, with Odia compression improving by 76.79%. It matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and outperforms Tekken / Sarvam-m by 4.0-14.2% on the HumanEval, MBPP, and GSM8K benchmarks. Unlike specialist Indic tokenizers, which sacrifice non-Indic performance, it stays competitive across Brahmic, English, EU, code, and math at its 131K budget. It is released under Apache 2.0.
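The fertility and token-reduction figures in the summary are simple ratios, and a minimal sketch of how such metrics are commonly computed may help when comparing tokenizers on your own corpus. Everything below is an assumption for illustration: the function names are hypothetical, words are counted by whitespace splitting (the summary does not say how the paper segments words), and `toy_encode` is a stand-in byte-level encoder, not BrahmicTokenizer-131K.

```python
from typing import Callable, Iterable, List


def fertility(encode: Callable[[str], List[int]], docs: Iterable[str]) -> float:
    """Average number of tokens per whitespace-delimited word over a corpus."""
    total_tokens = 0
    total_words = 0
    for doc in docs:
        total_tokens += len(encode(doc))
        total_words += len(doc.split())  # assumption: whitespace word boundaries
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def token_reduction(tokens_new: int, tokens_baseline: int) -> float:
    """Fraction of tokens saved versus a baseline (0.267 means 26.7% fewer)."""
    if tokens_baseline <= 0:
        raise ValueError("baseline token count must be positive")
    return 1.0 - tokens_new / tokens_baseline


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in encoder: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# Relative gap between the two English fertilities quoted in the summary.
english_gap = (1.235 - 1.232) / 1.232

# Fertility of the stand-in encoder on a tiny two-document corpus.
toy_fertility = fertility(toy_encode, ["ab cd", "नमस्ते दुनिया"])
```

To compare real tokenizers, pass each tokenizer's own encode function as `encode` and run both over the same documents, so the word count in the denominator is identical.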

Key takeaway

For machine learning engineers building multilingual large language models that need efficient tokenization across Indic languages, English, EU languages, and code, BrahmicTokenizer-131K is worth evaluating as a replacement for o200k_base. It compresses Brahmic-script text better than Mistral-Nemo Tekken / Sarvam-m without giving up performance on the other domains, a trade-off that specialist Indic tokenizers tend to make. Fewer tokens per document means shorter sequences and lower inference cost, and a 131K vocabulary needs a smaller embedding table than a 200K one. The tokenizer is released under Apache 2.0.

Key insights

BrahmicTokenizer-131K offers balanced tokenization performance across Indic languages, English, EU languages, and code at a 131K vocabulary budget.

Principles

Method

A two-stage retrofit: (1) a script-prune crop that reduces the vocabulary from 200,019 to 131,072 tokens by removing nine out-of-scope writing systems, then (2) replacement of 2,372 corpus-dead vocabulary slots with Brahmic Unicode blocks using linear programming.
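As a rough illustration of the two ideas in this method, the sketch below filters a vocabulary by script and then finds slots that never occur in an encoded corpus. It is not the authors' implementation. The code point ranges are illustrative guesses, since the summary does not name the nine removed writing systems, and all function names are hypothetical. A real crop would also have to keep the BPE merge table consistent and land on exactly 131,072 entries. The linear-programming step is omitted because the summary does not describe its formulation.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

# ASSUMPTION: illustrative out-of-scope ranges only; the source does not
# list which nine writing systems were actually removed.
OUT_OF_SCOPE_RANGES: List[Tuple[int, int]] = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0x0E00, 0x0E7F),  # Thai
]


def is_out_of_scope(token_bytes: bytes) -> bool:
    """True if the token decodes to text containing an out-of-scope code point."""
    try:
        text = token_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Partial UTF-8 sequences are kept so byte-level coverage survives.
        return False
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in OUT_OF_SCOPE_RANGES)


def script_prune(vocab: Sequence[bytes]) -> List[bytes]:
    """Stage 1 sketch: drop tokens written in out-of-scope scripts."""
    return [tok for tok in vocab if not is_out_of_scope(tok)]


def corpus_dead_slots(vocab_size: int, encoded_docs: Iterable[Sequence[int]]) -> List[int]:
    """Stage 2 precursor: token ids that never appear in the encoded corpus."""
    counts: Counter = Counter()
    for ids in encoded_docs:
        counts.update(ids)
    return [token_id for token_id in range(vocab_size) if counts[token_id] == 0]


# Tiny demonstration with a five-entry toy vocabulary.
toy_vocab = [b"hello", "中".encode("utf-8"), "한".encode("utf-8"),
             "क".encode("utf-8"), b"\xe4\xb8"]
kept = script_prune(toy_vocab)
dead = corpus_dead_slots(5, [[0, 1], [1, 3]])
```

The dead slots found this way are only candidates for reuse. Choosing which Brahmic tokens fill them is the optimization step that the source attributes to linear programming.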

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.