MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

MinGram, a minimalist Unigram tokenizer, simplifies the traditionally heavy training of Unigram tokenizers while keeping their token-list representation. Training uses a BPE-derived seed vocabulary, Hard EM on a minimum-token path, and a single flat score-pruning step; this removes the suffix array, the forward-backward pass, and the iterative prune loop of standard Unigram training. Because segmentation minimizes token count first and uses the Unigram score only as a tiebreaker, MinGram compresses better than both BPE and standard Unigram across six languages. A compression-oriented variant matches the strongest token-count compressors while retaining substantially higher morphological alignment. In downstream language-model training, Unigram-family tokenizers, including MinGram, consistently outperform BPE in bits-per-byte.
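
Bits-per-byte, the downstream metric mentioned above, normalizes a model's loss by the byte length of the text rather than by token count, so tokenizers with different vocabularies can be compared on equal footing. A minimal sketch of the standard definition follows; the function name and arguments are illustrative assumptions, and the paper's exact evaluation setup is not described in this summary.

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte.

    total_nll_nats is assumed to be the sum of per-token losses over `text`,
    as reported by a language model that uses natural-log cross-entropy.
    """
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)
```

Because the denominator is the byte count, a tokenizer that emits fewer tokens does not automatically get a better score; only the total loss over the same text matters.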

Key takeaway

For machine learning engineers and NLP scientists selecting a tokenizer for language-model pre-training or text compression, MinGram is worth evaluating. It compresses better than BPE and standard Unigram while keeping morphological alignment competitive, and its training pipeline is simpler than standard Unigram training. In downstream language-model training, Unigram-family tokenizers, including MinGram, achieve better bits-per-byte than BPE.

Key insights

MinGram simplifies Unigram tokenizer training while achieving superior compression and competitive morphological alignment.

Principles

Method

MinGram's training uses a BPE-derived seed, Hard EM on a minimum-token path, and a single flat score-pruning step.
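
The three steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the function names (`segment`, `hard_em_step`, `prune_flat`), the `unk_logp` fallback, the smoothing `floor`, and the hand-picked seed (standing in for a BPE-derived one) are all hypothetical.

```python
import math
from collections import Counter

def segment(text, logp, max_len=16, unk_logp=-20.0):
    """Minimum-token path; summed log-prob only breaks ties between equal-length paths."""
    n = len(text)
    # best[i] = (token_count, negated_total_logp, start_of_last_token) for text[:i]
    best = [(0, 0.0, -1)] + [None] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logp:
                score = logp[piece]
            elif end - start == 1:
                score = unk_logp  # assumption: unknown single characters stay usable
            else:
                continue
            count, neg, _ = best[start]
            cand = (count + 1, neg - score, start)
            # Lexicographic comparison: fewer tokens first, then higher total score.
            if best[end] is None or cand[:2] < best[end][:2]:
                best[end] = cand
    tokens, i = [], n
    while i > 0:  # backtrack through stored start positions
        start = best[i][2]
        tokens.append(text[start:i])
        i = start
    return tokens[::-1]

def hard_em_step(word_freqs, logp, floor=1e-3):
    """Hard EM: count tokens on the single best path, then re-estimate log-probs."""
    counts = Counter()
    for word, freq in word_freqs.items():
        for tok in segment(word, logp):
            counts[tok] += freq
    denom = sum(counts[t] for t in logp) + floor * len(logp)
    return {t: math.log((counts[t] + floor) / denom) for t in logp}

def prune_flat(logp, target_size):
    """One flat pruning pass: keep all single characters plus the top-scoring tokens."""
    singles = {t for t in logp if len(t) == 1}
    multi = sorted((t for t in logp if len(t) > 1), key=lambda t: logp[t], reverse=True)
    keep = singles | set(multi[: max(0, target_size - len(singles))])
    return {t: logp[t] for t in keep}

# Toy corpus and seed vocabulary (hypothetical data).
word_freqs = {"lower": 5, "lowest": 3, "newer": 4, "low": 6}
seed = set("".join(word_freqs)) | {"lo", "low", "er", "est", "new", "lower", "we"}
logp = {t: math.log(1 / len(seed)) for t in seed}
for _ in range(3):
    logp = hard_em_step(word_freqs, logp)
vocab = prune_flat(logp, target_size=13)
print(segment("lowest", vocab))
```

Comparing `(token_count, negated_score)` tuples is what makes the Unigram score a pure tiebreaker: a path with fewer tokens always wins, however low its score.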

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.