Your LLM Has Never Read a Single Word — How Tokenization Grinds Text Into Numbers
Summary
Large Language Models (LLMs) like GPT-4, Claude, and Llama do not process text as words or letters. They process sequences of numeric token IDs produced by a step called tokenization. The article likens this step to an "industrial meat grinder" that turns rich linguistic structure into uniform numeric blocks. The IDs can be decoded back into text, but the model itself only ever sees whole-token IDs, which causes "character blindness": it has no direct access to the individual characters inside a token. Historically, NLP engineers tried word-level tokenization, which broke down on out-of-vocabulary words (the "[UNK] catastrophe"), and character-level tokenization, which made sequences impractically long (the "sequence explosion"). Subword tokenization emerged as the middle ground and is used by virtually all modern LLMs. Two dominant algorithms are BPE (Byte-Pair Encoding), which repeatedly merges the most frequent adjacent symbol pairs and is used by GPT and Llama, and WordPiece, which picks merges that most increase the likelihood of the training data and is used by BERT. Tokenization has two practical consequences: the "strawberry paradox," where models fail character-level tasks such as counting the letter "r" in "strawberry," and a "tokenization tax," where non-English text is split into more tokens because tokenizers are trained on English-centric corpora, raising cost and latency.
Key takeaway
For AI Product Managers developing multilingual applications: tokenization directly affects cost and performance. API budgets for non-English languages, especially those underrepresented in the tokenizer's training data such as Hindi or Polish, must account for the "tokenization tax," which can increase costs by 1.6x to 3x. Also avoid relying on LLMs for character-level operations unless you add a workaround such as Chain of Thought prompting (for example, having the model spell the word out first), which costs extra output tokens and latency.
Key insights
LLMs process numeric tokens, not words or letters, leading to inherent limitations like character blindness and variable processing costs across languages.
Principles
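- Tokenization is lossy from the model's point of view: token IDs decode back to the original text, but the model does not see the characters inside a token.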
- Subword tokenization balances vocabulary size and sequence length.
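The trade-off behind these principles can be seen in a toy comparison of the two older approaches. This is a minimal sketch, not taken from the original article: the function names and the three-word vocabulary are hypothetical, and real word-level tokenizers also handle punctuation and casing.

```python
def word_level_encode(text, vocab):
    """Split on whitespace; any word outside the vocabulary becomes [UNK]."""
    return [word if word in vocab else "[UNK]" for word in text.split()]


def char_level_encode(text):
    """Treat every character (including spaces) as its own token."""
    return list(text)


if __name__ == "__main__":
    vocab = {"the", "cat", "sat"}  # hypothetical tiny vocabulary
    sentence = "the cat tokenizes"

    # Word level: short sequence, but the unseen word is lost entirely.
    print(word_level_encode(sentence, vocab))

    # Character level: nothing is lost, but the sequence is much longer.
    print(len(char_level_encode(sentence)))
```

Subword tokenizers sit between these two extremes: frequent words stay as single tokens, while unseen words fall back to smaller known pieces instead of `[UNK]`.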
Method
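Subword tokenization algorithms like BPE (frequency-based merging) and WordPiece (likelihood-based merging) build a vocabulary bottom-up from characters or bytes. Frequent words end up as single tokens, while rare words are split into smaller subunits that are already in the vocabulary.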
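The frequency-based merging that BPE performs can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the tokenizer any particular model ships with: it works on characters rather than bytes, skips pre-tokenization, and breaks ties by first occurrence. The function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    # Start from single characters; identical words are counted once with a frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (first seen wins on ties).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_merges(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    rules = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
    print(rules)
    print(apply_merges("lowly", rules))
```

An unseen word such as "lowly" is never mapped to an unknown token here: the learned merges cover its frequent prefix, and the remainder falls back to single characters.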
In practice
- Use `tiktoken` to inspect GPT-4 tokenization.
- Account for the "tokenization tax" (roughly 1.6x to 3x) when budgeting multilingual API usage.
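Both practices can be combined in a short script. The sketch below assumes the third-party `tiktoken` package is installed (`pip install tiktoken`) and that it can fetch its encoding files on first use. The helper names, the token volume, and the price are hypothetical placeholders, not real pricing; only the 1.6x to 3x multiplier range comes from the summary above.

```python
def inspect_tokens(text, model="gpt-4"):
    """Return (token_id, token_bytes) pairs for `text` using the model's encoding."""
    import tiktoken  # imported lazily so the budgeting helper works without it

    encoding = tiktoken.encoding_for_model(model)
    token_ids = encoding.encode(text)
    # Show the raw bytes behind each ID; a token may cover part of a character.
    return [(tid, encoding.decode_single_token_bytes(tid)) for tid in token_ids]


def estimate_cost(english_tokens, price_per_1k_tokens, token_tax=1.0):
    """Estimate cost when the same content needs `token_tax` times more tokens."""
    return english_tokens * token_tax * price_per_1k_tokens / 1000


def budget_range(english_tokens, price_per_1k_tokens, low_tax=1.6, high_tax=3.0):
    """Return (low, high) cost estimates for the multiplier range cited above."""
    return (
        estimate_cost(english_tokens, price_per_1k_tokens, low_tax),
        estimate_cost(english_tokens, price_per_1k_tokens, high_tax),
    )


if __name__ == "__main__":
    # Hypothetical workload: 1M English-equivalent tokens at a placeholder price.
    low, high = budget_range(1_000_000, price_per_1k_tokens=0.01)
    print(f"Non-English budget range: {low:.2f} to {high:.2f}")

    # Requires tiktoken; compare token counts for the same sentence in two languages.
    for sample in ["strawberry", "truskawka"]:
        print(sample, inspect_tokens(sample))
```

For a real budget, measure the multiplier on a sample of your own content in each target language rather than relying on the general range.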
Topics
- Tokenization
- Subword Tokenization
- Byte-Pair Encoding
- WordPiece
- Large Language Models
Best for: Machine Learning Engineer, NLP Engineer, AI Product Manager
Editorial summary, takeaway, and curation by AIssential. Original article published by Naturallanguageprocessing on Medium.