Your LLM Has Never Read a Single Word — How Tokenization Grinds Text Into Numbers

2026-03-05 · Source: Naturallanguageprocessing on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

Large Language Models (LLMs) such as GPT-4, Claude, and Llama do not process text as words or letters. They process numeric tokens produced by a step called tokenization. The article likens this step to an "industrial meat grinder" that turns rich linguistic structure into uniform numeric blocks and cannot be undone from the model's side. The result is "character blindness": the model has no direct access to the individual characters inside a token. Historically, NLP engineers tried word-level tokenization, which suffered from the "[UNK] catastrophe" (any word outside the vocabulary collapses into one unknown token), and character-level tokenization, which suffered from "sequence explosion" (sequences become far longer). Subword tokenization emerged as the golden mean and is what modern LLMs use. Two algorithms dominate: BPE (Byte-Pair Encoding), which repeatedly merges the most frequent adjacent symbol pairs and is used by GPT and Llama, and WordPiece, which picks merges that most improve the likelihood of the training data and is used by BERT. Tokenization also explains the "strawberry paradox," where models fail character-level tasks such as counting letters, and the "tokenization tax," where non-English text costs more and runs slower because tokenizers trained on English-centric corpora split it into more tokens.

Key takeaway

AI Product Managers building multilingual applications should treat tokenization as a direct driver of cost and performance. API budgets for non-English languages, especially those written in non-Latin scripts or with many diacritics, such as Hindi or Polish, must account for the "Token Tax," which can raise costs by 1.6x to 3x. Also avoid relying on LLMs for character-level operations unless you add Chain of Thought workarounds, which cost extra compute time.

Key insights

LLMs process numeric tokens, not words or letters, leading to inherent limitations like character blindness and variable processing costs across languages.
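
To make character blindness concrete, here is a minimal sketch using a made-up vocabulary and a greedy longest-match encoder. Both `VOCAB` and `encode` are hypothetical illustrations, not the tokenizer of any real model, and real encoders apply learned merge rules rather than a simple longest-match scan.

```python
# Hypothetical toy vocabulary. Real vocabularies hold tens of thousands of entries.
VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def encode(text, vocab=VOCAB):
    """Greedy longest-match encoding (a simplification for illustration)."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


ids = encode("strawberry")
# The model receives only the IDs for "straw" and "berry".
# Nothing in those two integers says how many times "r" occurs.
print(ids)
print("strawberry".count("r"))  # trivial on the raw string, invisible in the IDs
```

The string-level count is a one-liner, but the model never sees the string. It sees two integers whose spelling it has to have memorized during training.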

Principles

Tokenization is lossy from the model's point of view: character-level detail inside a token is not directly visible.
Subword tokenization balances vocabulary size and sequence length.

Method

Subword tokenization algorithms such as BPE (frequency-based merging) and WordPiece (likelihood-based merging) build their vocabularies bottom-up by repeatedly merging smaller units. Common words end up as single tokens, while rare words are split into smaller subword units.
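
The frequency-based merging behind BPE can be sketched in a few lines. This is a toy training loop under simplifying assumptions (character-level start symbols, no byte fallback, no end-of-word marker, ties broken by first occurrence). The function names and the four-word corpus are illustrative and do not come from the original article.

```python
from collections import Counter


def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} mapping."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break  # every word is already a single symbol
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus


word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = train_bpe(word_freqs, num_merges=2)
print(merges)  # the learned merge rules, in order
print(corpus)  # words re-expressed with merged symbols
```

On this corpus the shared suffix "est" is assembled in two merges, which is how frequent fragments become single tokens. WordPiece, as it is commonly described, follows the same loop but ranks candidate pairs by a likelihood-style score (pair frequency divided by the product of the two parts' frequencies) instead of raw frequency.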

In practice

Use `tiktoken` to inspect GPT-4 tokenization.
Account for "Token Tax" in multilingual API budgeting.
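
Both practices above can be combined in a small budgeting sketch. The `tiktoken` calls (`encoding_for_model`, `encode`) are part of that library's public API. Everything else here is an assumption for illustration: the helper names, the request volume, and the price per 1,000 tokens are hypothetical, and `tiktoken` is a third-party package that may need to fetch its vocabulary on first use.

```python
def count_tokens(text, model="gpt-4"):
    """Count tokens with tiktoken (third-party: pip install tiktoken)."""
    import tiktoken  # imported lazily so the budgeting helpers work without it

    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))


def token_tax(baseline_tokens, target_tokens):
    """Ratio of target-language tokens to baseline (e.g. English) tokens."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return target_tokens / baseline_tokens


def monthly_cost(requests, tokens_per_request, price_per_1k_tokens, tax=1.0):
    """Estimated spend; `tax` scales the baseline token count per request."""
    total_tokens = requests * tokens_per_request * tax
    return total_tokens / 1000 * price_per_1k_tokens


# Hypothetical numbers: 1,000 requests of 500 English-equivalent tokens each,
# at an assumed price of 0.01 per 1,000 tokens.
english = monthly_cost(1000, 500, 0.01)
worst_case = monthly_cost(1000, 500, 0.01, tax=3.0)  # upper end of the 1.6x to 3x range
print(english, worst_case)
```

In practice you would measure the tax rather than assume it: run `count_tokens` on the same content in English and in the target language, pass both counts to `token_tax`, and feed the resulting ratio into `monthly_cost`.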

Topics

Tokenization
Subword Tokenization
Byte-Pair Encoding
WordPiece
Large Language Models

Best for: Machine Learning Engineer, NLP Engineer, AI Product Manager

Editorial summary, takeaway, and curation by AIssential. Original article published by Naturallanguageprocessing on Medium.