What is an LLM? Tokens, Embeddings, and the Big Picture

Source: MLWhiz: Recs|ML|GenAI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

This analysis gives a big-picture view of Large Language Models (LLMs), defining them as functions that map a list of integer token IDs to a probability distribution over the next token. It explains tokenization, the step that turns raw text into those integer sequences, and shows how it shapes LLM behavior and causes common failure modes such as character-counting errors. The discussion covers Unicode, UTF-8 byte encoding, and the Byte Pair Encoding (BPE) algorithm, contrasting character-level BPE with the byte-level BPE used in models like GPT-2, GPT-4, and Llama. It also compares the tokenizer libraries SentencePiece and tiktoken, noting that both implement BPE but differ in speed and in specific design choices. A comparative table traces the evolution of tokenizers across major models and shows a trend toward larger vocabularies (GPT-4: 100,277 tokens; Llama 3: 128,000; GPT-4o: 200,019; Gemma: 256,000 to 262,000), which improves multilingual efficiency and reduces inference cost.
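
The relationship between Unicode text and UTF-8 bytes mentioned above can be illustrated with a minimal sketch. It uses only Python's built-in `str.encode`; the helper name `utf8_byte_ids` and the sample strings are assumptions made for illustration and do not come from the original article.

```python
def utf8_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 encoding of text as a list of integers in 0..255."""
    return list(text.encode("utf-8"))


# Sample strings written with escapes so the code points are unambiguous:
# "hello", "h" + U+00E9 + "llo", two CJK characters, and one emoji.
samples = ["hello", "h\u00e9llo", "\u65e5\u672c", "\U0001F642"]

for s in samples:
    ids = utf8_byte_ids(s)
    # len(s) counts Unicode code points; len(ids) counts bytes.
    print(f"{s!r}: {len(s)} code points -> {len(ids)} bytes {ids}")
```

ASCII characters take one byte each, while accented letters, CJK characters, and emoji take two to four. A byte-level tokenizer starts from these 256 possible byte values, so any string can be represented without an "unknown" token.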

Key takeaway

For AI Engineers debugging unexpected LLM behavior, understanding tokenization is crucial. LLMs operate on subword tokens, not individual characters, which explains failures in tasks that require character-level awareness. When selecting a model, consider the tokenizer's vocabulary size and type (e.g., byte-level BPE), because both affect multilingual performance and cost, especially for non-English applications. This knowledge lets you anticipate model limitations and choose appropriate tools.

Key insights

LLMs operate on subword tokens, not characters, making tokenization a core determinant of their capabilities and predictable failure points.
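
A small sketch can make this concrete. The vocabulary below is entirely hypothetical, and the greedy longest-match split is a simplified stand-in for a real BPE tokenizer, not how any particular model tokenizes. It shows only that the model receives opaque integer IDs, so a question like "how many r's are in strawberry?" cannot be answered by inspecting the input directly.

```python
# Hypothetical vocabulary (token string -> integer ID), made up for illustration.
TOY_VOCAB = {
    "straw": 1001, "berry": 1002,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}


def toy_tokenize(word: str) -> list[int]:
    """Greedy longest-match split; a simplified stand-in for real BPE encoding."""
    ids, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return ids


ids = toy_tokenize("strawberry")
print(ids)                       # two IDs; the letters are no longer visible
print("strawberry".count("r"))   # the character-level answer the IDs hide
```

In this toy setup the ten-letter word becomes two IDs. Nothing in those two integers says how many times a given letter occurs, so the model has to have learned that fact about each token rather than reading it off the input.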

Method

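The BPE algorithm starts with a set of base units (characters or bytes), repeatedly merges the most frequent adjacent pair into a new unit, and stops when the vocabulary reaches a target size. The merges are saved in the order they were learned so that the same merges can be replayed, in that order, to tokenize new text at inference time.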
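
The steps above can be sketched as a toy byte-level BPE trainer. This is a minimal illustration under stated assumptions, not the implementation used by tiktoken, SentencePiece, or any production tokenizer: the function names (`train_bpe`, `merge_pair`, `encode`, `decode`) are hypothetical, the base vocabulary is assumed to be the 256 byte values, and the pre-tokenization and special-token handling of real tokenizers are omitted.

```python
from collections import Counter


def merge_pair(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of pair in ids with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, vocab_size: int) -> dict[tuple[int, int], int]:
    """Learn merges over the UTF-8 bytes of text until vocab_size is reached."""
    ids = list(text.encode("utf-8"))      # base units: byte values 0..255
    merges: dict[tuple[int, int], int] = {}  # insertion order = merge order
    next_id = 256
    while next_id < vocab_size and len(ids) >= 2:
        pairs = Counter(zip(ids, ids[1:]))   # count adjacent pairs
        pair, count = pairs.most_common(1)[0]
        if count < 2:                        # nothing repeats; stop early
            break
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges


def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Tokenize new text by replaying the saved merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids: list[int], merges: dict[tuple[int, int], int]) -> str:
    """Expand each token ID back into bytes and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


corpus = "low lower lowest low low"
merges = train_bpe(corpus, vocab_size=266)   # at most 10 merges
tokens = encode(corpus, merges)
print(len(corpus.encode("utf-8")), "bytes ->", len(tokens), "tokens")
print(decode(tokens, merges) == corpus)
```

Because the merge table is an ordered mapping, `encode` reproduces on new text exactly the merges that training applied, which is why the merge order has to be stored along with the vocabulary. Text containing bytes never seen during training still encodes, since it falls back to the 256 base byte tokens.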

Best for: AI Engineer, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by MLWhiz: Recs|ML|GenAI.