Yes it's just doing compression. No it's not the diss you think it is.

2023-06-05 · Source: when trees fall... · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, long

Summary

This article argues that large language models (LLMs) trained to maximize log-likelihood are fundamentally performing compression, and that this is often mistaken for a limitation. It explains that next-word prediction, the core training objective of LLMs, amounts to minimizing the negative log-likelihood, which Shannon's source coding theorem ties directly to the shortest achievable average code length. The author illustrates this with Huffman coding, showing how knowing word frequencies yields a shorter average bit-length per word. The piece also challenges the notion that compression is antithetical to understanding, citing figures such as Marcus Hutter and Gregory Chaitin, who link comprehension to data compression. It suggests that a program able to compress complex information very well, such as Wikipedia's microeconomics pages, by deriving underlying principles like "supply" and "demand" could be considered to exhibit understanding.

Key takeaway

If you are an AI researcher evaluating language model capabilities, recognize that optimizing for next-word prediction is inherently a compression task. This perspective reframes the "blurry JPEG" criticism: effective compression, especially of structured material like arithmetic or economic principles, can imply a form of understanding. Explore how compression benchmarks, like the Hutter Prize, can serve as a proxy for a model's ability to generalize and derive underlying rules from data.

Key insights

Optimizing language models for next-word prediction via log-likelihood is a form of data compression, which can be linked to understanding.

Principles

Maximizing log-likelihood optimizes for compression.
Compression and understanding are not antithetical.
More predictable text has a shorter encoding: an event with probability p ideally costs -log2(p) bits.
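
To make the link between likelihood and code length concrete, here is a minimal Python sketch. It assumes the standard Shannon result that an event with probability p has an ideal code length of -log2(p) bits; the function names and the two probability lists are made up for illustration and do not come from the original article.

```python
import math

def ideal_code_length_bits(p):
    """Ideal (Shannon) code length, in bits, for an event of probability p."""
    return -math.log2(p)

def average_bits_per_token(probs):
    """Average negative log2-likelihood of the probabilities a model
    assigned to the tokens that actually occurred."""
    return sum(ideal_code_length_bits(p) for p in probs) / len(probs)

# Hypothetical probabilities two models gave to the same four observed tokens.
confident = [0.5, 0.5, 0.25, 0.25]
uncertain = [0.125, 0.125, 0.125, 0.125]

print(average_bits_per_token(confident))  # 1.5
print(average_bits_per_token(uncertain))  # 3.0
```

The model that assigns higher probability to what actually happens has both the higher log-likelihood and the shorter average encoding, which is the sense in which the two objectives coincide.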

Method

Language models can be used for text compression by predicting word distributions, constructing Huffman trees based on these predictions, and encoding/decoding words using the generated trees.
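
The method above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original author's code: `toy_predict` is a hypothetical stand-in for a language model, its probabilities are invented, and the decoder is assumed to know the number of words in advance.

```python
import heapq
from itertools import count

def build_huffman_code(probs):
    """Return {word: bitstring} for a dict of word -> probability."""
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(p, next(tiebreak), {w: ""}) for w, p in probs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single symbol still needs one bit
        return {w: "0" for w in heap[0][2]}
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # two least likely subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {w: "0" + b for w, b in c0.items()}
        merged.update({w: "1" + b for w, b in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

def encode(words, predict):
    """Encode each word with a Huffman code built from the model's prediction."""
    bits, context = [], []
    for w in words:
        bits.append(build_huffman_code(predict(context))[w])
        context.append(w)
    return "".join(bits)

def decode(bits, n_words, predict):
    """Rebuild the same trees from the same predictions and read codes back."""
    words, pos = [], 0
    for _ in range(n_words):
        inverse = {b: w for w, b in build_huffman_code(predict(words)).items()}
        end = pos + 1
        while bits[pos:end] not in inverse:  # codes are prefix-free
            end += 1
            if end > len(bits):
                raise ValueError("bitstring does not match the model's codes")
        words.append(inverse[bits[pos:end]])
        pos = end
    return words

def toy_predict(context):
    """Hypothetical model: next-word distribution given the words so far."""
    if context and context[-1] == "supply":
        return {"and": 0.9, "demand": 0.05, "supply": 0.05}
    if context and context[-1] == "and":
        return {"demand": 0.9, "and": 0.05, "supply": 0.05}
    return {"supply": 0.5, "demand": 0.25, "and": 0.25}

text = ["supply", "and", "demand"]
bits = encode(text, toy_predict)
print(len(bits), decode(bits, len(text), toy_predict) == text)  # 3 True
```

Because encoder and decoder query the same model with the same context, they build identical trees at every step, so only the bits need to be transmitted. Huffman codes spend at least one bit per word, so a scheme like arithmetic coding would get closer to the ideal length when predictions are very confident.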

In practice

Use conditional probabilities for more efficient text encoding.
Explore neural networks for advanced text compression tasks.
Consider compression ratios as a proxy for model understanding.
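
As a rough illustration of the first point, the sketch below compares the ideal number of bits needed to encode a toy word sequence using plain word frequencies versus frequencies conditioned on the previous word. The corpus and function names are invented for this example, and the probabilities are fitted on the very text being encoded, so the cost of describing the model itself is ignored.

```python
import math
from collections import Counter, defaultdict

def unigram_bits(words):
    """Ideal bits when every word is coded from its overall frequency."""
    counts = Counter(words)
    return sum(-math.log2(counts[w] / len(words)) for w in words)

def bigram_bits(words):
    """Ideal bits when each word is coded given the previous word.
    The first word has no context, so it falls back to its unigram frequency."""
    counts = Counter(words)
    following = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        following[prev][cur] += 1
    bits = -math.log2(counts[words[0]] / len(words))
    for prev, cur in zip(words, words[1:]):
        p = following[prev][cur] / sum(following[prev].values())
        bits += -math.log2(p)
    return bits

words = "supply and demand and supply and demand".split()
print(round(unigram_bits(words), 2), round(bigram_bits(words), 2))
```

On this toy sequence the conditional model needs fewer bits, because "and" is fully predictable after "supply" or "demand". A real comparison would estimate the probabilities on held-out text.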

Topics

Language Models
Data Compression
Information Theory
Huffman Coding
AI Understanding

Best for: AI Researcher, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by when trees fall....