How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation
Summary
A diagnostic framework evaluates large language models' (LLMs) ability to process historical language, decomposing difficulty into tokenization cost, predictive uncertainty, semantic robustness, and context sensitivity. This framework was applied to a newly curated 17th-century Italian corpus (1610-1689), 19th-century Italian texts, and 18th-century Russian civil print books. Results show early modern Italian and Russian incur similar 25-30% tokenization inflation. However, 17th-century Italian is 2.4 times more surprising to LLMs than modern Italian, reaching 3.2 times for academic prose, while Russian shows only a modest increase. Despite this predictive uncertainty, embedding similarity remains robust (> 0.85), indicating models can represent historical meaning. A simple temporal context prompt reduces historical surprisal by approximately 60%, suggesting LLMs are suitable for semantic retrieval but require careful adaptation for generative tasks.
Key takeaway
For NLP engineers and research scientists deploying LLMs on historical language data: models incur a 25-30% tokenization "tax" and high surprisal (up to 3.2x for 17th-century academic prose), yet their semantic representations remain robust. You can confidently use LLMs for semantic retrieval tasks. Generative applications need more careful adaptation; a simple temporal context prompt, which reduces historical surprisal by approximately 60%, is one place to start.
Key insights
LLMs incur high surprisal and tokenization costs on historical Italian yet retain robust semantic representations, and a temporal context prompt mitigates much of the surprisal.
Principles
- Historical language difficulty decomposes into distinct dimensions.
- Encoding cost and comprehension are dissociated in LLM processing.
- LLMs maintain robust semantic representation despite generative instability.
Method
A diagnostic framework evaluates historical language processing difficulty by assessing tokenization cost, predictive uncertainty, semantic robustness, and context sensitivity.
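As a rough illustration of how three of these dimensions could be quantified, the sketch below uses common stand-in definitions: tokens per word for tokenization cost, mean negative log-probability for predictive uncertainty, and cosine similarity between embeddings for semantic robustness. These definitions are assumptions, since this summary does not state the article's exact formulas, and `toy_tokenize` is a hypothetical placeholder for a real model tokenizer.

```python
import math
from typing import Callable, Sequence


def tokens_per_word(text: str, tokenize: Callable[[str], Sequence[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def tokenization_inflation(historical: float, modern: float) -> float:
    """Relative increase in tokens per word over a modern baseline (0.25 = 25%)."""
    if modern <= 0:
        raise ValueError("modern baseline must be positive")
    return historical / modern - 1.0


def mean_surprisal_bits(token_logprobs: Sequence[float]) -> float:
    """Mean surprisal in bits, given natural-log probabilities of each token."""
    if not token_logprobs:
        raise ValueError("no token log-probabilities given")
    return -sum(token_logprobs) / (len(token_logprobs) * math.log(2))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def toy_tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: splits each word into chunks of up to 4 characters."""
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the article's corpus.
    modern = tokens_per_word("la casa era bella", toy_tokenize)
    historical = tokens_per_word("la magione era bellissima", toy_tokenize)
    print("inflation:", tokenization_inflation(historical, modern))
```

In a real measurement, `tokenize` would be the tokenizer of the model under study, and the log-probabilities and embeddings would come from that model. A surprisal ratio such as the reported 2.4x would then be the historical mean surprisal divided by the modern one.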
In practice
- Deploy LLMs for semantic retrieval tasks on historical texts.
- Apply minimal temporal context prompts to reduce historical surprisal.
- Carefully adapt generative LLM applications for historical content.
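A minimal sketch of what a temporal context prompt might look like, along with the arithmetic behind a relative surprisal reduction. The prompt wording and both function names are hypothetical; the summary does not give the exact prompt used in the article.

```python
def with_temporal_context(passage: str, period: str, language: str = "Italian") -> str:
    """Prepend a one-sentence note naming the period and language of the passage.

    The wording of the note is an assumption, not the article's prompt.
    """
    header = f"The following text is written in {period} {language}."
    return f"{header}\n\n{passage}"


def relative_reduction(baseline: float, prompted: float) -> float:
    """Fractional drop in surprisal after adding context (0.6 = 60% lower)."""
    if baseline <= 0:
        raise ValueError("baseline surprisal must be positive")
    return (baseline - prompted) / baseline


if __name__ == "__main__":
    prompt = with_temporal_context("Nel mezzo del cammin", "17th-century")
    print(prompt)
    # Made-up surprisal values, used only to show the calculation.
    print(relative_reduction(5.0, 2.0))
```

When comparing surprisal with and without the note, it seems sensible to score only the passage tokens and exclude the tokens of the prepended sentence, so that the two measurements cover the same text.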
Topics
- Large Language Models
- Historical Language Processing
- Tokenization Tax
- Predictive Surprisal
- Semantic Robustness
- Digital Libraries
- Context Prompting
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.