How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation

2026-06-25 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

A diagnostic framework evaluates how well large language models (LLMs) process historical language by decomposing difficulty into four dimensions: tokenization cost, predictive uncertainty, semantic robustness, and context sensitivity. The framework is applied to a newly curated 17th-century Italian corpus (1610-1689), 19th-century Italian texts, and 18th-century Russian civil print books. Early modern Italian and Russian incur similar tokenization inflation of 25-30%. Predictive uncertainty diverges, however: 17th-century Italian is 2.4 times more surprising to LLMs than modern Italian, rising to 3.2 times for academic prose, while Russian shows only a modest increase. Despite this uncertainty, embedding similarity remains high (> 0.85), indicating that models can still represent historical meaning. A simple temporal context prompt reduces historical surprisal by approximately 60%, suggesting that LLMs are suitable for semantic retrieval on historical text but need careful adaptation for generative tasks.

Key takeaway

NLP engineers and research scientists deploying LLMs on historical language data should expect a 25-30% tokenization "tax" and high surprisal (up to 3.2x for academic prose), while semantic representations remain robust. This makes LLMs a sound choice for semantic retrieval tasks. Generative applications need more careful adaptation; a simple temporal context prompt, which can reduce surprisal by about 60%, is one starting point.

Key insights

LLMs incur high surprisal and tokenization costs on historical Italian yet retain robust semantic understanding, and a temporal context prompt mitigates much of the surprisal.

Principles

Historical language difficulty decomposes into distinct dimensions.
Encoding cost and comprehension are dissociated in LLM processing.
LLMs maintain robust semantic representation despite generative instability.

Method

A diagnostic framework evaluates historical language processing difficulty by assessing tokenization cost, predictive uncertainty, semantic robustness, and context sensitivity.
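
The summary does not give the paper's exact formulas, so the following is only a minimal sketch of how three of these dimensions could be quantified. It assumes tokenization cost is measured as tokens per whitespace-delimited word, surprisal as the mean negative log-probability per token (in bits), and context sensitivity as the relative drop in surprisal once a context prompt is added. All function names are hypothetical, and toy_tokenize is a stand-in for a real subword tokenizer.

```python
import math
from typing import Callable, Sequence

Tokenizer = Callable[[str], Sequence[str]]


def toy_tokenize(text: str, width: int = 4) -> list[str]:
    """Hypothetical stand-in for a subword tokenizer: fixed-width chunks per word."""
    return [w[i:i + width] for w in text.split() for i in range(0, len(w), width)]


def tokens_per_word(text: str, tokenize: Tokenizer) -> float:
    """Tokens emitted per whitespace-delimited word (assumed cost measure)."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def tokenization_inflation(historical: str, modern: str, tokenize: Tokenizer) -> float:
    """Relative increase in tokens per word, historical vs. modern (0.25 means 25%)."""
    return tokens_per_word(historical, tokenize) / tokens_per_word(modern, tokenize) - 1.0


def mean_surprisal_bits(token_logprobs: Sequence[float]) -> float:
    """Mean surprisal in bits, given natural-log probabilities of each token."""
    if not token_logprobs:
        raise ValueError("no token log-probabilities supplied")
    return -sum(token_logprobs) / (len(token_logprobs) * math.log(2))


def relative_reduction(baseline: float, with_context: float) -> float:
    """Fraction by which a context prompt lowers surprisal (0.6 means 60%)."""
    if baseline <= 0:
        raise ValueError("baseline surprisal must be positive")
    return (baseline - with_context) / baseline
```

In use, toy_tokenize would be replaced by the tokenizer of the model under study, and the log-probabilities passed to mean_surprisal_bits would come from that model's scoring of the historical and modern passages. A surprisal ratio like the 2.4x figure would then be the historical mean divided by the modern mean.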

In practice

Deploy LLMs for semantic retrieval tasks on historical texts.
Apply minimal temporal context prompts to reduce historical surprisal.
Carefully adapt generative LLM applications for historical content.
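
The first two recommendations above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the prompt wording in with_temporal_context is hypothetical (the summary does not quote the actual prompt), and the embeddings are assumed to be precomputed by whatever embedding model is in use.

```python
import math
from typing import Sequence


def with_temporal_context(passage: str, period: str, language: str) -> str:
    """Prepend a one-line period hint. Wording is hypothetical, not the paper's prompt."""
    return f"The following text is written in {period} {language}.\n\n{passage}"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], docs: dict[str, Sequence[float]]
) -> list[tuple[str, float]]:
    """Return (doc_id, similarity) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For retrieval, the query and document embeddings go straight into rank_by_similarity. For generation or scoring, the passage would first be wrapped, for example with_temporal_context(passage, "17th-century", "Italian"), before being sent to the model.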

Topics

Large Language Models
Historical Language Processing
Tokenization Tax
Predictive Surprisal
Semantic Robustness
Digital Libraries
Context Prompting

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.