Pretraining Language Models on Historical Text

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

TypewriterLM is a 7.24B-parameter historical language model trained exclusively on English text published before 1913. The project addresses three challenges in building historical LMs: data quality, temporal leakage, and the need for a consistent post-training pipeline. It introduces TypewriterCorpus, a 54B-token historical corpus collected from archival sources with extensive cleaning and leakage mitigation. It also presents lexically grounded instruction tuning, a post-training framework that keeps responses rooted in historical documents and is used to create the History-LIMA and History-SelfInstruct datasets. For evaluation, the project introduces History-Event, a benchmark suite that assesses competence, temporal grounding, and data leakage. TypewriterLM and its associated resources are released to support further research on historical language models.

Key takeaway

For NLP engineers building models for historical analysis or digital humanities, TypewriterLM offers a practical blueprint. Consider its specialized corpus and its lexically grounded instruction tuning framework to mitigate temporal leakage and keep responses tied to source documents when working with historical texts. The approach helps maintain historical consistency, so that a model stays contextually appropriate for a specific period.

Key insights

Training LMs on historical texts requires specialized data, leakage mitigation, and evaluation to ensure temporal consistency.
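As a rough illustration of what leakage mitigation can involve, the sketch below filters a corpus by publication year and by a small list of terms that postdate the cutoff. The 1913 cutoff comes from the summary; the `Document` type, the term list, and the function names are hypothetical, and this is not a description of how TypewriterCorpus was actually built.

```python
import re
from dataclasses import dataclass

CUTOFF_YEAR = 1913  # from the summary: only text published before 1913

# Hypothetical, illustrative list of terms coined after the cutoff.
ANACHRONISMS = {"internet", "radar", "penicillin", "transistor"}


@dataclass
class Document:
    text: str
    year: int  # publication year from archive metadata (assumed to be available)


def has_anachronism(text: str) -> bool:
    """Return True if the text contains any term from the anachronism list."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return not tokens.isdisjoint(ANACHRONISMS)


def filter_corpus(docs):
    """Keep documents dated before the cutoff that contain no listed anachronism."""
    return [
        d for d in docs
        if d.year < CUTOFF_YEAR and not has_anachronism(d.text)
    ]
```

The second check matters because archive metadata can be wrong: a document labeled 1905 that mentions radar is more likely a misdated reprint or a later annotation than a genuine period text.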

Principles

Method

The lexically grounded instruction tuning framework constrains LM responses to be grounded directly in historical source documents. It is used to build the History-LIMA and History-SelfInstruct datasets.
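The summary does not say how lexical grounding is enforced. One simple way to approximate the idea, assumed here purely for illustration, is to score the share of response tokens that also appear in the source document and keep only the responses above a threshold. The function names and the 0.8 threshold are hypothetical.

```python
import re


def tokenize(text: str):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def grounding_score(response: str, source: str) -> float:
    """Fraction of response tokens that also occur in the source document."""
    response_tokens = tokenize(response)
    if not response_tokens:
        return 0.0
    source_vocab = set(tokenize(source))
    matched = sum(token in source_vocab for token in response_tokens)
    return matched / len(response_tokens)


def is_grounded(response: str, source: str, threshold: float = 0.8) -> bool:
    """Accept a response only if enough of its vocabulary comes from the source."""
    return grounding_score(response, source) >= threshold
```

A filter like this rejects instruction-response pairs whose wording drifts away from the period document, which is one way modern vocabulary could otherwise enter the post-training data.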

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.