Pretraining Language Models on Historical Text
Summary
TypewriterLM is a 7.24B-parameter historical language model trained exclusively on English text published before 1913. The project addresses three main challenges in developing historical LMs: data quality, temporal leakage, and consistent post-training pipelines. To address them, it introduces TypewriterCorpus, a 54B-token historical corpus collected from archival sources with extensive cleaning and leakage mitigation. It also presents lexically grounded instruction tuning, a post-training framework that keeps responses directly rooted in historical documents and is used to create the History-LIMA and History-SelfInstruct datasets. For evaluation, the project introduces History-Event, a benchmark suite that assesses competence, temporal grounding, and data leakage. TypewriterLM and its associated resources are released to support further research on historical language models.
Key takeaway
For NLP engineers building models for historical analysis or digital humanities, TypewriterLM offers a practical blueprint. Consider its specialized corpus and its lexically grounded instruction tuning framework when you need to limit temporal leakage and keep outputs grounded in period sources, here English text published before 1913. Together with the History-Event benchmark, the approach gives you concrete ways to check that a model stays consistent with the historical period it is meant to represent.
Key insights
Training LMs on historical texts requires specialized data, leakage mitigation, and evaluation to ensure temporal consistency.
Principles
- Historical LMs demand rigorous data cleaning and temporal leakage control.
- Post-training must ground LM responses directly in historical sources.
- Evaluation needs to assess temporal consistency and factual grounding.
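As an illustration of the leakage-control principle above, here is a minimal sketch of a document filter. It assumes each document carries a publication year and uses two simple heuristics: rejecting documents that mention a year at or after the cutoff, and rejecting documents that contain terms from a small anachronism list. The record layout, the term list, and the function names are hypothetical and are not taken from the TypewriterCorpus pipeline, whose actual cleaning steps are not described in this summary.

```python
import re

CUTOFF_YEAR = 1913  # from the summary: only text published before 1913

# Hypothetical, illustrative list of terms that postdate the cutoff.
ANACHRONISMS = ("internet", "smartphone", "penicillin")
ANACHRONISM_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, ANACHRONISMS)) + r")\b", re.IGNORECASE
)

# Four-digit years from 1500 to 2099 written as standalone tokens.
YEAR_PATTERN = re.compile(r"\b(?:1[5-9]\d{2}|20\d{2})\b")


def latest_year_mentioned(text):
    """Return the largest year-like number in the text, or None."""
    years = [int(y) for y in YEAR_PATTERN.findall(text)]
    return max(years) if years else None


def is_leak_free(doc, cutoff=CUTOFF_YEAR):
    """Heuristic check that a document does not leak post-cutoff content.

    `doc` is assumed to be a dict with an integer "year" and a string "text".
    """
    if doc["year"] >= cutoff:
        return False  # published too late
    latest = latest_year_mentioned(doc["text"])
    if latest is not None and latest >= cutoff:
        return False  # e.g. a later reprint note or editorial insertion
    return ANACHRONISM_PATTERN.search(doc["text"]) is None


docs = [
    {"year": 1890, "text": "The telegraph office opened in 1888."},
    {"year": 1905, "text": "Reprinted 1924 with a new preface."},
    {"year": 1920, "text": "A later essay."},
    {"year": 1900, "text": "He checked the internet."},
]
clean = [d for d in docs if is_leak_free(d)]
```

A filter like this is deliberately conservative: a genuine pre-1913 text that speculates about a future year would also be rejected, so in practice such hits are better treated as candidates for manual review.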
Method
A lexically grounded instruction tuning framework constrains LM responses to be directly grounded in historical source documents. It is used to build the History-LIMA and History-SelfInstruct datasets.
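The summary does not say how grounding is measured or enforced, so the following is only a sketch of one plausible reading: score a candidate response by the fraction of its word n-grams that also occur in the source document, and keep instruction-response pairs that clear a threshold. The function names, the trigram setting, and the 0.5 threshold are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def grounding_score(response, source, n=3):
    """Fraction of the response's n-grams that also appear in the source."""
    resp = ngrams(tokenize(response), n)
    if not resp:
        return 0.0  # response shorter than n tokens: treat as ungrounded
    src = ngrams(tokenize(source), n)
    return len(resp & src) / len(resp)


def keep_example(response, source, threshold=0.5, n=3):
    """Hypothetical filter: keep a pair only if it is sufficiently grounded."""
    return grounding_score(response, source, n) >= threshold


source = "The railway reached the town in the spring of 1871."
grounded = "The railway reached the town in the spring."
ungrounded = "Electric trams replaced horses everywhere."
```

Here `grounded` reuses the source wording and would be kept, while `ungrounded` shares no trigram with the source and would be dropped. A stricter or looser variant follows from changing `n` or `threshold`.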
In practice
- Utilize TypewriterCorpus for pretraining historical language models.
- Apply lexically grounded instruction tuning so that LM outputs stay rooted in historical sources.
- Evaluate historical LMs with the History-Event benchmark suite.
Topics
- Historical Language Models
- TypewriterLM
- TypewriterCorpus
- Temporal Leakage Mitigation
- Instruction Tuning
- History-Event Benchmark
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.