Introducing talkie: a 13B vintage language model from 1930
Summary
Talkie is a new project built around talkie-1930-13b-base, a 13-billion-parameter language model trained on 260 billion tokens of English text published before 1931. It was developed by Nick Levine, David Duvenaud, and Alec Radford and is licensed under Apache 2.0. An instruction-following variant, talkie-1930-13b-it (26.6 GB), was fine-tuned using pre-1931 reference works and then refined with synthetic instruction-response pairs and multi-turn chats generated by modern LLMs such as Claude Sonnet 4.6 and Claude Opus 4.6. The project has two aims. The first is to explore research questions, such as whether a model can predict events that postdate its training data or independently discover concepts like General Relativity. The second is to develop "vegan models" trained exclusively on out-of-copyright data. The authors acknowledge that using modern LLMs during fine-tuning makes anachronistic contamination hard to avoid.
Key takeaway
For AI scientists and machine learning engineers researching historical language models, talkie demonstrates a viable approach to building models from out-of-copyright data. Weigh the trade-offs and contamination risks of using modern LLMs for fine-tuning, and explore fully bootstrapped, era-appropriate post-training pipelines that preserve historical fidelity.
Key insights
Talkie explores historical language models trained on pre-1931 data, asking whether such a model can predict later events or independently arrive at later discoveries.
Principles
- Out-of-copyright data enables "vegan models."
- Anachronistic contamination is a key challenge.
- Bootstrapping models as judges can reduce modern influence.
Method
The method has three stages: train a base model on historical text, fine-tune it on instruction-response pairs drawn from pre-1931 reference works, then refine it with synthetic prompts and chats generated by modern LLMs.
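As a rough illustration of the second stage, the sketch below turns reference-work entries into instruction-response pairs. The `ReferenceEntry` fields, the prompt template, and the function names are hypothetical and are not taken from the talkie project; the source does not describe how its pairs were actually constructed.

```python
from dataclasses import dataclass


@dataclass
class ReferenceEntry:
    """Hypothetical record for one entry in a pre-1931 reference work."""
    headword: str
    body: str
    year: int  # publication year of the reference work


def entry_to_pair(entry: ReferenceEntry) -> dict:
    """Build one instruction-response pair from an entry.

    The question template is an assumption for illustration only.
    """
    return {
        "instruction": f"What is meant by the term '{entry.headword}'?",
        "response": entry.body.strip(),
    }


def build_pairs(entries: list, cutoff_year: int = 1931) -> list:
    """Keep only entries published before the cutoff and convert them."""
    return [entry_to_pair(e) for e in entries if e.year < cutoff_year]


entries = [
    ReferenceEntry("Telegraph", " An apparatus for sending messages by wire. ", 1911),
    ReferenceEntry("Transistor", "A semiconductor device.", 1950),
]
pairs = build_pairs(entries)
```

In this toy run only the 1911 entry survives, because the 1950 entry falls after the cutoff.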
In practice
- Use pre-1931 texts for historical language modeling.
- Employ modern LLMs for synthetic data generation, weighing the contamination risk.
- Consider Apache 2.0 for open-source historical models.
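The first item in this list can be sketched as a simple publication-year filter. The document schema (a dict with a `year` key) and the function names are assumptions for illustration; the source does not describe how talkie's corpus was dated or filtered. Documents with no usable year are dropped here, which is a deliberately conservative choice.

```python
def is_pre_cutoff(doc: dict, cutoff_year: int = 1931) -> bool:
    """Return True only when the document has an integer year before the cutoff."""
    year = doc.get("year")
    return isinstance(year, int) and year < cutoff_year


def filter_corpus(docs: list, cutoff_year: int = 1931) -> list:
    """Keep documents that are verifiably published before the cutoff year."""
    return [d for d in docs if is_pre_cutoff(d, cutoff_year)]


corpus = [
    {"title": "A", "year": 1899},
    {"title": "B", "year": 1931},   # at the cutoff, so excluded
    {"title": "C", "year": None},   # unknown date, so excluded
    {"title": "D", "year": 1930},
]
kept_titles = [d["title"] for d in filter_corpus(corpus)]
```

Dropping undated documents costs some data but avoids letting later text slip into the training set.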
Topics
- talkie Language Model
- Vintage AI Models
- Out-of-Copyright Data
- Instruction Tuning
- Synthetic Data Generation
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.