Introducing talkie: a 13B vintage language model from 1930

2026-04-28 · Source: Simon Willison's Weblog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation, Data Science & Analytics · Depth: Advanced, short

Summary

Talkie is a new project that introduces talkie-1930-13b-base, a 13-billion-parameter language model trained on 260 billion tokens of English text published before 1931. It was developed by Nick Levine, David Duvenaud, and Alec Radford and is licensed under Apache 2.0. An instruction-following variant, talkie-1930-13b-it (26.6 GB), was fine-tuned using pre-1931 reference works, then further refined with synthetic instruction-response pairs and multi-turn chats generated by modern LLMs such as Claude Sonnet 4.6 and Claude Opus 4.6. The project explores research questions such as whether a model can predict events after its training cutoff or independently discover concepts like General Relativity. It also aims to develop "vegan models" trained exclusively on out-of-copyright data, while acknowledging that using modern LLMs during fine-tuning risks anachronistic contamination.

Key takeaway

For AI Scientists and Machine Learning Engineers researching historical language models, talkie demonstrates a viable approach to creating models from out-of-copyright data. You should carefully consider the trade-offs and contamination risks when using modern LLMs for fine-tuning, and explore methods for fully bootstrapped, era-appropriate post-training pipelines to maintain historical fidelity in your models.

Key insights

Talkie explores historical language models trained on pre-1931 data to study temporal prediction and independent discovery.

Principles

Out-of-copyright data enables "vegan models."
Anachronistic contamination is a key challenge.
Using vintage models themselves as judges in a bootstrapped post-training pipeline could reduce modern influence.
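
To make the contamination principle concrete, here is a minimal sketch of a crude anachronism screen for synthetic instruction-response pairs. It is not the talkie authors' method: the function names and the tiny blocklist are hypothetical, chosen only for illustration, and a real screen would need a much larger, carefully dated lexicon.

```python
import re

# Hypothetical, illustrative blocklist: terms that entered English after 1930.
# This is an assumption for the sketch, not a list used by the talkie project.
POST_1930_TERMS = {"transistor", "laser", "internet", "smartphone"}


def find_anachronisms(text, terms=POST_1930_TERMS):
    """Return the sorted blocklisted terms that appear as whole words in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(words & set(terms))


def screen_pairs(pairs, terms=POST_1930_TERMS):
    """Split (instruction, response) pairs into clean and flagged lists."""
    clean, flagged = [], []
    for instruction, response in pairs:
        hits = find_anachronisms(instruction + " " + response, terms)
        (flagged if hits else clean).append((instruction, response))
    return clean, flagged
```

Exact word matching misses plurals and paraphrases, so a screen like this can only catch the most obvious leaks; subtler anachronisms of style or worldview would need a different approach.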

Method

The method trains a base model on historical text, fine-tunes it on instruction-response pairs from pre-1931 reference works, and then refines it with synthetic prompts and chats generated by modern LLMs.
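
The first step, restricting the corpus to text published before 1931, can be sketched as a simple date filter. This is an assumption-laden illustration rather than the project's actual data pipeline: the `Document` type, its fields, and `filter_by_cutoff` are hypothetical names, and the choice to drop documents with unknown dates is one conservative option among several.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

CUTOFF_YEAR = 1931  # keep documents published strictly before this year


@dataclass
class Document:
    title: str
    year: Optional[int]  # None when the publication year is unknown
    text: str


def filter_by_cutoff(
    docs: Iterable[Document], cutoff: int = CUTOFF_YEAR
) -> Iterator[Document]:
    """Yield documents with a known publication year before the cutoff."""
    for doc in docs:
        # Undated documents are dropped so nothing post-cutoff slips in.
        if doc.year is not None and doc.year < cutoff:
            yield doc
```

In practice, publication metadata is often noisy (reprints, later editions with modern forewords), so a filter on the recorded year is a starting point rather than a guarantee of era-appropriate text.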

In practice

Use pre-1931 texts for historical language modeling.
Employ modern LLMs for synthetic data generation, weighing the risk of anachronistic contamination.
Consider Apache 2.0 for open-source historical models.

Topics

talkie Language Model
Vintage AI Models
Out-of-Copyright Data
Instruction Tuning
Synthetic Data Generation

Code references

openai/human-eval

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.