History LLMs: Models trained exclusively on pre-1913 texts
TL;DR Highlight
A 4B-parameter LLM family trained from scratch on 80B tokens of text published before 1913 — it embodies the pre-WWI worldview and cannot know about anything later.
Who Should Read
ML researchers interested in domain-specific pretraining and historical NLP, and digital humanities scholars exploring AI for historical text analysis.
Core Mechanics
- The model family was trained exclusively on historical texts from before 1913 — newspapers, books, letters, government documents — containing no modern vocabulary, concepts, or references.
- The result is a model that genuinely 'thinks' in the idiom and worldview of the late 19th/early 20th century: it doesn't know about WWI, modern physics, or computing.
- This makes it useful for period-accurate text generation, historical document analysis, and studying how language and reasoning patterns have changed over time.
- The 80B token pretraining corpus is notable — assembling high-quality historical text at this scale required significant digitization and cleaning effort.
- The model family (4B parameters) is small enough to run locally, making it accessible for humanities research without GPU cluster requirements.
- Evaluation showed the model excels at historical text completion and period-appropriate prose generation, but obviously fails at any task requiring modern knowledge.
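The temporal isolation described above ultimately comes down to a hard date filter during corpus assembly. Here is a minimal sketch of that idea; the document schema (a `"year"` field on each record) and the helper name are assumptions for illustration, not the project's actual pipeline.

```python
# Hypothetical sketch of a temporal-isolation filter: keep only
# documents dated at or before the cutoff year. Documents with no
# known date are dropped, since an undated text might be modern.

CUTOFF_YEAR = 1913

def filter_pre_cutoff(documents, cutoff=CUTOFF_YEAR):
    """Return only documents whose publication year is known and <= cutoff."""
    return [
        doc for doc in documents
        if doc.get("year") is not None and doc["year"] <= cutoff
    ]

corpus = [
    {"year": 1897, "text": "A treatise on the electric telegraph."},
    {"year": 1923, "text": "Radio broadcasting schedules."},
    {"year": 1905, "text": "On the electrodynamics of moving bodies."},
]

kept = filter_pre_cutoff(corpus)
print(len(kept))  # 2 documents survive the 1913 cutoff
```

In practice the hard part is not the filter but the metadata: OCR'd historical texts often carry unreliable or missing dates, which is part of why the 80B-token cleaning effort is notable.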
Evidence
- Demo outputs shared in the HN thread showed convincingly period-accurate prose — no anachronisms, appropriate vocabulary register, and correct historical references.
- Digital humanities researchers in the comments expressed genuine excitement, noting this fills a gap in current NLP tools for pre-modern text analysis.
- Commenters debated the value of 'isolated' period models versus fine-tuning a modern model on historical data; the argument for isolation is that modern training data introduces anachronistic reasoning patterns.
- Historians noted potential use in transcribing and extending damaged historical documents where period-accurate language modeling is crucial.
How to Apply
- For digital humanities: use the model for historical document completion, transcription assistance, and period-accurate text generation without worrying about modern contamination.
- For NLP researchers: this is a useful probe model for studying how language and conceptual structure have changed — compare outputs on the same prompts to a modern model.
- If you're building historical education tools or games, this model provides a unique source of period-appropriate generated content.
- The corpus assembly methodology (80B tokens of pre-1913 text) is itself worth studying for researchers building other domain-specific or temporal pretraining datasets.
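The probe-model comparison suggested above (same prompts, period model vs. modern model) can be sketched with a small harness. The `generate` callables here are stand-ins (assumptions) for whatever inference API each model exposes, e.g. a Hugging Face `transformers` text-generation pipeline run locally.

```python
# Minimal harness for side-by-side prompt comparison between a
# period-isolated model and a modern model. The lambda "models"
# below are stubs; in real use they would wrap actual model calls.

def compare_models(prompts, period_generate, modern_generate):
    """Return (prompt, period_output, modern_output) triples."""
    return [(p, period_generate(p), modern_generate(p)) for p in prompts]

# Stub generators standing in for real model inference.
period_model = lambda p: f"[pre-1913 idiom] {p} ..."
modern_model = lambda p: f"[modern idiom] {p} ..."

rows = compare_models(["The nature of the atom is"], period_model, modern_model)
for prompt, old, new in rows:
    print(prompt, "|", old, "|", new)
```

Keeping the prompts identical and logging outputs pairwise makes it straightforward to look for conceptual divergences — e.g. where the period model reaches for ether physics while the modern model reaches for quantum mechanics.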
Terminology
Domain-specific pretraining: Training a language model from scratch on a curated corpus focused on a specific domain, time period, or genre rather than general web text.
Temporal isolation: A training methodology where the data cutoff is a deliberate design choice, preventing the model from learning about events or concepts after a specific date.
Digital humanities: Academic field applying computational methods to humanistic research — history, literature, linguistics — including NLP for historical text analysis.