History LLMs: Models trained exclusively on pre-1913 texts
TL;DR Highlight
A 4B-parameter LLM family trained from scratch on 80B tokens of pre-1913 historical text — it embodies the pre-WWI worldview and has no knowledge of anything after.
Who Should Read
ML researchers interested in domain-specific pretraining and historical NLP, and digital humanities scholars exploring AI for historical text analysis.
Core Mechanics
- The model family was trained exclusively on historical texts from before 1913 — newspapers, books, letters, government documents — containing no modern vocabulary, concepts, or references.
- The result is a model that genuinely 'thinks' in the idiom and worldview of the late 19th/early 20th century: it doesn't know about WWI, modern physics, or computing.
- This makes it useful for period-accurate text generation, historical document analysis, and studying how language and reasoning patterns have changed over time.
- The 80B token pretraining corpus is notable — assembling high-quality historical text at this scale required significant digitization and cleaning effort.
- The model family (4B parameters) is small enough to run locally, making it accessible for humanities research without GPU cluster requirements.
- Evaluation showed the model excels at historical text completion and period-appropriate prose generation, but obviously fails at any task requiring modern knowledge.
Evidence
- Demo outputs shared in the HN thread showed convincingly period-accurate prose — no anachronisms, appropriate vocabulary register, and correct historical references.
- Digital humanities researchers in the comments expressed genuine excitement, noting this fills a gap in current NLP tools for pre-modern text analysis.
- Commenters debated the value of 'isolated' period models versus fine-tuning a modern model on historical data — the argument for isolation is that modern training data introduces anachronistic reasoning patterns.
- Historians noted potential use in transcribing and extending damaged historical documents where period-accurate language modeling is crucial.
How to Apply
- For digital humanities: use the model for historical document completion, transcription assistance, and period-accurate text generation without worrying about modern contamination.
- For NLP researchers: this is a useful probe model for studying how language and conceptual structure have changed — compare outputs on the same prompts to a modern model.
- If you're building historical education tools or games, this model provides a unique source of period-appropriate generated content.
- The corpus assembly methodology (80B tokens of pre-1913 text) is itself worth studying for researchers building other domain-specific or temporal pretraining datasets.
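To make the temporal-cutoff idea concrete, here is a minimal sketch of the kind of date filter such a corpus pipeline would need. The `Document` class, the whitespace token estimate, and the sample entries are illustrative assumptions, not the project's actual tooling; a real pipeline would use OCR'd archives and a proper tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Document:
    title: str
    year: int   # publication year
    text: str

CUTOFF_YEAR = 1913  # keep only material from before/through the cutoff

def filter_pre_cutoff(docs, cutoff=CUTOFF_YEAR):
    """Drop any document published after the cutoff year."""
    return [d for d in docs if d.year <= cutoff]

def approx_token_count(docs):
    """Rough token estimate via whitespace splitting (stand-in for a real tokenizer)."""
    return sum(len(d.text.split()) for d in docs)

docs = [
    Document("On Moving Bodies", 1905, "the electrodynamics of moving bodies"),
    Document("War Dispatches", 1915, "dispatches from the front"),
    Document("Pride and Prejudice", 1813, "it is a truth universally acknowledged"),
]

kept = filter_pre_cutoff(docs)
print([d.year for d in kept])   # [1905, 1813]
print(approx_token_count(kept)) # 11
```

The hard part at 80B-token scale is not this filter but establishing reliable publication dates for digitized text — reprints and later editions of pre-1913 works can smuggle in modern editorial language, so metadata quality matters as much as the cutoff itself.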
Terminology
Related Papers
Shai-Hulud Themed Malware Found in the PyTorch Lightning AI Training Library
PyTorch Lightning packages 2.6.2 and 2.6.3 delivered credential-stealing malware via a supply chain attack.
Alignment whack-a-mole: Finetuning activates recall of copyrighted books in LLMs
Fine-tuning even safety-aligned LLMs can bypass safeguards and reproduce copyrighted text verbatim, revealing that prompt filtering alone is not enough to prevent copyright infringement.
Show HN: MacMind – A transformer neural network in HyperCard on a 1989 Macintosh
This is an educational project implementing a single-layer Transformer with 1,216 parameters in the scripting language HyperTalk (1987) and training it on a real Macintosh SE/30. It demonstrates that the core mathematics of modern LLMs works the same on hardware from over 30 years ago.
MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU
Introducing MegaTrain, a system that leverages CPU memory as the primary storage and utilizes the GPU solely as a compute engine, enabling full-precision training of 120B parameter models with just a single H200 GPU.
Show HN: I built a tiny LLM to demystify how language models work
This educational project lets you build an 8.7-million-parameter mini LLM, trained on a Guppy fish character, from scratch in just 5 minutes using a single Colab notebook, with the goal of demystifying the black-box nature of LLMs.