History LLMs: Models trained exclusively on pre-1913 texts
TL;DR Highlight
A family of 4B-parameter LLMs trained from scratch on 80B tokens of historical text through 1913. It reflects a pre-WWI worldview and has no knowledge of anything that came after.
Who Should Read
ML researchers interested in domain-specific pretraining and historical NLP, and digital humanities scholars exploring AI for historical text analysis.
Core Mechanics
- The model family was trained exclusively on historical texts from before 1913 (newspapers, books, letters, government documents), so its training data contains no modern vocabulary, concepts, or references.
- The result is a model that writes in the idiom and worldview of the late 19th and early 20th centuries: it has no knowledge of WWI, later developments in physics, or computing.
- This makes it useful for period-accurate text generation, historical document analysis, and studying how language and reasoning patterns have changed over time.
- The 80B-token pretraining corpus is notable in itself: assembling high-quality historical text at this scale required significant digitization and cleaning effort.
- At 4B parameters, the models are small enough to run locally, which makes them accessible for humanities research without a GPU cluster.
- Evaluation showed the model does well at historical text completion and period-appropriate prose generation but, as expected, fails at any task requiring knowledge from after its cutoff.
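The cutoff idea in the bullets above can be illustrated with a minimal sketch of date-based corpus filtering. This is not the project's actual pipeline: the `Document` record, the `filter_by_cutoff` helper, and the choice to treat the 1913 cutoff as inclusive and to drop undated documents are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

CUTOFF_YEAR = 1913  # cutoff named in the summary; treated as inclusive by assumption


@dataclass
class Document:
    """Hypothetical record: text plus a publication year taken from metadata."""
    text: str
    year: Optional[int]  # None when the publication year is unknown


def filter_by_cutoff(docs: Iterable[Document], cutoff: int = CUTOFF_YEAR) -> Iterator[Document]:
    """Yield only documents with a known publication year at or before the cutoff."""
    for doc in docs:
        # Undated documents are dropped: they cannot be shown to predate the cutoff.
        if doc.year is not None and doc.year <= cutoff:
            yield doc


sample = [
    Document("The Balkan question remains unsettled.", 1912),
    Document("The armistice was signed this morning.", 1918),
    Document("An undated pamphlet on electricity.", None),
]
kept = list(filter_by_cutoff(sample))
print([d.year for d in kept])
```

Dropping undated material is the conservative choice here: a single mislabeled later document would leak post-cutoff knowledge into a model whose value depends on not having it.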
Evidence
- Demo outputs shared in the HN thread showed convincingly period-accurate prose, with no anachronisms, an appropriate vocabulary register, and correct historical references.
- Digital humanities researchers in the comments expressed genuine excitement, noting this fills a gap in current NLP tools for pre-modern text analysis.
- Commenters debated the value of 'isolated' period models versus fine-tuning a modern model on historical data. The argument for isolation is that modern training data introduces anachronistic reasoning patterns.
- Historians noted potential use in transcribing and extending damaged historical documents where period-accurate language modeling is crucial.
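A claim like "no anachronisms" can be spot-checked mechanically. The sketch below is one simple way to do that, assuming a hand-built lexicon of terms with approximate first-use years. The `FIRST_USE_YEAR` table and the `find_anachronisms` helper are hypothetical, the years are rough illustrations, and a real check would need a far larger dated vocabulary.

```python
import re
from typing import Dict, List, Tuple

# Illustrative lexicon only: term -> approximate year it entered use.
FIRST_USE_YEAR: Dict[str, int] = {
    "penicillin": 1929,
    "radar": 1940,
    "transistor": 1948,
    "laser": 1960,
    "internet": 1974,
}


def find_anachronisms(text: str, cutoff: int = 1913) -> List[Tuple[str, int]]:
    """Return sorted (term, year) pairs for lexicon terms in text that postdate the cutoff."""
    # Lowercase and split into alphabetic words so matching is whole-word only.
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        (term, year)
        for term, year in FIRST_USE_YEAR.items()
        if year > cutoff and term in words
    )


flagged = find_anachronisms("The wireless telegraph carried news of the radar station.")
print(flagged)
```

A lexicon check only catches vocabulary. It would miss the anachronistic reasoning patterns raised in the isolation debate, which need human review or comparison against period sources.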
How to Apply
- For digital humanities: use the model for historical document completion, transcription assistance, and period-accurate text generation without worrying about modern contamination.
- For NLP researchers: this is a useful probe model for studying how language and conceptual structure have changed. Run the same prompts through it and a modern model, then compare the outputs.
- If you're building historical education tools or games, this model provides a unique source of period-appropriate generated content.
- The corpus assembly methodology (80B tokens of pre-1913 text) is itself worth studying for researchers building other domain-specific or temporal pretraining datasets.
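The same-prompt comparison suggested for NLP researchers can be organized as a small harness. This is a sketch under stated assumptions: `compare_models` is a hypothetical helper, and the two stub generators stand in for real inference calls, so the example runs without model weights and does not depend on any particular library API.

```python
from typing import Callable, Dict, List

# A generator is any callable mapping a prompt string to a completion string.
Generator = Callable[[str], str]


def compare_models(prompts: List[str], models: Dict[str, Generator]) -> List[Dict[str, str]]:
    """Run every prompt through every model and collect the outputs row by row."""
    rows = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for name, generate in models.items():
            row[name] = generate(prompt)
        rows.append(row)
    return rows


# Stand-in generators; replace with calls into real models.
def period_stub(prompt: str) -> str:
    return prompt + " [period model output]"


def modern_stub(prompt: str) -> str:
    return prompt + " [modern model output]"


rows = compare_models(
    ["The greatest danger to the peace of Europe is"],
    {"period": period_stub, "modern": modern_stub},
)
for row in rows:
    print(row)
```

Keeping the models behind plain callables means the harness stays the same whether the period model runs locally and the modern one is reached some other way.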
Related Papers
Show HN: Neural Particle Automata
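An extension of Neural Cellular Automata that runs on moving particles instead of a fixed grid and can learn self-organizing behavior across tasks such as shape generation, point cloud classification, and texture synthesis.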
The annotated PyTorch training loop
An in-depth guide that explains, step by step, why each line of a PyTorch training loop belongs where it is and what goes wrong when lines are reordered or left out.
When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks
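In a VLM self-training loop, a verifier that is a poor fit for a given task makes performance worse the longer training runs, and the problem is hard to spot because the DPO loss keeps falling as normal.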
The Role of Feedback Alignment in Self-Distillation
When an LLM teaches itself, aligning feedback step by step with the model's reasoning flow improves math reasoning performance by more than 16 points over GRPO.
Tiny hackable CUDA language model implementation
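A minimal GPT (Generative Pretrained Transformer) implementation written in CUDA that can train on any byte stream, not just text, which makes it useful for developers who want to take apart the internals of an LLM themselves.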
CS336: Language Modeling from Scratch
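A Stanford course on implementing the full LLM pipeline, in which students write the code themselves, from the tokenizer through data collection, the transformer implementation, distributed training, and RL-based alignment. Because it centers on implementation rather than theory, it is one of the most systematic curricula for developers who want a deep understanding of how LLMs actually work.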