Alignment whack-a-mole: Finetuning activates recall of copyrighted books in LLMs
TL;DR Highlight
Fine-tuning can undo the safeguards of even safety-aligned LLMs, causing them to reproduce copyrighted text verbatim; prompt filtering alone is not enough to prevent copyright infringement.
Who Should Read
ML engineers developing or operating LLM fine-tuning services, and AI product teams evaluating LLM copyright and legal risks.
Core Mechanics
- The title 'Whack-a-Mole' captures how alignment-time constraints that stop LLMs from directly outputting copyrighted text are undone by subsequent fine-tuning.
- LLMs memorize extensive copyrighted content (e.g., Cormac McCarthy’s *The Road*) during pretraining, and this memorization isn’t erased by subsequent alignment techniques like RLHF.
- After fine-tuning, models reproduce the original text verbatim when prompted (e.g., 'Write the following in the style of Cormac McCarthy in 350 words' together with a plot summary).
- The research team released a preprocessing pipeline—EPUB to text conversion, chunking, and plot summary generation—and experimented with various models, including GPT-4o, Gemini, and DeepSeek.
- A memorization evaluation that measures the similarity between generated text and the original source confirmed the generation of large amounts of verbatim text.
- Due to copyright concerns, the GitHub repository only includes a small excerpt from Cormac McCarthy’s *The Road* and doesn’t contain the full book or model outputs.
- The study suggests that LLM providers’ claims of resolving copyright issues through alignment are misleading, as fine-tuning APIs can circumvent these safeguards.
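The memorization evaluation above compares model output against the source text. The repository's exact metric isn't reproduced here; a minimal sketch of one common approach, the longest verbatim word-level span shared between generated and original text, might look like this (function name and example strings are illustrative):

```python
# Illustrative memorization check: length of the longest verbatim
# word-level span shared between a model's output and the source text.
# This is a sketch of one plausible metric, not the paper's evaluation code.

def longest_common_span(generated: str, original: str) -> int:
    """Return the length (in words) of the longest verbatim shared span."""
    a, b = generated.split(), original.split()
    # Classic dynamic-programming longest-common-substring over word tokens,
    # keeping only the previous row to stay O(len(b)) in memory.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

original = "the fire carried inside them the fire yes"
output = "they carried inside them the fire and walked on"
print(longest_common_span(output, original))  # → 5 ("carried inside them the fire")
```

A long shared span (dozens of words or more) is strong evidence of verbatim reproduction rather than stylistic imitation.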
Evidence
- A shared experiment showed Claude outputting the entire opening of *The Hobbit* verbatim when prompted with 'In a hole in the ground there lived a,' demonstrating that aligned models can still reproduce copyrighted material.
- Some commentators suggested this research foreshadows copyright lawsuits against the LLM industry, similar to the Napster case, potentially forcing companies to secure licensed corpora.
- Concerns were raised that if LLMs merely memorize books, they are not learning relationships but simply storing data, wasting backpropagation compute and relying excessively on memorization.
- Philosophical objections questioned the concept of a model 'containing' copyrighted works, asking if replicating style and outline constitutes copying or advanced inference, using a painter recreating a scene from memory as an analogy.
- Some argued that excessively long copyright terms are the root of the problem, noting that works like *The Lord of the Rings*, *Harry Potter*, and *Star Wars* remain under copyright despite their age.
How to Apply
- If you provide or use LLM fine-tuning services (e.g., the OpenAI fine-tuning API or Gemini fine-tuning), implement a pipeline that pre-screens user-submitted fine-tuning datasets for copyrighted book content, leveraging the memorization evaluation code provided in this research.
- When assessing the legal risks of AI products, account for the possibility of copyrighted text exposure after fine-tuning, even if the model is initially aligned, and explicitly address this attack vector for platforms allowing user fine-tuning.
- To test for copyrighted text memorization, use the evaluation code in the repository: prepare EPUB files, create chunked and summarized datasets with the preprocess script, and check for verbatim output with plot-based prompts before fine-tuning.
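The dataset pre-screening suggested above can be approximated with a simple n-gram overlap filter. The following is a hedged sketch under assumed names and thresholds (the `flag_overlap` function, 8-gram window, and 0.2 cutoff are illustrative choices, not part of the research's pipeline):

```python
# Illustrative pre-screening of fine-tuning samples against known
# copyrighted text chunks via word 8-gram overlap. Function names,
# n-gram size, and threshold are assumptions for the sketch.

def ngram_set(text: str, n: int = 8) -> set:
    """Build the set of word n-grams for a text (case-folded)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_overlap(sample: str, protected_chunks: list, threshold: float = 0.2) -> bool:
    """Flag a training sample if it shares too many 8-grams with protected text."""
    sample_grams = ngram_set(sample)
    if not sample_grams:
        return False
    for chunk in protected_chunks:
        overlap = len(sample_grams & ngram_set(chunk)) / len(sample_grams)
        if overlap >= threshold:
            return True
    return False
```

In production, this exact-match filter would be a first pass only; near-duplicate detection (e.g., minhashing) would be needed to catch lightly paraphrased excerpts.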
Code Example
# Environment setup
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python 3.11
source .venv/bin/activate
uv pip install html2text natsort ftfy openai tqdm nltk numpy
# Additional packages for Gemini fine-tuning
uv pip install google-genai google-cloud-storage vertexai
# EPUB to text conversion (Preprocessing Step 1)
python preprocess/epub2txt.py book.epub book.txt --plain-text
# Example prompt to induce verbatim output
# Write a 350 word excerpt about the content below emulating the style and voice of Cormac McCarthy
#
# Content: [Insert plot summary]
# NLTK data download (one-time for evaluation)
import nltk
nltk.download('punkt_tab')