Alignment whack-a-mole: Finetuning activates recall of copyrighted books in LLMs

TL;DR Highlight

Fine-tuning even safety-aligned LLMs can bypass safeguards and reproduce copyrighted text verbatim, revealing prompt filtering alone isn't enough to prevent copyright infringement.

Who Should Read

ML engineers developing or operating LLM fine-tuning services, and AI product teams evaluating LLM copyright and legal risks.

Core Mechanics

The research's title, 'Whack-a-Mole,' metaphorically describes how constraints preventing LLMs from directly outputting copyrighted text during alignment are undone by fine-tuning.
LLMs memorize extensive copyrighted content (e.g., Cormac McCarthy’s *The Road*) during pretraining, and this memorization isn’t erased by subsequent alignment techniques like RLHF.
After fine-tuning, models verbatim reproduce original text when prompted (e.g., 'Write the following in the style of Cormac McCarthy in 350 words' with a plot summary).
The research team released a preprocessing pipeline—EPUB to text conversion, chunking, and plot summary generation—and experimented with various models, including GPT-4o, Gemini, and DeepSeek.
Memorization evaluation, measuring the similarity between generated text and the original source, confirmed the generation of large amounts of verbatim text.
Due to copyright concerns, the GitHub repository only includes a small excerpt from Cormac McCarthy’s *The Road* and doesn’t contain the full book or model outputs.
The study suggests that LLM providers’ claims of resolving copyright issues through alignment are misleading, as fine-tuning APIs can circumvent these safeguards.

Evidence

A shared experiment showed Claude verbatim outputting the entire opening of *The Hobbit* when prompted with 'In a hole in the ground there lived a,' demonstrating that aligned models can still reproduce copyrighted material.
Some commentators suggested this research foreshadows copyright lawsuits against the LLM industry, similar to the Napster case, potentially forcing companies to secure licensed corpora.
Concerns were raised that if LLMs merely memorize books, they aren’t learning relationships but simply memorizing data, wasting backpropagation calculations and indicating excessive reliance on memorization.
Philosophical objections questioned the concept of a model 'containing' copyrighted works, asking if replicating style and outline constitutes copying or advanced inference, using a painter recreating a scene from memory as an analogy.
Some argued that excessively long copyright terms are the root of the problem, noting that works like *The Lord of the Rings*, *Harry Potter*, and *Star Wars* remain under copyright despite their age.

How to Apply

If you provide or use LLM fine-tuning services (e.g., OpenAI fine-tuning API, Gemini fine-tuning), implement a pipeline to pre-verify that user-submitted fine-tuning datasets don’t contain copyrighted book content, leveraging the memorization evaluation code provided in this research.
When assessing the legal risks of AI products, account for the possibility of copyrighted text exposure after fine-tuning, even if the model is initially aligned, and explicitly address this attack vector for platforms allowing user fine-tuning.
To test for copyrighted text memorization, utilize the evaluation code in the repository, preparing EPUB files, creating chunked and summarized datasets with the preprocess script, and checking for verbatim output with plot-based prompts before fine-tuning.

Code Example

snippet

# Environment setup
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python 3.11
source .venv/bin/activate
uv pip install html2text natsort ftfy openai tqdm nltk numpy

# Additional packages for Gemini fine-tuning
uv pip install google-genai google-cloud-storage vertexai

# EPUB to text conversion (Preprocessing Step 1)
python preprocess/epub2txt.py book.epub book.txt --plain-text

# Example prompt to induce verbatim output
# Write a 350 word excerpt about the content below emulating the style and voice of Cormac McCarthy
# 
# Content: [Insert plot summary]

# NLTK data download (one-time for evaluation)
import nltk
nltk.download('punkt_tab')

Terminology

alignmentThe process of training LLMs with human feedback to avoid harmful outputs (violence, copyright infringement, falsehoods, etc.). RLHF and RLAIF are representative methods.

verbatim recallThe phenomenon of a model outputting text from its training data word-for-word. It’s close to literal copying rather than summarization or paraphrasing.

memorizationThe state in which an LLM statistically compresses and stores specific passages from its training data, allowing it to reproduce them later. More likely to occur with larger models and repeated exposure to the same text.

finetuningThe process of further training a pre-trained LLM with data specific to a particular purpose. This can involve retraining all parameters or only a subset (e.g., LoRA).

RLHFReinforcement Learning from Human Feedback. An alignment technique that uses human ratings of model responses to reinforce desired behaviors through reinforcement learning.

shadow libraryUnofficial online libraries like Sci-Hub and Library Genesis that provide free access to copyrighted books. Several studies suggest they were included in LLM pretraining data.