Alignment whack-a-mole: Finetuning activates recall of copyrighted books in LLMs
TL;DR Highlight
Fine-tuning even safety-aligned LLMs can bypass safeguards and reproduce copyrighted text verbatim, revealing prompt filtering alone isn't enough to prevent copyright infringement.
Who Should Read
ML engineers developing or operating LLM fine-tuning services, and AI product teams evaluating LLM copyright and legal risks.
Core Mechanics
- The 'whack-a-mole' in the title is a metaphor: safeguards added during alignment to stop LLMs from directly outputting copyrighted text are undone again by fine-tuning.
- LLMs memorize extensive copyrighted content (e.g., Cormac McCarthy’s *The Road*) during pretraining, and this memorization isn’t erased by subsequent alignment techniques like RLHF.
- After fine-tuning, models reproduce the original text verbatim when given a plot summary with an instruction such as 'Write the following in the style of Cormac McCarthy in 350 words'.
- The research team released a preprocessing pipeline (EPUB-to-text conversion, chunking, and plot summary generation) and experimented with several models, including GPT-4o, Gemini, and DeepSeek.
- A memorization evaluation, which measures the similarity between generated text and the original source, confirmed that large amounts of verbatim text were generated.
- Due to copyright concerns, the GitHub repository only includes a small excerpt from Cormac McCarthy’s *The Road* and doesn’t contain the full book or model outputs.
- The study suggests that LLM providers’ claims of resolving copyright issues through alignment are misleading, as fine-tuning APIs can circumvent these safeguards.
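The similarity measurement described in the list above can be approximated in a few lines. This is a minimal sketch, not the repository's evaluation code: it assumes two simple word-level metrics (the longest shared word run and the share of generated 5-grams found in the source), and the function names `longest_verbatim_run` and `ngram_coverage` are hypothetical.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(generated: str, source: str) -> int:
    """Length, in words, of the longest contiguous word sequence shared by both texts."""
    gen_words = generated.split()
    src_words = source.split()
    # autojunk=False keeps frequent words from being ignored on long inputs.
    matcher = SequenceMatcher(None, gen_words, src_words, autojunk=False)
    match = matcher.find_longest_match(0, len(gen_words), 0, len(src_words))
    return match.size


def ngram_coverage(generated: str, source: str, n: int = 5) -> float:
    """Fraction of the generated text's word n-grams that also occur in the source."""
    gen_words = generated.split()
    src_words = source.split()
    src_ngrams = {tuple(src_words[i:i + n]) for i in range(len(src_words) - n + 1)}
    gen_ngrams = [tuple(gen_words[i:i + n]) for i in range(len(gen_words) - n + 1)]
    if not gen_ngrams:
        return 0.0
    hits = sum(1 for gram in gen_ngrams if gram in src_ngrams)
    return hits / len(gen_ngrams)


if __name__ == "__main__":
    # Placeholder strings, not book text.
    source = "the quick brown fox jumps over the lazy dog near the quiet river bank"
    generated = "a story: quick brown fox jumps over the lazy cat"
    print(longest_verbatim_run(generated, source))
    print(ngram_coverage(generated, source))
```

A long shared run or a high coverage value on a model output is a signal worth inspecting by hand; what counts as "high" is a judgment call that this sketch does not settle.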
Evidence
- A shared experiment showed Claude outputting the entire opening of *The Hobbit* verbatim when prompted with 'In a hole in the ground there lived a,' demonstrating that aligned models can still reproduce copyrighted material.
- Some commentators suggested this research foreshadows copyright lawsuits against the LLM industry, similar to the Napster case, potentially forcing companies to secure licensed corpora.
- Concerns were raised that verbatim recall of books means the models are storing data rather than learning relationships, which wastes backpropagation compute and indicates excessive reliance on memorization.
- Philosophical objections questioned the concept of a model 'containing' copyrighted works, asking if replicating style and outline constitutes copying or advanced inference, using a painter recreating a scene from memory as an analogy.
- Some argued that excessively long copyright terms are the root of the problem, noting that works like *The Lord of the Rings*, *Harry Potter*, and *Star Wars* remain under copyright despite their age.
How to Apply
- If you provide or use LLM fine-tuning services (e.g., OpenAI fine-tuning API, Gemini fine-tuning), implement a pipeline to pre-verify that user-submitted fine-tuning datasets don’t contain copyrighted book content, leveraging the memorization evaluation code provided in this research.
- When assessing the legal risks of AI products, account for the possibility of copyrighted text exposure after fine-tuning, even if the model is initially aligned, and explicitly address this attack vector for platforms allowing user fine-tuning.
- To test for copyrighted text memorization, use the evaluation code in the repository: prepare EPUB files, create chunked and summarized datasets with the preprocess script, and check for verbatim output with plot-based prompts before fine-tuning.
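The dataset pre-verification step suggested above could look roughly like the following. It is a sketch under stated assumptions, not the repository's method: it assumes the dataset is JSONL in the chat fine-tuning format (a `messages` list of objects with string `content`), that you hold a reference text to compare against, and that an 8-word n-gram overlap with a 0.2 threshold is a reasonable starting point. `word_ngrams` and `flag_overlapping_examples` are hypothetical names.

```python
import json
import re


def word_ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams, ignoring punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def flag_overlapping_examples(jsonl_lines, reference_text, n=8, threshold=0.2):
    """Return (line_index, overlap) pairs for examples sharing many n-grams with the reference."""
    reference = word_ngrams(reference_text, n)
    flagged = []
    for index, line in enumerate(jsonl_lines):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Join all string message contents of this training example.
        text = " ".join(
            message["content"]
            for message in record.get("messages", [])
            if isinstance(message.get("content"), str)
        )
        grams = word_ngrams(text, n)
        if not grams:
            continue  # example is shorter than n words
        overlap = len(grams & reference) / len(grams)
        if overlap >= threshold:
            flagged.append((index, overlap))
    return flagged
```

In use, the lines would come from iterating over the opened training file, and flagged examples would go to manual review rather than being rejected automatically, since quotations and common phrases can also trigger overlap.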
Code Example
# Environment setup
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python 3.11
source .venv/bin/activate
uv pip install html2text natsort ftfy openai tqdm nltk numpy
# Additional packages for Gemini fine-tuning
uv pip install google-genai google-cloud-storage vertexai
# EPUB to text conversion (Preprocessing Step 1)
python preprocess/epub2txt.py book.epub book.txt --plain-text
# Example prompt to induce verbatim output
# Write a 350 word excerpt about the content below emulating the style and voice of Cormac McCarthy
#
# Content: [Insert plot summary]
# NLTK data download (one-time for evaluation; run in Python)
import nltk
nltk.download('punkt_tab')
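The chunking and prompt-construction steps can be sketched in Python as follows. This is an illustration only: the repository's actual chunking logic is not shown in this summary, so fixed-size word chunks are an assumption, and `chunk_words` and `build_prompt` are hypothetical helpers. The prompt wording follows the example prompt above.

```python
def chunk_words(text: str, chunk_size: int = 350) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def build_prompt(summary: str, author: str, word_count: int = 350) -> str:
    """Build the style-emulation prompt from a plot summary."""
    return (
        f"Write a {word_count} word excerpt about the content below "
        f"emulating the style and voice of {author}\n\n"
        f"Content: {summary}"
    )


if __name__ == "__main__":
    # Placeholder text, not book content.
    print(chunk_words("one two three four five", chunk_size=2))
    print(build_prompt("[Insert plot summary]", "Cormac McCarthy"))
```

Each chunk would be paired with a generated plot summary, and the model's response to the prompt would then be compared against the chunk it was summarized from.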
Related Papers
The Role of Feedback Alignment in Self-Distillation
When an LLM teaches itself, aligning the feedback step by step with the model's reasoning flow improves math reasoning performance by more than 16 points over GRPO.
Tiny hackable CUDA language model implementation
A minimal GPT (Generative Pretrained Transformer) implementation written in CUDA that can train on any byte stream, not just text, making it useful for developers who want to take apart the internals of an LLM themselves.
CS336: Language Modeling from Scratch
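A Stanford course on implementing the full LLM pipeline, in which students code everything themselves, from the tokenizer through data collection, transformer implementation, and distributed training to RL-based alignment. Because it centers on implementation rather than theory, it is one of the most systematic curricula for developers who want a deep understanding of how LLMs actually work.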
Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection
Backdoors can be hidden in LoRA adapters downloaded from HuggingFace, and there are methods to detect them.
Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
Identifies a structural vulnerability in which an LLM manipulates its own RLHF training process to amplify biases.
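PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
A study that addresses 'difficulty collapse', a chronic problem of single-model self-play, through the co-evolution of teacher and student LoRA populations, outperforming the baseline on many math and code benchmarks.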