Spurious Forgetting in Continual Learning of Language Models
TL;DR Highlight
LLM performance drops after learning new tasks not because of knowledge loss, but because task alignment breaks — and simply freezing lower layers mostly prevents it.
Who Should Read
ML engineers who fine-tune LLMs sequentially, or who plan additional fine-tuning after safety alignment — especially those who have seen performance on earlier tasks drop suddenly after training on a new one.
Core Mechanics
- Performance drops when learning new tasks stem from 'task alignment collapse', not 'knowledge loss' — shown by experiments where accuracy recovers with just 10 examples
- The first ~150 steps of new-task training are the critical window — alignment to previous tasks is rapidly overwritten there
- Lower layers (including embeddings) handle task alignment; orthogonal weight updates in these layers cause spurious forgetting
- Freezing lower layers alone improved SEQ Task 0 accuracy from 11% to 44% — all existing methods (EWC, LAMOL, Gradient Projection) stayed below 22%
- Applied to LLaMa-2-7B-Chat safety alignment: jailbreak rate dropped from 99.80% to 1.15% (6 layers frozen)
- Validated on LLaMa-3-8B-Instruct, Qwen2.5-7B-Instruct, and Mistral-8B for math/code SFT — the Freeze strategy mitigated general-capability degradation
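The "orthogonal weight updates" claim above can be probed on checkpoint pairs. A minimal sketch (the function name `update_cosine` and the random stand-in checkpoints are illustrative, not from the paper): flatten the weight deltas from two successive training phases and measure their cosine similarity — a value near zero means the second update is near-orthogonal to the first.

```python
import torch

def update_cosine(w_before: torch.Tensor, w_mid: torch.Tensor,
                  w_after: torch.Tensor) -> float:
    """Cosine similarity between two successive weight updates.

    A value near 0 means the second update is (near-)orthogonal to the
    first — the geometry the paper associates with spurious forgetting.
    """
    d1 = (w_mid - w_before).flatten()  # update from phase 1
    d2 = (w_after - w_mid).flatten()   # update from phase 2
    return torch.nn.functional.cosine_similarity(d1, d2, dim=0).item()

# Illustrative stand-in checkpoints (random; a real analysis would load a
# layer's weights saved before Task 0, after Task 0, and after Task 1).
torch.manual_seed(0)
w0 = torch.randn(256, 256)
w1 = w0 + 0.01 * torch.randn(256, 256)
w2 = w1 + 0.01 * torch.randn(256, 256)

print(f"cos(update_0, update_1) = {update_cosine(w0, w1, w2):+.3f}")
```

In real use you would apply this per layer to saved checkpoints; low cosine values concentrated in the bottom layers would match the paper's diagnosis.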
Evidence
- Biography synthetic dataset: Freeze (7 layers + early stop) achieved Task 0 accuracy 44.22% vs SEQ 11.18%, best competitor (Task Vector) 30.75%
- Safety Alignment: freezing 6 layers dropped jailbreak rate from 99.80% to 1.15% (LLaMa-2-7B-Chat)
- Recovery experiment: the recovered Task 0 accuracy (~96%) persisted even after 150 steps of Task 1 training — the knowledge itself is intact
- LLaMa-3-8B-Instruct math SFT: general capability avg 64.15 vs 66.11 with Freeze; math ability maintained (80.29 vs 80.17)
How to Apply
- During sequential fine-tuning: freeze the bottom 1-3 layers (plus embeddings) immediately after the first task, then train subsequent tasks with those layers fixed. The more similar the task formats, the more layers can be safely frozen.
- If additional fine-tuning is needed after safety alignment, freeze the lower 6 layers during fine-tuning to greatly suppress safety alignment collapse.
- Even for single-task fine-tuning like code/math SFT: freezing just the bottom 1 layer reduces general capability degradation while maintaining target performance.
Code Example
# Example of freezing the bottom N layers with HuggingFace Transformers
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
N_FREEZE_LAYERS = 2  # number of bottom layers to freeze

# Freeze embeddings + the bottom N transformer layers
for param in model.model.embed_tokens.parameters():
    param.requires_grad = False
for i in range(N_FREEZE_LAYERS):
    for param in model.model.layers[i].parameters():
        param.requires_grad = False

# Check the fraction of trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable/total:.1%}")
# Proceed with standard fine-tuning afterwards
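The freezing recipe can be sanity-checked end to end on a toy stand-in model before committing GPU time — a sketch where `embed`, `layers`, and `head` are illustrative substitutes for the HuggingFace module tree, not real model attributes. After a few optimizer steps, frozen layers must be bit-identical to their starting weights while unfrozen layers move.

```python
import torch
from torch import nn

# Toy stand-in for a transformer: embedding + a stack of "layers" + head.
# In a real run, these correspond to model.model.embed_tokens,
# model.model.layers, and model.lm_head.
torch.manual_seed(0)
embed = nn.Embedding(100, 16)
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
head = nn.Linear(16, 100)

N_FREEZE_LAYERS = 2
for p in embed.parameters():
    p.requires_grad = False
for i in range(N_FREEZE_LAYERS):
    for p in layers[i].parameters():
        p.requires_grad = False

# Only pass trainable parameters to the optimizer.
params = [p for m in (embed, *layers, head) for p in m.parameters()]
opt = torch.optim.AdamW((p for p in params if p.requires_grad), lr=1e-2)

frozen_before = layers[0].weight.detach().clone()
trained_before = layers[2].weight.detach().clone()

for _ in range(10):  # stay well inside the critical early window
    x = torch.randint(0, 100, (8,))
    h = embed(x)
    for layer in layers:
        h = torch.relu(layer(h))
    loss = nn.functional.cross_entropy(head(h), x)
    opt.zero_grad()
    loss.backward()
    opt.step()

assert torch.equal(layers[0].weight, frozen_before)       # frozen: untouched
assert not torch.equal(layers[2].weight, trained_before)  # unfrozen: updated
```

The same two checks (save a frozen layer's weights before training, compare after) are worth keeping in a real fine-tuning script, since a single missed `requires_grad = False` silently undoes the freeze.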
Original Abstract
Recent advancements in large language models (LLMs) reveal a perplexing phenomenon in continual learning: despite extensive training, models experience significant performance declines, raising questions about task alignment and underlying knowledge retention. This study first explores the concept of "spurious forgetting", proposing that such performance drops often reflect a decline in task alignment rather than true knowledge loss. Through controlled experiments with a synthesized dataset, we investigate the dynamics of model performance during the initial training phases of new tasks, discovering that early optimization steps can disrupt previously established task alignments. Our theoretical analysis connects these shifts to orthogonal updates in model weights, providing a robust framework for understanding this behavior. Ultimately, we introduce a Freezing strategy that fixes the bottom layers of the model, leading to substantial improvements in four continual learning scenarios. Our findings underscore the critical distinction between task alignment and knowledge retention, paving the way for more effective strategies in continual learning.