SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
TL;DR Highlight
A lightweight framework that restores safety after LLM fine-tuning by identifying only the layers where alignment breaks down and merging them with a safe model.
Who Should Read
ML engineers who fine-tune open-source LLMs like Llama or Qwen for specific domains, or AI backend developers concerned about LLM safety.
Core Mechanics
- Fine-tuning alone breaks safety alignment — even training on benign tasks like math or medicine (without any malicious data) can increase harmful response rates by up to 5x
- SafeMERGE intervenes at the post-fine-tuning stage — no changes to the existing training pipeline are required
- Core idea: detect per-layer safety deviation using cosine similarity, then selectively merge only the at-risk layers with the corresponding layers from a safe model
- On Llama-2-7B-Chat (GSM8K), DirectHarm drops from 27.80% to 7.50% while accuracy is essentially maintained (27.37% → 26.96%)
- On Llama-3.1-8B-Instruct, SafeMERGE achieves 78.50% accuracy — higher than fine-tuning alone (78.24%) — while simultaneously achieving the lowest harmfulness
- The safety model is task-agnostic and only needs to be built once for reuse across multiple fine-tuning tasks — 100–2500 samples from a public safety dataset are sufficient
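The per-layer selection criterion in the third bullet can be sketched with plain tensors: project a layer's weight update onto the safety direction V = W_aligned − W_base and measure cosine similarity. The sketch below uses synthetic rank-1 data; `layer_deviation` and the toy tensors are illustrative names and values, not taken from the paper.

```python
import torch

def layer_deviation(delta_w, v_safe):
    """Cosine similarity between a layer's update and its projection
    onto the safety subspace defined by v_safe (lower = more drift)."""
    v = v_safe / torch.norm(v_safe, 'fro')   # Frobenius-normalize V
    projection = v @ v.T @ delta_w           # project the update onto the subspace
    return torch.nn.functional.cosine_similarity(
        delta_w.flatten(), projection.flatten(), dim=0
    ).item()

torch.manual_seed(0)
a, b, c = torch.randn(16), torch.randn(16), torch.randn(16)
v_safe = torch.outer(a, b)     # toy rank-1 safety direction
aligned = torch.outer(a, c)    # update lying inside the safety subspace
drifted = torch.randn(16, 16)  # unrelated update

print(round(layer_deviation(aligned, v_safe), 3))   # → 1.0 (no drift)
print(layer_deviation(drifted, v_safe) < 0.7)       # → True: below tau, merge this layer
```

A layer whose update stays inside the safety subspace scores near 1 and is left untouched; a layer whose update has drifted scores below τ and is flagged for merging.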
Evidence
- Qwen-2-7B-Instruct (GSM8K): SafeMERGE DirectHarm 8.20% vs. fine-tuning 25.30% — lowest harmfulness among competing methods, a 3x improvement over SafeLoRA (22.30%)
- Llama-3.1-8B-Instruct (PubMedQA): SafeMERGE DirectHarm 9.10% vs. fine-tuning 23.50%, with utility at 79.00% — higher than fine-tuning alone (78.80%)
- Using only 1000 samples to train the safety model achieves DirectHarm of 1.30% on Llama-2 and 6.30% on Llama-3.1 — minimizing the cost of additional data
- At threshold τ=0.7, Llama-2 merges only 28 out of 56 total layers — modifying fewer than half the layers is sufficient for effective safety restoration
How to Apply
- Keep your existing LoRA fine-tuning pipeline as-is. After training: (1) train a separate safety LoRA adapter once, using ~1000 samples from a public safety dataset (e.g., the Bianchi et al. 2024 safety collection); (2) compute the per-layer safety subspace from the weight difference between the base and instruct models; (3) linearly merge only the layers where cosine similarity < τ (≈0.7), using a [0.8, 0.2] blending ratio
- If you are running fine-tuned services in specialized domains such as medical, legal, or telecommunications, you can add SafeMERGE as a post-processing step before deployment to reduce harmful response rates to below the level of the original instruct model
- Since the safety LoRA adapter is task-agnostic and reusable, if you operate models across multiple domains, you only need to build the adapter once and apply it to all fine-tuned model outputs
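If the merge step produces a dict of per-layer weight deltas, the deployment-time post-processing reduces to W_final = W_base + ΔW_merged for each touched layer. A minimal sketch of that last step, assuming deltas keyed by parameter name; `apply_merged_deltas` and the toy state dict are illustrative, not the paper's API:

```python
import torch

def apply_merged_deltas(base_state, merged_deltas):
    """Fold merged per-layer deltas back into base weights:
    W_final = W_base + delta_W for every layer the merge touched."""
    final_state = {name: w.clone() for name, w in base_state.items()}
    for name, delta in merged_deltas.items():
        final_state[name] += delta
    return final_state

# Toy demonstration with a single 2x2 "layer"
base = {"layers.0.weight": torch.zeros(2, 2)}
deltas = {"layers.0.weight": torch.ones(2, 2)}
final = apply_merged_deltas(base, deltas)
print(final["layers.0.weight"])  # all-ones tensor: base + delta
```

Layers that were not flagged simply never appear in `merged_deltas`, so their weights pass through unchanged.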
Code Example
# SafeMERGE core logic (pseudocode based on PyTorch + PEFT)
import torch
from peft import PeftModel

def compute_safety_subspace(aligned_model, base_model, layer_name):
    """V_i = W_aligned - W_base (per-layer safety alignment direction)"""
    W_aligned = dict(aligned_model.named_parameters())[layer_name]
    W_base = dict(base_model.named_parameters())[layer_name]
    return W_aligned - W_base

def cosine_similarity_to_subspace(delta_W, V):
    """Measures how far the LoRA update has drifted from the safety subspace"""
    V_norm = V / torch.norm(V, 'fro')
    projection = V_norm @ V_norm.T @ delta_W  # C @ delta_W, with C = V_norm V_norm^T
    cos_sim = torch.nn.functional.cosine_similarity(
        delta_W.flatten().unsqueeze(0),
        projection.flatten().unsqueeze(0),
    )
    return cos_sim.item()

def safe_merge(
    finetuned_lora,   # LoRA adapter from task fine-tuning: {layer_name: delta_W}
    safe_lora,        # LoRA adapter trained on safety data
    aligned_model,    # instruct/chat model
    base_model,       # base model
    tau=0.7,          # cosine similarity threshold
    alpha=0.8,        # fine-tuned model weight (1 - alpha = safety model weight)
):
    merged_weights = {}
    for layer_name, delta_W_f in finetuned_lora.items():
        # Compute the per-layer safety subspace
        V = compute_safety_subspace(aligned_model, base_model, layer_name)
        # Measure this layer's cosine similarity to the subspace
        rho = cosine_similarity_to_subspace(delta_W_f, V)
        if rho < tau:
            # Layer with degraded safety → linearly merge with the safety model
            delta_W_s = safe_lora[layer_name]
            merged_weights[layer_name] = alpha * delta_W_f + (1 - alpha) * delta_W_s
            print(f"[MERGE] {layer_name}: rho={rho:.3f} < tau={tau}")
        else:
            # Safe layer → keep fine-tuned weights as-is
            merged_weights[layer_name] = delta_W_f
    return merged_weights

# Usage example
# merged = safe_merge(finetuned_lora, safe_lora, aligned_model, base_model, tau=0.7, alpha=0.8)
# → Returns a final LoRA adapter with only the at-risk layers selectively merged
Original Abstract
Fine-tuning large language models (LLMs) is a common practice to adapt generalist models to specialized domains. However, recent studies show that fine-tuning can erode safety alignment, causing LLMs to respond to harmful or unethical prompts. Many methods to realign safety have been proposed, but often introduce custom algorithms that are difficult to implement or compromise task utility. In this work, we propose SafeMERGE, a lightweight, post-fine-tuning framework that preserves safety while maintaining downstream performance. SafeMERGE selectively merges fine-tuned with safety-aligned model layers only when they deviate from safe behavior, measured by a cosine similarity criterion. Across three LLMs and two tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility. Our results demonstrate that selective layer-wise merging offers an effective safeguard against the inadvertent loss of safety during fine-tuning, establishing SafeMERGE as a simple post-fine-tuning defense.