SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
TL;DR Highlight
A lightweight framework that restores safety after LLM fine-tuning by identifying only the layers where alignment breaks down and merging them with a safe model.
Who Should Read
ML engineers who fine-tune open-source LLMs like Llama or Qwen for specific domains, or AI backend developers concerned about LLM safety.
Core Mechanics
- Fine-tuning alone breaks safety alignment — even training on benign tasks like math or medicine (without any malicious data) can increase harmful response rates by up to 5x
- SafeMERGE intervenes at the post-fine-tuning stage — no changes to the existing training pipeline are required
- Core idea: detect per-layer safety deviation using cosine similarity, then selectively merge only the at-risk layers with the corresponding layers from a safe model
- On Llama-2-7B-Chat (GSM8K), DirectHarm drops from 27.80% to 7.50% while accuracy is essentially maintained (27.37% → 26.96%)
- On Llama-3.1-8B-Instruct, SafeMERGE achieves 78.50% accuracy — higher than fine-tuning alone (78.24%) — while simultaneously achieving the lowest harmfulness
- The safety model is task-agnostic and only needs to be built once for reuse across multiple fine-tuning tasks — 100–2500 samples from a public safety dataset are sufficient
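The per-layer selection criterion in the third bullet can be sketched with plain tensors: project a layer's weight update onto the safety direction V = W_aligned − W_base and measure cosine similarity. The sketch below uses synthetic rank-1 data; `layer_deviation` and the toy tensors are illustrative names and values, not taken from the paper.

```python
import torch

def layer_deviation(delta_w, v_safe):
    """Cosine similarity between a layer's update and its projection
    onto the safety subspace defined by v_safe (lower = more drift)."""
    v = v_safe / torch.norm(v_safe, 'fro')   # Frobenius-normalize V
    projection = v @ v.T @ delta_w           # project the update onto the subspace
    return torch.nn.functional.cosine_similarity(
        delta_w.flatten(), projection.flatten(), dim=0
    ).item()

torch.manual_seed(0)
a, b, c = torch.randn(16), torch.randn(16), torch.randn(16)
v_safe = torch.outer(a, b)     # toy rank-1 safety direction
aligned = torch.outer(a, c)    # update lying inside the safety subspace
drifted = torch.randn(16, 16)  # unrelated update

print(round(layer_deviation(aligned, v_safe), 3))   # → 1.0 (no drift)
print(layer_deviation(drifted, v_safe) < 0.7)       # → True: below tau, merge this layer
```

A layer whose update stays inside the safety subspace scores near 1 and is left untouched; a layer whose update has drifted scores below τ and is flagged for merging.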
Evidence
- Qwen-2-7B-Instruct (GSM8K): SafeMERGE DirectHarm 8.20% vs. fine-tuning 25.30% — lowest harmfulness among competing methods, a 3x improvement over SafeLoRA (22.30%)
- Llama-3.1-8B-Instruct (PubMedQA): SafeMERGE DirectHarm 9.10% vs. fine-tuning 23.50%, with utility at 79.00% — higher than fine-tuning alone (78.80%)
- Using only 1000 samples to train the safety model achieves DirectHarm of 1.30% on Llama-2 and 6.30% on Llama-3.1 — minimizing the cost of additional data
- At threshold τ=0.7, Llama-2 merges only 28 out of 56 total layers — modifying fewer than half the layers is sufficient for effective safety restoration
How to Apply
- Keep your existing LoRA fine-tuning pipeline as-is. After training: (1) train a separate safety LoRA adapter once, using ~1000 samples from a public safety dataset (e.g., the Bianchi et al. 2024 safety collection); (2) compute the per-layer safety subspace from the weight difference between the base and instruct models; (3) linearly merge only the layers where cosine similarity < τ (≈0.7), using a [0.8, 0.2] blending ratio
- If you are running fine-tuned services in specialized domains such as medical, legal, or telecommunications, you can add SafeMERGE as a post-processing step before deployment to reduce harmful response rates to below the level of the original instruct model
- Since the safety LoRA adapter is task-agnostic and reusable, if you operate models across multiple domains, you only need to build the adapter once and apply it to all fine-tuned model outputs
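If the merge step produces a dict of per-layer weight deltas, the deployment-time post-processing reduces to W_final = W_base + ΔW_merged for each touched layer. A minimal sketch of that last step, assuming deltas keyed by parameter name; `apply_merged_deltas` and the toy state dict are illustrative, not the paper's API:

```python
import torch

def apply_merged_deltas(base_state, merged_deltas):
    """Fold merged per-layer deltas back into base weights:
    W_final = W_base + delta_W for every layer the merge touched."""
    final_state = {name: w.clone() for name, w in base_state.items()}
    for name, delta in merged_deltas.items():
        final_state[name] += delta
    return final_state

# Toy demonstration with a single 2x2 "layer"
base = {"layers.0.weight": torch.zeros(2, 2)}
deltas = {"layers.0.weight": torch.ones(2, 2)}
final = apply_merged_deltas(base, deltas)
print(final["layers.0.weight"])  # all-ones tensor: base + delta
```

Layers that were not flagged simply never appear in `merged_deltas`, so their weights pass through unchanged.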
Code Example
# SafeMERGE core logic (pseudocode based on PyTorch + PEFT)
import torch
from peft import PeftModel

def compute_safety_subspace(aligned_model, base_model, layer_name):
    """V_i = W_aligned - W_base (per-layer safety alignment direction)"""
    W_aligned = dict(aligned_model.named_parameters())[layer_name]
    W_base = dict(base_model.named_parameters())[layer_name]
    return W_aligned - W_base

def cosine_similarity_to_subspace(delta_W, V):
    """Measures how far the LoRA update has drifted from the safety subspace"""
    V_norm = V / torch.norm(V, 'fro')
    projection = V_norm @ V_norm.T @ delta_W  # C @ delta_W, with C = V_norm V_norm^T
    cos_sim = torch.nn.functional.cosine_similarity(
        delta_W.flatten().unsqueeze(0),
        projection.flatten().unsqueeze(0),
    )
    return cos_sim.item()

def safe_merge(
    finetuned_lora,   # LoRA adapter from task fine-tuning: {layer_name: delta_W}
    safe_lora,        # LoRA adapter trained on safety data
    aligned_model,    # instruct/chat model
    base_model,       # base model
    tau=0.7,          # cosine similarity threshold
    alpha=0.8,        # fine-tuned model weight (1 - alpha = safety model weight)
):
    merged_weights = {}
    for layer_name, delta_W_f in finetuned_lora.items():
        # Compute the per-layer safety subspace
        V = compute_safety_subspace(aligned_model, base_model, layer_name)
        # Measure this layer's cosine similarity to the subspace
        rho = cosine_similarity_to_subspace(delta_W_f, V)
        if rho < tau:
            # Layer with degraded safety → linearly merge with the safety model
            delta_W_s = safe_lora[layer_name]
            merged_weights[layer_name] = alpha * delta_W_f + (1 - alpha) * delta_W_s
            print(f"[MERGE] {layer_name}: rho={rho:.3f} < tau={tau}")
        else:
            # Safe layer → keep fine-tuned weights as-is
            merged_weights[layer_name] = delta_W_f
    return merged_weights

# Usage example
# merged = safe_merge(finetuned_lora, safe_lora, aligned_model, base_model, tau=0.7, alpha=0.8)
# → Returns a final LoRA adapter with only the at-risk layers selectively merged
Original Abstract
Fine-tuning large language models (LLMs) is a common practice to adapt generalist models to specialized domains. However, recent studies show that fine-tuning can erode safety alignment, causing LLMs to respond to harmful or unethical prompts. Many methods to realign safety have been proposed, but often introduce custom algorithms that are difficult to implement or compromise task utility. In this work, we propose SafeMERGE, a lightweight, post-fine-tuning framework that preserves safety while maintaining downstream performance. SafeMERGE selectively merges fine-tuned with safety-aligned model layers only when they deviate from safe behavior, measured by a cosine similarity criterion. Across three LLMs and two tasks, SafeMERGE consistently reduces harmful outputs compared to other defenses, with negligible or even positive impact on utility. Our results demonstrate that selective layer-wise merging offers an effective safeguard against the inadvertent loss of safety during fine-tuning, establishing SafeMERGE as a simple post-fine-tuning defense.