Don't Start Over: A Cost-Effective Framework for Migrating Personalized Prompts Between LLMs
TL;DR Highlight
A lightweight adapter framework that migrates per-user soft prompts across LLM upgrades at up to 98% lower compute cost
Who Should Read
ML engineers running LLM-based recommendation or personalization services, especially teams that need to preserve tens of thousands of user profiles across model upgrades.
Core Mechanics
- Soft prompts (lightweight vectors encoding user preferences) are tied to a specific LLM, requiring full retraining of all user data when switching models
- PUMA maps old model soft prompts to the new model's space using a single small feed-forward adapter, enabling migration without retraining
- Clusters all users via K-means, then uses stratified sampling by behavioral variance within each cluster to train the adapter on just 2,000 representative users
- Works not only for Llama-3.2-1B → Llama-3.2-3B but also across entirely different architectures (LLaMA → Qwen, Phi, Gemma, StableLM)
- Aggregated migration combining prompts from multiple source models into one target model actually outperforms single-source (knowledge synergy)
- Performance remains stable even through chain migrations A→B→C→D→E
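The mechanics above can be sketched as a small training loop in which both LLMs stay frozen and only the adapter is optimized. Everything below (the dimensions, the linear stand-ins for the adapter and the target model's scoring head, and the MSE objective) is illustrative, not the paper's exact setup:

```python
import torch
import torch.nn as nn

# Hypothetical dims: 2048-dim source prompts -> 3072-dim target prompts.
SOURCE_DIM, TARGET_DIM = 2048, 3072

adapter = nn.Linear(SOURCE_DIM, TARGET_DIM)  # stand-in for the PUMA adapter
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# Frozen stand-in for the target LLM's prompt-conditioned scoring head.
target_head = nn.Linear(TARGET_DIM, 1)
for p in target_head.parameters():
    p.requires_grad_(False)

old_prompts = torch.randn(32, SOURCE_DIM)  # soft prompts from the old model
labels = torch.randn(32, 1)                # e.g. held-out user ratings

for _ in range(3):
    migrated = adapter(old_prompts)        # map into the new model's space
    loss = nn.functional.mse_loss(target_head(migrated), labels)
    optimizer.zero_grad()
    loss.backward()                        # gradients flow only into the adapter
    optimizer.step()
```

The key property is that gradients update only the adapter: the old prompts and both models are treated as fixed assets.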
Evidence
- Amazon dataset: PUMA RMSE 0.9135, better than full retraining (0.9414); MIND uAUC 0.6552 vs 0.5289
- Training time: 50x faster than full retraining (Amazon: 24hrs → 0.48hrs), up to 98% compute cost reduction
- PUMA with 2,000 selected users (RMSE 0.9315) outperforms random sampling with 6,000 users (RMSE 0.9320) using only a third of the data
- Llama+StableLM aggregated migration (RMSE 0.9217) outperforms single-source (Llama 0.9293, StableLM 0.9380)
How to Apply
- If running a 1+N system (one LLM + thousands of per-user soft prompts) and need to upgrade, train just a PUMA adapter instead of full retraining to migrate existing profiles
- When consolidating after A/B testing or multi-model operation, concatenate user prompts from each model and apply aggregated migration to the target
- Also applicable for new-user cold-start: use the adapter trained on the old model to quickly initialize new user prompts, reducing retraining time
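The aggregated-migration recipe from the second bullet can be sketched by concatenating each user's prompts from the source models and training a single adapter over the joint vector. All dimensions and names below are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical dims for two source models (e.g. Llama and StableLM) and one target.
DIM_A, DIM_B, TARGET_DIM = 2048, 2560, 3072

# One adapter over the concatenated source prompts (aggregated migration sketch).
agg_adapter = nn.Linear(DIM_A + DIM_B, TARGET_DIM)

prompt_a = torch.randn(4, 1, DIM_A)  # (batch, prompt_len=1, dim) from model A
prompt_b = torch.randn(4, 1, DIM_B)  # same users' prompts from model B
merged = torch.cat([prompt_a, prompt_b], dim=-1)
target_prompt = agg_adapter(merged)  # one migrated prompt per user
```

Because the adapter sees both sources at once, it can exploit the "knowledge synergy" the Evidence section reports for Llama+StableLM aggregation.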
Code Example
# PUMA adapter structure (PyTorch-like code)
import torch
import torch.nn as nn


class PUMAAdapter(nn.Module):
    """
    source_dim: old LLM embedding dimension (e.g., from the 1B model)
    target_dim: new LLM embedding dimension (e.g., from the 3B model)
    The soft prompt length (l = 1 in the paper) is carried in the input
    tensor's second dimension, so it is not a constructor argument.
    """

    def __init__(self, source_dim: int, target_dim: int):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(source_dim, target_dim * 2),
            nn.LayerNorm(target_dim * 2),
            nn.GELU(),
            nn.Linear(target_dim * 2, target_dim),
        )
        # Projection so the residual connection matches the target dimension
        self.residual_proj = nn.Linear(source_dim, target_dim)

    def forward(self, source_prompt: torch.Tensor) -> torch.Tensor:
        # source_prompt: (batch, prompt_len, source_dim)
        return self.adapter(source_prompt) + self.residual_proj(source_prompt)
# User selection strategy (group-based)
from sklearn.cluster import KMeans
import numpy as np


def select_representative_users(
    prompt_embeddings: np.ndarray,  # (num_users, emb_dim)
    output_variance: np.ndarray,    # (num_users,)
    n_clusters: int = 50,
    budget: int = 2000,
) -> list[int]:
    # Stage 1: preference-diversity clustering with K-means
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(prompt_embeddings)

    selected_indices = []
    per_cluster_budget = budget // n_clusters
    for c in range(n_clusters):
        cluster_idx = np.where(cluster_labels == c)[0]
        if len(cluster_idx) == 0:
            continue  # guard against empty clusters
        cluster_var = output_variance[cluster_idx]

        # Stage 2: variance-based stratified sampling (weighted toward mid-variance users)
        bins = np.percentile(cluster_var, [33, 66])
        low = cluster_idx[cluster_var <= bins[0]]
        mid = cluster_idx[(cluster_var > bins[0]) & (cluster_var <= bins[1])]
        high = cluster_idx[cluster_var > bins[1]]

        # Normal-distribution-style weights: allocate more to the middle group
        weights = [1, 2, 1]  # low : mid : high
        total_w = sum(weights)
        for group, w in zip([low, mid, high], weights):
            n = max(1, int(per_cluster_budget * w / total_w))
            if len(group) > 0:
                chosen = np.random.choice(group, min(n, len(group)), replace=False)
                selected_indices.extend(chosen.tolist())
    return selected_indices[:budget]
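The chained-migration scenario from Core Mechanics (A→B→C→D→E) amounts to composing one adapter per hop. A minimal sketch with linear stand-ins for the adapters and made-up embedding dimensions:

```python
import torch
import torch.nn as nn

# Hypothetical embedding dims along a migration chain A -> B -> C.
DIMS = [1024, 2048, 3072]

# One small adapter per hop; in PUMA each would be trained at upgrade time.
hops = [nn.Linear(d_in, d_out) for d_in, d_out in zip(DIMS, DIMS[1:])]

prompt = torch.randn(8, 1, DIMS[0])  # a user's soft prompt on model A
for hop in hops:
    prompt = hop(prompt)             # migrate hop by hop
```

The reported result is that quality stays stable across such chains, so no hop needs access to the original user data.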
Original Abstract
Personalization in Large Language Models (LLMs) often relies on user-specific soft prompts. However, these prompts become obsolete when the foundation model is upgraded, necessitating costly, full-scale retraining. To overcome this limitation, we propose the Prompt-level User Migration Adapter (PUMA), a lightweight framework to efficiently migrate personalized prompts across incompatible models. PUMA utilizes a parameter-efficient adapter to bridge the semantic gap, combined with a group-based user selection strategy to significantly reduce training costs. Experiments on three large-scale datasets show our method matches or even surpasses the performance of retraining from scratch, reducing computational cost by up to 98%. The framework demonstrates strong generalization across diverse model architectures and robustness in advanced scenarios like chained and aggregated migrations, offering a practical path for the sustainable evolution of personalized AI by decoupling user assets from the underlying models.