Don't Start Over: A Cost-Effective Framework for Migrating Personalized Prompts Between LLMs
TL;DR Highlight
A lightweight adapter framework that migrates per-user soft prompts across LLM upgrades at up to 98% lower compute cost
Who Should Read
ML engineers running LLM-based recommendation or personalization services, especially teams that need to preserve tens of thousands of user profiles across model upgrades.
Core Mechanics
- Soft prompts (lightweight vectors encoding user preferences) are tied to a specific LLM, requiring full retraining of all user data when switching models
- PUMA maps old model soft prompts to the new model's space using a single small feed-forward adapter, enabling migration without retraining
- Clusters all users via K-means, then uses stratified sampling by behavioral variance within each cluster to train the adapter on just 2,000 representative users
- Works not only for Llama-3.2-1B → Llama-3.2-3B but also across entirely different architectures (LLaMA → Qwen, Phi, Gemma, StableLM)
- Aggregated migration combining prompts from multiple source models into one target model actually outperforms single-source (knowledge synergy)
- Performance remains stable even through chain migrations A→B→C→D→E
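The mechanics above can be sketched as a small training loop in which both LLMs stay frozen and only the adapter is optimized. Everything below (the dimensions, the linear stand-ins for the adapter and the target model's scoring head, and the MSE objective) is illustrative, not the paper's exact setup:

```python
import torch
import torch.nn as nn

# Hypothetical dims: 2048-dim source prompts -> 3072-dim target prompts.
SOURCE_DIM, TARGET_DIM = 2048, 3072

adapter = nn.Linear(SOURCE_DIM, TARGET_DIM)  # stand-in for the PUMA adapter
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# Frozen stand-in for the target LLM's prompt-conditioned scoring head.
target_head = nn.Linear(TARGET_DIM, 1)
for p in target_head.parameters():
    p.requires_grad_(False)

old_prompts = torch.randn(32, SOURCE_DIM)  # soft prompts from the old model
labels = torch.randn(32, 1)                # e.g. held-out user ratings

for _ in range(3):
    migrated = adapter(old_prompts)        # map into the new model's space
    loss = nn.functional.mse_loss(target_head(migrated), labels)
    optimizer.zero_grad()
    loss.backward()                        # gradients flow only into the adapter
    optimizer.step()
```

The key property is that gradients update only the adapter: the old prompts and both models are treated as fixed assets.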
Evidence
- Amazon dataset: PUMA RMSE 0.9135, better than full retraining (0.9414); MIND uAUC 0.6552 vs 0.5289
- Training time: 50x faster than full retraining (Amazon: 24hrs → 0.48hrs), up to 98% compute cost reduction
- PUMA with 2,000 selected users (RMSE 0.9315) outperforms random sampling with 6,000 users (RMSE 0.9320) using only a third of the data
- Llama+StableLM aggregated migration (RMSE 0.9217) outperforms single-source (Llama 0.9293, StableLM 0.9380)
How to Apply
- If running a 1+N system (one LLM + thousands of per-user soft prompts) and need to upgrade, train just a PUMA adapter instead of full retraining to migrate existing profiles
- When consolidating after A/B testing or multi-model operation, concatenate user prompts from each model and apply aggregated migration to the target
- Also applicable for new-user cold-start: use the adapter trained on the old model to quickly initialize new user prompts, reducing retraining time
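The aggregated-migration recipe from the second bullet can be sketched by concatenating each user's prompts from the source models and training a single adapter over the joint vector. All dimensions and names below are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical dims for two source models (e.g. Llama and StableLM) and one target.
DIM_A, DIM_B, TARGET_DIM = 2048, 2560, 3072

# One adapter over the concatenated source prompts (aggregated migration sketch).
agg_adapter = nn.Linear(DIM_A + DIM_B, TARGET_DIM)

prompt_a = torch.randn(4, 1, DIM_A)  # (batch, prompt_len=1, dim) from model A
prompt_b = torch.randn(4, 1, DIM_B)  # same users' prompts from model B
merged = torch.cat([prompt_a, prompt_b], dim=-1)
target_prompt = agg_adapter(merged)  # one migrated prompt per user
```

Because the adapter sees both sources at once, it can exploit the "knowledge synergy" the Evidence section reports for Llama+StableLM aggregation.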
Code Example
# PUMA adapter structure (PyTorch-like code)
import torch
import torch.nn as nn


class PUMAAdapter(nn.Module):
    """
    source_dim: old LLM embedding dimension (e.g., from the 1B model)
    target_dim: new LLM embedding dimension (e.g., from the 3B model)
    The soft prompt length (l = 1 in the paper) is carried in the input
    tensor's second dimension, so it is not a constructor argument.
    """

    def __init__(self, source_dim: int, target_dim: int):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(source_dim, target_dim * 2),
            nn.LayerNorm(target_dim * 2),
            nn.GELU(),
            nn.Linear(target_dim * 2, target_dim),
        )
        # Projection so the residual connection matches the target dimension
        self.residual_proj = nn.Linear(source_dim, target_dim)

    def forward(self, source_prompt: torch.Tensor) -> torch.Tensor:
        # source_prompt: (batch, prompt_len, source_dim)
        return self.adapter(source_prompt) + self.residual_proj(source_prompt)
# User selection strategy (group-based)
from sklearn.cluster import KMeans
import numpy as np


def select_representative_users(
    prompt_embeddings: np.ndarray,  # (num_users, emb_dim)
    output_variance: np.ndarray,    # (num_users,)
    n_clusters: int = 50,
    budget: int = 2000,
) -> list[int]:
    # Stage 1: preference-diversity clustering with K-means
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(prompt_embeddings)

    selected_indices = []
    per_cluster_budget = budget // n_clusters
    for c in range(n_clusters):
        cluster_idx = np.where(cluster_labels == c)[0]
        if len(cluster_idx) == 0:
            continue  # guard against empty clusters
        cluster_var = output_variance[cluster_idx]

        # Stage 2: variance-based stratified sampling (weighted toward mid-variance users)
        bins = np.percentile(cluster_var, [33, 66])
        low = cluster_idx[cluster_var <= bins[0]]
        mid = cluster_idx[(cluster_var > bins[0]) & (cluster_var <= bins[1])]
        high = cluster_idx[cluster_var > bins[1]]

        # Normal-distribution-style weights: allocate more to the middle group
        weights = [1, 2, 1]  # low : mid : high
        total_w = sum(weights)
        for group, w in zip([low, mid, high], weights):
            n = max(1, int(per_cluster_budget * w / total_w))
            if len(group) > 0:
                chosen = np.random.choice(group, min(n, len(group)), replace=False)
                selected_indices.extend(chosen.tolist())
    return selected_indices[:budget]
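The chained-migration scenario from Core Mechanics (A→B→C→D→E) amounts to composing one adapter per hop. A minimal sketch with linear stand-ins for the adapters and made-up embedding dimensions:

```python
import torch
import torch.nn as nn

# Hypothetical embedding dims along a migration chain A -> B -> C.
DIMS = [1024, 2048, 3072]

# One small adapter per hop; in PUMA each would be trained at upgrade time.
hops = [nn.Linear(d_in, d_out) for d_in, d_out in zip(DIMS, DIMS[1:])]

prompt = torch.randn(8, 1, DIMS[0])  # a user's soft prompt on model A
for hop in hops:
    prompt = hop(prompt)             # migrate hop by hop
```

The reported result is that quality stays stable across such chains, so no hop needs access to the original user data.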
Original Abstract
Personalization in Large Language Models (LLMs) often relies on user-specific soft prompts. However, these prompts become obsolete when the foundation model is upgraded, necessitating costly, full-scale retraining. To overcome this limitation, we propose the Prompt-level User Migration Adapter (PUMA), a lightweight framework to efficiently migrate personalized prompts across incompatible models. PUMA utilizes a parameter-efficient adapter to bridge the semantic gap, combined with a group-based user selection strategy to significantly reduce training costs. Experiments on three large-scale datasets show our method matches or even surpasses the performance of retraining from scratch, reducing computational cost by up to 98%. The framework demonstrates strong generalization across diverse model architectures and robustness in advanced scenarios like chained and aggregated migrations, offering a practical path for the sustainable evolution of personalized AI by decoupling user assets from the underlying models.