Persona Vectors: Monitoring and Controlling Character Traits in Language Models
TL;DR Highlight
Personality traits such as 'evil', 'sycophancy', and 'hallucination' can be extracted as activation-space vectors, then used to screen problematic fine-tuning data before training and to prevent personality shifts during training.
Who Should Read
ML engineers fine-tuning LLMs on custom data who worry about unintended personality changes (excessive sycophancy, increased hallucination). Production AI operators wanting to prevent incidents like the GPT-4o sycophancy episode or Bing meltdown.
Core Mechanics
- Given only a personality trait name and a natural-language description, Claude 3.7 Sonnet auto-generates contrastive system prompts and evaluation questions; the persona vector is then extracted as the difference in mean activations between trait-exhibiting and trait-suppressing responses
- Projecting the last prompt token's activation onto the persona vector detects personality changes before any text is generated
- Fine-tuning-induced personality changes are strongly predicted by movement along the corresponding persona vector (correlations reported under Evidence)
- LLM judge (GPT-4.1-mini) agrees with human evaluators 94.7% of the time (300 pairwise comparisons)
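The last-prompt-token projection described above is just a dot product with the unit-norm persona vector. A minimal sketch, assuming the hidden states for the monitored layer have already been obtained (e.g. from `outputs.hidden_states[layer][0]` in a HuggingFace forward pass); the function name is illustrative, not from the paper's code:

```python
import torch

def last_token_projection(hidden_states: torch.Tensor, persona_vec: torch.Tensor) -> float:
    """Project the last prompt token's activation onto the persona vector.

    hidden_states: [seq_len, d_model] activations from the monitored layer.
    persona_vec:   [d_model] trait direction (re-normalized here for safety).
    """
    v = persona_vec / persona_vec.norm()
    return (hidden_states[-1] @ v).item()
```

A large projection on an incoming request signals that the response is likely to express the trait, before any tokens are generated.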
Evidence
- Personality change prediction correlation: evil r=0.83-0.95, sycophancy r=0.75-0.92, hallucination r=0.41-0.59 (both Qwen2.5-7B and Llama-3.1-8B)
- LLM judge (GPT-4.1-mini) vs human evaluator agreement: 94.7% (300 pairs — evil 97%, sycophancy ~92%, hallucination ~95%)
- Consistent results across two different model families
How to Apply
- Data screening before fine-tuning: generate the base model's natural response for each training sample, compute the projection difference between the dataset response and the natural response along the persona vector, and filter out the top outlier samples. Projecting only the last prompt token's activation is a cheaper approximation.
- Deployment monitoring: project each user request's last prompt token activation onto persona vectors in real-time to flag personality-shifted responses before they reach users.
- Apply activation steering with negative persona vectors during inference to suppress unwanted personality traits without retraining.
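The sample-level filtering step above reduces to ranking samples by projection difference and flagging the largest shifts toward the trait. A minimal sketch; `flag_outlier_samples` and `top_frac` are illustrative names, not from the paper's released code:

```python
def flag_outlier_samples(proj_diffs, top_frac=0.05):
    """proj_diffs: iterable of (sample_id, projection_difference) pairs.

    Returns the ids of the top_frac fraction of samples whose responses
    shift furthest toward the trait (candidates for removal or review).
    """
    ranked = sorted(proj_diffs, key=lambda pair: pair[1], reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    return [sample_id for sample_id, _ in ranked[:k]]
```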
Code Example
# Persona vector extraction and steering example (PyTorch + HuggingFace)
import itertools

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Persona vector extraction (difference-in-means)
def extract_persona_vector(model, tokenizer, pos_prompts, neg_prompts, questions, layer=20):
    def mean_activation(prompts):
        activations = []
        # run every (system prompt, question) combination through the model
        for prompt, question in itertools.product(prompts, questions):
            inputs = tokenizer(f"{prompt}\n\nQ: {question}\nA:", return_tensors="pt")
            with torch.no_grad():
                outputs = model(**inputs, output_hidden_states=True)
            hidden = outputs.hidden_states[layer][0]  # [seq_len, d_model]
            activations.append(hidden.mean(0))        # average over token positions
        return torch.stack(activations).mean(0)

    persona_vec = mean_activation(pos_prompts) - mean_activation(neg_prompts)
    return persona_vec / persona_vec.norm()  # unit-normalize

# 2. Inference-time steering (negative alpha suppresses the trait)
def generate_with_steering(model, tokenizer, prompt, persona_vec, alpha=-1.5, layer=20):
    def hook_fn(module, input, output):
        # decoder layers typically return a tuple with hidden states first
        if isinstance(output, tuple):
            hidden = output[0] + alpha * persona_vec.to(output[0].device)
            return (hidden,) + output[1:]
        return output + alpha * persona_vec.to(output.device)

    hook = model.model.layers[layer].register_forward_hook(hook_fn)
    try:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=200)
    finally:
        hook.remove()  # always detach the hook, even if generation fails
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# 3. Filtering training data using the projection difference
def compute_projection_difference(model, tokenizer, dataset_response, natural_response, persona_vec, layer=20):
    def get_activation(text):
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs, output_hidden_states=True)
        return outputs.hidden_states[layer][0].mean(0)

    a_dataset = get_activation(dataset_response)
    a_natural = get_activation(natural_response)
    return ((a_dataset - a_natural) @ persona_vec).item()  # higher value indicates greater risk
Terminology
Related Resources
Original Abstract
Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space (persona vectors) underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.