Persona Vectors: Monitoring and Controlling Character Traits in Language Models
TL;DR Highlight
Personality traits such as 'evil', 'sycophancy', and 'hallucination' can be extracted as activation-space vectors, then used to screen problematic fine-tuning data before training and to prevent personality shifts during training.
Who Should Read
ML engineers fine-tuning LLMs on custom data who worry about unintended personality changes (excessive sycophancy, increased hallucination). Production AI operators wanting to prevent incidents like the GPT-4o sycophancy episode or Bing meltdown.
Core Mechanics
- Given only a personality trait name and a natural-language description, Claude 3.7 Sonnet auto-generates contrastive system prompts and evaluation questions; the persona vector is then extracted as the difference in mean activations between trait-exhibiting and trait-suppressing responses
- Projecting the last prompt token's activation onto the persona vector detects personality changes before any text is generated
- Fine-tuning-induced personality changes are strongly predicted by movement along the corresponding persona vector (correlations reported under Evidence)
- LLM judge (GPT-4.1-mini) agrees with human evaluators 94.7% of the time (300 pairwise comparisons)
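The last-prompt-token projection described above is just a dot product with the unit-norm persona vector. A minimal sketch, assuming the hidden states for the monitored layer have already been obtained (e.g. from `outputs.hidden_states[layer][0]` in a HuggingFace forward pass); the function name is illustrative, not from the paper's code:

```python
import torch

def last_token_projection(hidden_states: torch.Tensor, persona_vec: torch.Tensor) -> float:
    """Project the last prompt token's activation onto the persona vector.

    hidden_states: [seq_len, d_model] activations from the monitored layer.
    persona_vec:   [d_model] trait direction (re-normalized here for safety).
    """
    v = persona_vec / persona_vec.norm()
    return (hidden_states[-1] @ v).item()
```

A large projection on an incoming request signals that the response is likely to express the trait, before any tokens are generated.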
Evidence
- Personality change prediction correlation: evil r=0.83-0.95, sycophancy r=0.75-0.92, hallucination r=0.41-0.59 (both Qwen2.5-7B and Llama-3.1-8B)
- LLM judge (GPT-4.1-mini) vs human evaluator agreement: 94.7% (300 pairs — evil 97%, sycophancy ~92%, hallucination ~95%)
- Consistent results across two different model families
How to Apply
- Data screening before fine-tuning: generate the base model's natural response for each training sample, compute the projection difference between the dataset response and the natural response along the persona vector, and filter out the top outlier samples. Projecting only the last prompt token's activation is a cheaper approximation.
- Deployment monitoring: project each user request's last prompt token activation onto persona vectors in real-time to flag personality-shifted responses before they reach users.
- Apply activation steering with negative persona vectors during inference to suppress unwanted personality traits without retraining.
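The sample-level filtering step above reduces to ranking samples by projection difference and flagging the largest shifts toward the trait. A minimal sketch; `flag_outlier_samples` and `top_frac` are illustrative names, not from the paper's released code:

```python
def flag_outlier_samples(proj_diffs, top_frac=0.05):
    """proj_diffs: iterable of (sample_id, projection_difference) pairs.

    Returns the ids of the top_frac fraction of samples whose responses
    shift furthest toward the trait (candidates for removal or review).
    """
    ranked = sorted(proj_diffs, key=lambda pair: pair[1], reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    return [sample_id for sample_id, _ in ranked[:k]]
```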
Code Example
# Persona vector extraction and steering example (PyTorch + HuggingFace)
import itertools

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Persona vector extraction (difference-in-means)
def extract_persona_vector(model, tokenizer, pos_prompts, neg_prompts, questions, layer=20):
    def mean_activation(prompts):
        activations = []
        # run every (system prompt, question) combination through the model
        for prompt, question in itertools.product(prompts, questions):
            inputs = tokenizer(f"{prompt}\n\nQ: {question}\nA:", return_tensors="pt")
            with torch.no_grad():
                outputs = model(**inputs, output_hidden_states=True)
            hidden = outputs.hidden_states[layer][0]  # [seq_len, d_model]
            activations.append(hidden.mean(0))        # average over token positions
        return torch.stack(activations).mean(0)

    persona_vec = mean_activation(pos_prompts) - mean_activation(neg_prompts)
    return persona_vec / persona_vec.norm()  # unit-normalize

# 2. Inference-time steering (negative alpha suppresses the trait)
def generate_with_steering(model, tokenizer, prompt, persona_vec, alpha=-1.5, layer=20):
    def hook_fn(module, input, output):
        # decoder layers typically return a tuple with hidden states first
        if isinstance(output, tuple):
            hidden = output[0] + alpha * persona_vec.to(output[0].device)
            return (hidden,) + output[1:]
        return output + alpha * persona_vec.to(output.device)

    hook = model.model.layers[layer].register_forward_hook(hook_fn)
    try:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=200)
    finally:
        hook.remove()  # always detach the hook, even if generation fails
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# 3. Filtering training data using the projection difference
def compute_projection_difference(model, tokenizer, dataset_response, natural_response, persona_vec, layer=20):
    def get_activation(text):
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs, output_hidden_states=True)
        return outputs.hidden_states[layer][0].mean(0)

    a_dataset = get_activation(dataset_response)
    a_natural = get_activation(natural_response)
    return ((a_dataset - a_natural) @ persona_vec).item()  # higher value indicates greater risk
Terminology
Related Resources
Original Abstract
Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space (persona vectors) underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.