Heretic: Automatic censorship removal for language models

TL;DR Highlight

A tool that automatically removes refusal behaviors from open-source LLMs without separate fine-tuning and with minimal capability degradation.

Who Should Read

Researchers studying LLM safety alignment, red teamers, and developers who need uncensored models for legitimate research or content applications.

Core Mechanics

Identifies and ablates the model components responsible for refusal behavior without full fine-tuning
Works by suppressing the refusal direction in representation space, either through activation steering or through targeted weight editing along that direction
Has minimal impact on general model capability (reported benchmark degradation is under 5%)
Runs faster and costs less than LoRA fine-tuning for a comparable result
Raises significant alignment and misuse concerns, because it makes safety guardrails on public models easy to remove
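
As a rough illustration of the weight-editing idea above, the following sketch removes a single direction from the output of one linear layer. It is a toy under stated assumptions, not Heretic's actual implementation: the function name `ablate_direction`, the matrix shapes, and the random data are all hypothetical, and a real model would apply this to many layers with tuned strengths.

```python
import numpy as np

def ablate_direction(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Return a copy of `weight` whose outputs have no component along `direction`.

    weight:    (d_model, d_in) matrix that writes into the residual stream.
    direction: (d_model,) vector; it is normalized inside the function.
    """
    r = direction / np.linalg.norm(direction)
    # W' = W - r (r^T W): subtract the part of every output that lies along r.
    return weight - np.outer(r, r @ weight)

# Hypothetical toy sizes and random data, for illustration only.
rng = np.random.default_rng(0)
d_model, d_in = 8, 4
W = rng.normal(size=(d_model, d_in))
refusal_dir = rng.normal(size=d_model)

W_ablated = ablate_direction(W, refusal_dir)

# For any input, the edited layer's output is orthogonal to the direction.
x = rng.normal(size=d_in)
unit = refusal_dir / np.linalg.norm(refusal_dir)
residual = float(unit @ (W_ablated @ x))
print(abs(residual) < 1e-9)
```

Because the edit is a projection applied once to the weights, it adds no inference-time cost, which is one plausible reason this family of techniques is cheaper than fine-tuning.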

Evidence

Benchmark comparisons showing capability preservation after refusal removal
Tested on Llama, Mistral, and other popular open-source models
Qualitative evaluation of removed refusals on previously blocked prompts

How to Apply

Use activation steering techniques to identify the 'refusal direction' in your model's representation space before attempting removal.
For legitimate research use, prefer this technique over LoRA uncensoring as it is more controllable and reversible.
If deploying a model where safety properties matter, audit for these techniques and consider hardening alignment via RLHF rather than just training on refusals.
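
One common way to estimate a refusal direction, sketched here under assumptions, is the difference between mean activations on prompts the model refuses and prompts it answers. The data below is synthetic with a planted axis standing in for real hidden states, and `refusal_direction` is a hypothetical helper name. The same projection can serve as a crude audit signal: if the gap between the two groups collapses in a deployed model, the direction may have been ablated.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the harmless mean to the harmful mean."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

# Synthetic stand-in for hidden states: (n_prompts, d_model) per group.
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[3] = 1.0  # hypothetical "refusal" axis planted in the data

harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * planted

direction = refusal_direction(harmful, harmless)

# How well the estimate recovers the planted axis.
cosine = float(direction @ planted)

# Audit signal: difference in mean projection between the two groups.
gap = float((harmful @ direction).mean() - (harmless @ direction).mean())
print(cosine > 0.9, gap > 2.0)
```

In practice the activations would come from a chosen layer and token position of the real model, and that choice is itself something to tune.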

Code Example


# Basic Heretic execution (model decensoring)
heretic --model google/gemma-3-12b-it

# Evaluate the resulting model
heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic

# A noslop configuration preset is also available beyond the default settings;
# see the config.noslop.toml file

Terminology

Activation Steering: Modifying model behavior at inference time by adding or subtracting a direction vector in the activation space, without weight updates.

Refusal Direction: A vector in the model's representation space associated with the decision to refuse a request; the target of ablation techniques.

Ablation: Selectively removing or disabling a model component to study its effect or change model behavior.

Representation Space: The high-dimensional vector space in which model activations live; directions in this space often correspond to interpretable concepts.
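
To make the difference between the first and third terms concrete, here is a minimal sketch of both operations applied to one activation vector. The helper names `steer` and `ablate` and the random vectors are assumptions for illustration; they are not taken from any particular library.

```python
import numpy as np

def steer(h: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Activation steering: shift h by alpha units along the direction."""
    r = direction / np.linalg.norm(direction)
    return h + alpha * r

def ablate(h: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Ablation: remove whatever component of h lies along the direction."""
    r = direction / np.linalg.norm(direction)
    return h - (h @ r) * r

# Hypothetical activation and direction vectors.
rng = np.random.default_rng(1)
h = rng.normal(size=8)
r = rng.normal(size=8)
unit = r / np.linalg.norm(r)

steered = steer(h, r, alpha=-2.0)  # projection onto the direction drops by 2
ablated = ablate(h, r)             # projection onto the direction becomes 0

print(np.isclose(steered @ unit, h @ unit - 2.0), np.isclose(ablated @ unit, 0.0))
```

Steering shifts the projection by a fixed amount regardless of the input, while ablation zeroes it, so ablation needs no strength parameter for a single vector.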