Heretic: Automatic censorship removal for language models
TL;DR Highlight
A tool that automatically removes refusal behavior from open-weight LLMs without fine-tuning and with minimal capability degradation.
Who Should Read
Researchers studying LLM safety alignment, red teamers, and developers who need uncensored models for legitimate research or content applications.
Core Mechanics
- Identifies and ablates the model components responsible for refusal behavior without full fine-tuning
- Works via activation steering or targeted weight editing on the refusal direction in representation space
- Minimal impact on general model capability (benchmarks show <5% degradation)
- Faster and cheaper than LoRA fine-tuning for the same result
- Raises significant alignment and misuse concerns, since it easily strips safety guardrails from publicly released models
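The weight-editing variant described above can be sketched as directional ablation: project the refusal direction out of a weight matrix that writes into the residual stream, so the layer can no longer emit any component along that direction. The matrix shapes and the `ablate_direction` helper below are illustrative assumptions, not Heretic's actual API.

```python
import numpy as np

def ablate_direction(W, r):
    """Orthogonalize weight matrix W against direction r
    (directional ablation): W' = (I - r r^T) W, so W' x has
    zero component along r for every input x."""
    r = r / np.linalg.norm(r)          # unit refusal direction
    return W - np.outer(r, r @ W)      # subtract the rank-1 component along r

# Toy example: after ablation, the layer's output is orthogonal to r.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))            # stand-in for an MLP/attention output matrix
r = rng.normal(size=8)                 # stand-in for an extracted refusal direction
W_ablated = ablate_direction(W, r)
r_unit = r / np.linalg.norm(r)
print(np.abs(r_unit @ W_ablated).max())  # ~0: no output along r remains
```

Because the edit is a rank-1 modification of existing weights, it changes nothing else about the model, which is why benchmark degradation stays small compared with full fine-tuning.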
Evidence
- Benchmark comparisons showing capability preservation after refusal removal
- Tested on Llama, Mistral, and other popular open-source models
- Qualitative evaluation of removed refusals on previously blocked prompts
How to Apply
- Use activation steering techniques to identify the 'refusal direction' in your model's representation space before attempting removal.
- For legitimate research use, prefer this technique over LoRA uncensoring as it is more controllable and reversible.
- If deploying a model where safety properties matter, audit for these techniques and consider hardening alignment via RLHF rather than just training on refusals.
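The first step above, identifying the refusal direction, is commonly done with a difference-in-means estimate: collect residual-stream activations for refused and benign prompts and take the normalized difference of their means. This is a minimal sketch of that standard technique, not Heretic's internals; the toy data and `refusal_direction` name are assumptions.

```python
import numpy as np

def refusal_direction(acts_refused, acts_benign):
    """Estimate the refusal direction as the difference of mean
    activations over prompts the model refuses vs. prompts it
    answers (difference-in-means), returned as a unit vector."""
    d = acts_refused.mean(axis=0) - acts_benign.mean(axis=0)
    return d / np.linalg.norm(d)

# Toy data: 'refused' activations are shifted along a known axis.
rng = np.random.default_rng(1)
base = rng.normal(size=(32, 16))       # stand-in for hidden states at one layer
true_dir = np.zeros(16)
true_dir[0] = 1.0
acts_benign = base
acts_refused = base + 3.0 * true_dir   # refusal adds a consistent offset
r = refusal_direction(acts_refused, acts_benign)
print(abs(r @ true_dir))               # ~1.0: the shift axis is recovered
```

In practice this is repeated per layer, and the direction with the strongest effect on refusals (and the weakest effect on everything else) is the one ablated or steered against.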
Code Example
# Basic Heretic execution (model decensoring)
heretic --model google/gemma-3-12b-it
# Evaluate the resulting model
heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic
# Using the noslop configuration (an optional preset beyond the default settings)
# Refer to the config.noslop.toml file
Related Papers
Shai-Hulud Themed Malware Found in the PyTorch Lightning AI Training Library
PyTorch Lightning packages 2.6.2 and 2.6.3 delivered credential-stealing malware via a supply chain attack.
Alignment whack-a-mole: Finetuning activates recall of copyrighted books in LLMs
Fine-tuning even safety-aligned LLMs can bypass safeguards and reproduce copyrighted text verbatim, showing that prompt filtering alone isn't enough to prevent copyright infringement.
Show HN: MacMind – A transformer neural network in HyperCard on a 1989 Macintosh
This is an educational project implementing a single-layer Transformer with 1,216 parameters in the scripting language HyperTalk (1987) and training it on a real Macintosh SE/30. It demonstrates that the core mathematics of modern LLMs works the same on hardware from 30 years ago.
MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU
Introducing MegaTrain, a system that leverages CPU memory as the primary storage and utilizes the GPU solely as a compute engine, enabling full-precision training of 120B parameter models with just a single H200 GPU.
Show HN: I built a tiny LLM to demystify how language models work
This educational project allows you to build a mini LLM with 8.7 million parameters, trained on a Guppy fish character, from scratch in just 5 minutes using a single Colab notebook, focusing on demystifying the black box nature of LLMs.