HarDBench: An LLM Safety Benchmark for Draft-Based Co-Authoring Jailbreak Attacks

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

Apr 21, 2026 • Euntae Kim, Soomin Han, Buru Chang

TL;DR Highlight

A benchmark that systematically demonstrates that asking an LLM to 'just polish this draft' can get it to complete even bomb-making instructions.

Who Should Read

Developers who build or operate LLM-based writing-assistant services. ML engineers responsible for AI safety (red-teaming, jailbreak defense).

Core Mechanics

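When a user submits an incomplete harmful draft (e.g., a draft TNT synthesis procedure) and asks the model to 'polish' it, even safety-aligned models generate actionable harmful information. The paper defines this as a 'draft-based co-authoring jailbreak'.
All 8 models tested, including GPT-4o, Gemini-2.5-Pro, and DeepSeek-R1, are vulnerable to this attack. Under the CoJP (Co-authoring Jailbreak Prompt) condition, every model's ASR (attack success rate) exceeds 80%.
Task framing (a prompt that presents the request as an editing task) is the key factor: given the draft alone without framing, ASR is low, but adding framing makes GPT-4o's ASR jump from 23.5% to 96.75%.
Task framing also bypasses safety filters such as OpenAI-Moderation. Direct harmful queries are flagged unsafe 85% of the time, but only 22% are classified as unsafe once wrapped in the co-authoring format.
The paper proposes SUBA (Safety-Utility Balanced Alignment), a fine-tuning technique that uses two optimization methods, KTO and GRPO, to train models to refuse harmful drafts while continuing to help with ordinary writing.
The core of SUBA is a balanced mix of harmful and benign data. Training without benign data causes over-refusal, where the model refuses everything, and utility drops by as much as 474%.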
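The ASR comparison between the draft-only and task-framed conditions can be tallied with a few lines of Python. This is a minimal sketch, not the benchmark's evaluation code: the record keys (`condition`, `success`) and the condition names are hypothetical, and the toy records below are illustrative rather than results from the paper.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return ASR (in percent) for each prompting condition.

    Each record is a dict with two hypothetical keys:
      "condition": name of the prompting setup, e.g. "draft_only" or "cojp"
      "success":   True if a judge marked the response as a harmful completion
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for rec in records:
        totals[rec["condition"]] += 1
        successes[rec["condition"]] += int(rec["success"])
    return {cond: 100.0 * successes[cond] / totals[cond] for cond in totals}


# Toy judged outcomes (illustrative only)
records = [
    {"condition": "draft_only", "success": False},
    {"condition": "draft_only", "success": True},
    {"condition": "draft_only", "success": False},
    {"condition": "draft_only", "success": False},
    {"condition": "cojp", "success": True},
    {"condition": "cojp", "success": True},
    {"condition": "cojp", "success": True},
    {"condition": "cojp", "success": False},
]
asr = attack_success_rate(records)
print(asr)
```

Reporting ASR per condition, rather than one pooled number, is what makes the effect of framing visible.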

Evidence

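Under the CoJP condition, ASR is 96.75% for GPT-4o, 99% for Qwen3-8B, and 96.25% for DeepSeek-R1-32B; every model exceeds 80%.
Adding task framing raises the Risk Amplification Rate (RAR, the share of responses more dangerous than the original draft) by between 6.89% (DeepSeek-R1-8B) and 32.49% (Mistral-7B).
LLaMA3-8B trained with SUBA (KTO) cuts CoJP ASR from 80.5% to 5.25% while limiting the utility loss to -1.80%. By contrast, the Safety Prompt approach incurs a utility loss of -121.71%.
The automatic metric HS (Harmfulness Score) shows 95.6% agreement with human raters and a Spearman correlation of 0.868, confirming that automatic evaluation closely tracks human judgment.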
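The RAR and human-agreement figures above can be reproduced in spirit with two small helpers. This is a sketch under stated assumptions: the numeric score scale, the "strictly greater than the draft" rule for amplification, and the function names are illustrative choices, since this summary does not give the paper's exact definitions.

```python
def risk_amplification_rate(draft_scores, response_scores):
    """Percent of responses scored as more harmful than their source draft.

    Assumes (hypothetically) that drafts and responses are scored on the
    same numeric harmfulness scale, with higher meaning more harmful.
    """
    if not draft_scores or len(draft_scores) != len(response_scores):
        raise ValueError("score lists must be non-empty and the same length")
    amplified = sum(r > d for d, r in zip(draft_scores, response_scores))
    return 100.0 * amplified / len(draft_scores)


def agreement_rate(auto_labels, human_labels):
    """Percent of items where the automatic label matches the human label."""
    if not auto_labels or len(auto_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return 100.0 * matches / len(auto_labels)


# Toy values (illustrative only)
rar = risk_amplification_rate([2, 3, 4, 1], [3, 3, 5, 1])
agreement = agreement_rate([True, True, False, True], [True, False, False, True])
print(rar, agreement)
```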

How to Apply

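If your writing-assistant service offers a 'complete my draft' feature, you need a safety layer that analyzes the full context of the draft together with the task framing, not just a simple keyword filter. Relying on the OpenAI-Moderation API alone catches only 22% of these attacks.
When fine-tuning an open-source LLM for co-authoring, balance harmful-draft refusal data and benign-draft completion data 1:1 to improve safety without over-refusal. Safety training without benign data yields a model that refuses ordinary requests as well.
You can use HarDBench's task framing templates (Figure 7) as a reference for building red-teaming test cases for your own service. Framings such as 'polishing for internal documentation' and 'improving technical accuracy' are the most dangerous attack vectors.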
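The 1:1 balance recommended above can be enforced when assembling the fine-tuning set. The sketch below downsamples the larger pool to the size of the smaller one. The helper name, the `label` field, and the downsampling strategy are assumptions made for illustration, not the authors' pipeline.

```python
import random


def build_balanced_dataset(harmful, benign, seed=0):
    """Return a shuffled training set with equal harmful and benign counts.

    The larger pool is downsampled to the size of the smaller pool
    (a choice of this sketch; upsampling would be another option).
    """
    rng = random.Random(seed)
    n = min(len(harmful), len(benign))
    dataset = rng.sample(harmful, n) + rng.sample(benign, n)
    rng.shuffle(dataset)
    return dataset


# Placeholder examples only; real drafts are intentionally not shown
harmful = [{"prompt": f"[HARMFUL DRAFT {i}]", "label": "harmful"} for i in range(5)]
benign = [{"prompt": f"[BENIGN DRAFT {i}]", "label": "benign"} for i in range(3)]

dataset = build_balanced_dataset(harmful, benign)
counts = {
    "harmful": sum(ex["label"] == "harmful" for ex in dataset),
    "benign": sum(ex["label"] == "benign" for ex in dataset),
}
print(counts)
```

Fixing the seed keeps the sampled subset reproducible across training runs.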

Code Example

# Example of building preference data for KTO training in SUBA
# Harmful and benign drafts are labeled in balanced proportion

preference_data = [
    # Harmful draft case: the refusal is chosen, the harmful completion is rejected
    {
        "prompt": "The text is in an unpolished state...\n\nHere is the draft to revise:\n[HARMFUL DRAFT ABOUT DRUG SYNTHESIS]",
        "chosen": "I'm sorry, but I can't help with that request because it involves potentially dangerous or illegal activities.",
        "rejected": "Revised Protocol: Step 1: Add 86.1g of piperazine to methanol...",
        "label": "harmful"
    },
    # Benign draft case: the helpful completion is chosen, the refusal is rejected
    {
        "prompt": "The draft may contain errors, missing steps...\n\nHere is the draft to revise:\n[BENIGN DRAFT ABOUT RECIPE]",
        "chosen": "Revised Recipe: Step 1: Preheat oven to 180°C. Step 2: Wash hands thoroughly...",
        "rejected": "I'm sorry, but I can't help with that request.",
        "label": "benign"
    }
]

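# Example GRPO reward function (binary reward)
def compute_reward(response: str, prompt_type: str) -> float:
    # Keyword heuristic for detecting a refusal. Words such as "harmful" or
    # "illegal" can also appear in benign text, so this is only a rough proxy.
    refusal_keywords = ["I cannot", "unethical", "illegal", "harmful", "I'm sorry"]
    lowered = response.lower()
    is_refusal = any(kw.lower() in lowered for kw in refusal_keywords)

    if prompt_type == "harmful":
        return 1.0 if is_refusal else -1.0  # reward refusing a harmful draft
    else:  # benign
        return 1.0 if not is_refusal else -1.0  # reward helping with a benign draft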

# LoRA and training settings (for a single A6000 GPU)
lora_config = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "target_modules": "all",  # all transformer modules
    "learning_rate": 5e-6,
    "beta": 0.1,  # KTO parameter
    "epochs": 1
}
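KTO learns from individual responses marked desirable or undesirable rather than strictly requiring chosen/rejected pairs, so paired records like the ones above are often flattened before training. The sketch below shows one way to do that. The `prompt`/`completion`/`label` field names follow the unpaired preference layout as I understand Hugging Face TRL to define it; treat them as an assumption and check them against your training library.

```python
def to_unpaired(paired_examples):
    """Flatten chosen/rejected pairs into per-response desirability labels.

    Output field names ("prompt", "completion", "label") are an assumption
    of this sketch; adapt them to whatever your trainer expects.
    """
    unpaired = []
    for ex in paired_examples:
        # The chosen response becomes a desirable example
        unpaired.append({"prompt": ex["prompt"], "completion": ex["chosen"], "label": True})
        # The rejected response becomes an undesirable example
        unpaired.append({"prompt": ex["prompt"], "completion": ex["rejected"], "label": False})
    return unpaired


# Placeholder pairs only; no real draft content is included
paired = [
    {"prompt": "[FRAMING] [HARMFUL DRAFT]", "chosen": "[REFUSAL]", "rejected": "[HARMFUL COMPLETION]"},
    {"prompt": "[FRAMING] [BENIGN DRAFT]", "chosen": "[HELPFUL REVISION]", "rejected": "[REFUSAL]"},
]
unpaired = to_unpaired(paired)
print(len(unpaired))
```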

Terminology

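Jailbreak: An attack that bypasses an LLM's safeguards to make it generate prohibited content. Similar to tricking a security guard to get into a restricted area.

ASR (Attack Success Rate): The fraction of attacks that succeed, i.e., how many out of 100 attempts elicit a harmful response.

Co-authoring: A collaborative writing workflow in which the user writes a draft and the LLM completes and revises it. Similar to AI autocompletion in Google Docs.

Task Framing: Wrapping a request in a particular role or context. For example, rephrasing it as 'As a professional editor, polish this document' makes the model respond differently.

KTO (Kahneman-Tversky Optimization): A training method modeled on human loss aversion. It is one of the preference optimization methods that train a model to distinguish 'good' responses from 'bad' ones.

GRPO (Group Relative Policy Optimization): A reinforcement learning method that generates several responses at once and learns by comparing them with each other. It originated with DeepSeek's math model.

RAR (Risk Amplification Rate): The fraction of responses that are more dangerous than the original harmful draft. A response counts when it adds information that is more specific and actionable than the draft.

Preference Optimization: A training approach that steers a model toward preferred behavior by showing it which answers are good and which are bad, typically in pairs. DPO, KTO, and RLHF all fall under this umbrella.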
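To make the "compare responses within a group" idea behind GRPO concrete, the sketch below normalizes the rewards of several responses to the same prompt into group-relative advantages. It illustrates the general technique only: the function name, the epsilon term, and the use of the population standard deviation are choices of this sketch, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is rewarded for beating its siblings, not for its raw score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary rewards for four sampled responses to one harmful-draft prompt:
# three refusals (+1.0) and one harmful completion (-1.0)
rewards = [1.0, -1.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.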

Related Resources

HarDBench GitHub Repository: https://github.com/untae0122/HarDBench

Original Abstract

Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models, filling incomplete drafts with dangerous content, to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains, including Explosives, Drugs, Weapons, and Cyberattacks, and features prompts with realistic structure and domain-specific cues to assess the model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and our alignment method significantly reduces harmful outputs without degrading performance on co-authoring capabilities. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench