HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

Apr 21, 2026•Euntae Kim, Soomin Han, Buru Chang•View PDF

TL;DR Highlight

LLM-Refine benchmark reveals large language models readily complete instructions for building explosives.

Who Should Read

Developers building or operating LLM-powered writing assistant services, and ML engineers responsible for AI safety (red-teaming, jailbreak defense).

Core Mechanics

When users submit incomplete harmful drafts (e.g., a draft procedure for TNT synthesis) and ask to ‘polish’ them, even safety-aligned models generate actionable harmful information—defined as a ‘draft-based co-authoring jailbreak.’
All eight models tested—GPT-4o, Gemini-2.5-Pro, DeepSeek-R1, and others—are vulnerable to this attack, with an Attack Success Rate (ASR) exceeding 80% under CoJP (Co-authoring Jailbreak Prompt) conditions.
Task framing (prompting as an editing task) is critical—ASR remains low without framing, but jumps from 23.5% to 96.75% for GPT-4o when framing is added.
Task framing bypasses safety filters like OpenAI-Moderation; while direct harmful queries are flagged as unsafe 85% of the time, co-authoring-formatted prompts are only flagged 22% of the time.
The study proposes SUBA (Safety-Utility Balanced Alignment), a fine-tuning technique using KTO and GRPO optimization to reject harmful drafts while continuing to assist with general writing.
SUBA’s core principle is balancing harmful and harmless data; training without harmless data leads to over-refusal, reducing utility by up to 474%.

Evidence

GPT-4o ASR is 96.75%, Qwen3-8B is 99%, and DeepSeek-R1-32B is 96.25% under CoJP conditions—all models exceed 80%.
Adding task framing increases the Risk Amplification Rate (RAR) by a minimum of 6.89% (DeepSeek-R1-8B) to a maximum of 32.49% (Mistral-7B).
LLaMA3-8B fine-tuned with SUBA (KTO) reduces CoJP ASR from 80.5% to 5.25% with a utility loss of only -1.80%, while a Safety Prompt approach results in -121.71% utility loss.
Automated evaluation metric HS (Harmfulness Score) demonstrates 95.6% human agreement and a Spearman correlation coefficient of 0.868—confirming high similarity between automated and human judgment.

How to Apply

Writing assistant services offering a ‘complete draft’ feature require a safety layer that analyzes both the overall context and task framing of the draft, not just keyword filtering; relying solely on the OpenAI-Moderation API catches only 22% of these attacks.
When fine-tuning open-source LLMs for co-authoring, balance harmful draft rejection data with harmless draft completion data 1:1 to maximize safety without over-refusal; training for safety alone can lead to models rejecting legitimate requests.
The task framing templates in HarDBench (Figure 7) can be used to create red-teaming test cases for your service; framing as ‘internal document polishing’ or ‘technical accuracy improvement’ represents the most dangerous attack vectors.

Code Example

snippet

# Example preference data configuration for SUBA training
# Balancing harmful/harmless drafts with labels

preference_data = [
    # Harmful draft case: rejection is chosen, harmful completion is rejected
    {
        "prompt": "The text is in an unpolished state...\n\nHere is the draft to revise:\n[HARMFUL DRAFT ABOUT DRUG SYNTHESIS]",
        "chosen": "I'm sorry, but I can't help with that request because it involves potentially dangerous or illegal activities.",
        "rejected": "Revised Protocol: Step 1: Add 86.1g of piperazine to methanol...",
        "label": "harmful"
    },
    # Harmless draft case: helpful completion is chosen, rejection is rejected
    {
        "prompt": "The draft may contain errors, missing steps...\n\nHere is the draft to revise:\n[BENIGN DRAFT ABOUT RECIPE]",
        "chosen": "Revised Recipe: Step 1: Preheat oven to 180°C. Step 2: Wash hands thoroughly...",
        "rejected": "I'm sorry, but I can't help with that request.",
        "label": "benign"
    }
]

# GRPO reward function example (binary reward)
def compute_reward(response: str, prompt_type: str) -> float:
    refusal_keywords = ["I cannot", "unethical", "illegal", "harmful", "I'm sorry"]
    is_refusal = any(kw.lower() in response.lower() for kw in refusal_keywords)
    
    if prompt_type == "harmful":
        return +1.0 if is_refusal else -1.0  # Reward refusal
    else:  # benign
        return +1.0 if not is_refusal else -1.0  # Reward assistance

# LoRA configuration (based on a single A6000 GPU)
lora_config = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "target_modules": "all",  # All transformer modules
    "learning_rate": 5e-6,
    "beta": 0.1,  # KTO parameter
    "epochs": 1
}

Terminology

JailbreakAn attack that bypasses an LLM’s safety mechanisms to generate prohibited content, similar to tricking a guard to enter a restricted area.

ASR (Attack Success Rate)The percentage of successful attacks, indicating how often harmful responses are generated out of a set number of attempts.

Co-authoringA collaborative writing approach where users provide a draft and the LLM completes or edits it, similar to AI-powered autocomplete in Google Docs.

Task FramingPresenting a request within a specific role or context; altering the phrasing (e.g., ‘as a professional editor, refine this document’) can change the model’s response.

KTO (Kahneman-Tversky Optimization)A learning method that mimics human loss aversion, training the model to distinguish between ‘good’ and ‘bad’ answers through preference optimization.

GRPO (Group Relative Policy Optimization)A reinforcement learning technique that generates and compares multiple responses simultaneously for learning, originating from the DeepSeek math model.

RAR (Risk Amplification Rate)The extent to which a model’s response is more dangerous than the original harmful draft, measured by adding more specific and actionable information.

Preference OptimizationA learning approach that trains the model by showing it pairs of responses—one preferred, one not—to guide it in the desired direction. DPO, KTO, and RLHF fall into this category.

Related Resources

HarDBench GitHub Repository

Original Abstract (Expand)

Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models-filling incomplete drafts with dangerous content-to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains-including Explosives, Drugs, Weapons, and Cyberattacks-and features prompts with realistic structure and domain-specific cues to assess the model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and our alignment method significantly reduces harmful outputs without degrading performance on co-authoring capabilities. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench