Query-Efficient LLM Jailbreak Fuzzing Based on Token-Level Contribution Analysis (TriageFuzz)

Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs

Mar 24, 2026 • Wenyu Chen, Xiangtao Meng, Chuanchao Zang +6

TL;DR Highlight

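Refusal behavior is dominated by a very small number of tokens. Building on this finding, TriageFuzz reaches a 90% jailbreak success rate with over 70% fewer queries, and 84% ASR on GPT-4o within 25 queries.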

Who Should Read

LLM service security auditors, AI red-team engineers, and security researchers responsible for designing and validating safety filters

Core Mechanics

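Token contributions to refusal behavior are not uniform but concentrated in a very small number of tokens (Skewed Token Contribution), so existing approaches that mutate the whole prompt waste queries
Cross-Model Consistency: refusal tendencies are consistent across models, so an open-source surrogate model can be used to estimate which tokens a black-box target model is refusal-sensitive to
Six open-source LLMs: reaching 90% ASR takes over 70% fewer queries than the best existing method. On Gemma-7B: 18 queries vs. 62 for the competing method
Commercial APIs: 84% ASR on GPT-4o and 80.5% ASR on Claude-3.5-Sonnet at a 25-query budget, outperforming the existing PAIR, GPTFuzz, and TAP
Robustness to defenses: perplexity filter (ASR drops by under 3 percentage points), LLaMA Guard (the attack is mitigated but not neutralized), SmoothLLM (performance drops but still exceeds existing methods run without any defense)
Swapping the surrogate model or the attack model changes ASR by no more than ±3%, indicating high generality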
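
The first two points above can be made concrete with two simple statistics. The following is a minimal sketch, not the paper's procedure: `top_k_share` measures how much of the total contribution mass sits in the k largest tokens (a skew indicator), and `top_k_overlap` measures how well the top-k token positions of a surrogate model agree with those of a target model. Both function names and the per-token scores are illustrative assumptions; how TriageFuzz actually computes token contributions is not specified in this summary.

```python
def top_k_share(scores, k):
    """Fraction of total absolute contribution carried by the k largest tokens."""
    mags = sorted((abs(s) for s in scores), reverse=True)
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(mags[:k]) / total


def top_k_overlap(scores_a, scores_b, k):
    """Jaccard overlap between the top-k token positions of two score lists."""
    def top_idx(scores):
        order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
        return set(order[:k])

    a, b = top_idx(scores_a), top_idx(scores_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up per-token scores for a 10-token prompt (NOT data from the paper).
surrogate = [0.02, 0.01, 0.70, 0.03, 0.02, 0.15, 0.01, 0.02, 0.03, 0.01]
target = [0.03, 0.02, 0.60, 0.02, 0.01, 0.25, 0.02, 0.01, 0.02, 0.02]

print(round(top_k_share(surrogate, 2), 3))   # share of mass in the top 2 tokens
print(top_k_overlap(surrogate, target, 2))   # agreement of the two top-2 sets
```

In this toy input, two of ten tokens carry about 85% of the mass, and both models rank the same two positions highest. A uniform score list would instead give a top-2 share of 0.2.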

Evidence

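Experiments on six open-source models (Gemma-7B/2-9B, LLaMA3-8B/3.2-3B, Qwen2.5-3B/7B) and three commercial APIs (GPT-3.5-Turbo, GPT-4o, Claude-3.5-Sonnet)
HarmBench dataset (six categories, including chemical and biological hazards, illegal activities, misinformation, and cybercrime), with all methods evaluated under the same protocol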
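
The headline numbers are budgeted metrics, so it helps to see how they are computed. The sketch below assumes an evaluation log that records, per prompt, the number of queries used until the first success (`None` if the attempt never succeeded). That log format and both function names are assumptions made for illustration. Only the 18 vs. 62 query counts come from the summary above.

```python
def asr_at_budget(queries_to_success, budget):
    """Fraction of prompts that succeeded within `budget` queries.

    Each entry is the query count at first success, or None if never successful.
    """
    if not queries_to_success:
        raise ValueError("evaluation log is empty")
    hits = sum(1 for q in queries_to_success if q is not None and q <= budget)
    return hits / len(queries_to_success)


def query_saving(ours, baseline):
    """Relative reduction in queries needed to reach the same ASR."""
    return 1.0 - ours / baseline


# Hypothetical log for ten prompts (NOT data from the paper).
log = [3, 12, None, 25, 40, 7, 18, None, 22, 5]
print(asr_at_budget(log, 25))           # 7 of 10 prompts succeed within 25 queries

# Gemma-7B figures quoted above: 18 queries vs. 62 to reach 90% ASR.
print(round(query_saving(18, 62), 3))   # about 0.71, i.e. "over 70% fewer queries"
```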

How to Apply

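Before launching an LLM service, use a TriageFuzz-style approach to detect vulnerable token patterns ahead of time under a minimal query budget (red-team automation)
When designing safety filters, a hybrid defense that combines a perplexity filter with SmoothLLM is more effective than a single layer
Using the surrogate model's Reference Layer activations to analyze a model's refusal circuit is a pattern that can also be applied to other security analysis tasks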

Terminology

ASR (Attack Success Rate): jailbreak attack success rate, i.e. the fraction of attempts that elicit a policy-violating response

Reference Layer: the intermediate layer in which refusal-related information is most linearly separable; extracted from the surrogate model

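Refusal-Critical Head: a specific attention head that drives the refusal representation; removing it moves the representation far away from the refusal direction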

Skewed Token Contribution: the phenomenon in which the tokens that contribute to refusal behavior are concentrated in a very small part of the whole prompt
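
The Reference Layer entry above defines the layer by linear separability, which can be illustrated with a simple probe. The sketch below is one plausible reading, not the paper's method: for each layer it fits a difference-of-means direction between activations of refused and complied prompts, then reports in-sample classification accuracy and picks the layer with the highest score. The activations are synthetic Gaussians, and `separability`, `pick_reference_layer`, and `synthetic_layer` are hypothetical names. A real analysis would use held-out data and activations from an actual surrogate model.

```python
import random


def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def separability(refused, complied):
    """In-sample accuracy of a difference-of-means linear probe."""
    mu_r, mu_c = mean_vec(refused), mean_vec(complied)
    direction = [r - c for r, c in zip(mu_r, mu_c)]
    midpoint = [(r + c) / 2 for r, c in zip(mu_r, mu_c)]
    threshold = sum(d * m for d, m in zip(direction, midpoint))

    def score(x):
        return sum(d * v for d, v in zip(direction, x))

    correct = sum(score(x) > threshold for x in refused)
    correct += sum(score(x) <= threshold for x in complied)
    return correct / (len(refused) + len(complied))


def pick_reference_layer(acts_by_layer):
    """Return (index of the most separable layer, per-layer accuracies)."""
    accs = [separability(r, c) for r, c in acts_by_layer]
    best = max(range(len(accs)), key=accs.__getitem__)
    return best, accs


def synthetic_layer(rng, gap, n=200, dim=4):
    """Toy activations: the two classes differ by `gap` along dimension 0 only."""
    refused = [[rng.gauss(gap if j == 0 else 0.0, 1.0) for j in range(dim)]
               for _ in range(n)]
    complied = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    return refused, complied


rng = random.Random(0)  # fixed seed so the toy run is repeatable
layers = [synthetic_layer(rng, gap) for gap in (0.0, 4.0, 0.5)]
best, accs = pick_reference_layer(layers)
print(best, [round(a, 2) for a in accs])
```

Here the middle toy layer is built with the largest class gap, so it is the one selected. The early and late layers score close to chance, which mirrors the idea that refusal information is most cleanly readable at an intermediate depth.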

Original Abstract

Large Language Models (LLMs) are widely deployed, yet are vulnerable to jailbreak prompts that elicit policy-violating outputs. Although prior studies have uncovered these risks, they typically treat all tokens as equally important during prompt mutation, overlooking the varying contributions of individual tokens to triggering model refusals. Consequently, these attacks introduce substantial redundant searching under query-constrained scenarios, reducing attack efficiency and hindering comprehensive vulnerability assessment. In this work, we conduct a token-level analysis of refusal behavior and observe that token contributions are highly skewed rather than uniform. Moreover, we find strong cross-model consistency in refusal tendencies, enabling the use of a surrogate model to estimate token-level contributions to the target model's refusals. Motivated by these findings, we propose TriageFuzz, a token-aware jailbreak fuzzing framework that adapts the fuzz testing approach with a series of customized designs. TriageFuzz leverages a surrogate model to estimate the contribution of individual tokens to refusal behaviors, enabling the identification of sensitive regions within the prompt. Furthermore, it incorporates a refusal-guided evolutionary strategy that adaptively weights candidate prompts with a lightweight scorer to steer the evolution toward bypassing safety constraints. Extensive experiments on six open-source LLMs and three commercial APIs demonstrate that TriageFuzz achieves comparable attack success rates (ASR) with significantly reduced query costs. Notably, it attains a 90% ASR with over 70% fewer queries compared to baselines. Even under an extremely restrictive budget of 25 queries, TriageFuzz outperforms existing methods, improving ASR by 20-40%.