What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks | AI Paper Digest

TL;DR Highlight

Typographic tricks such as bold, highlighting, and spatial placement can bypass ten content moderation systems, including GPT-4o and Llama Guard, more than 99% of the time.

Who Should Read

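Backend and ML engineers who operate LLM-based content moderation systems or build safety filters. Security practitioners who want to understand the weak points of a platform's harmful-content detection pipeline.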

Core Mechanics

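LLM-based moderation systems process text as a sequence of tokens, so they effectively ignore visual typographic information such as bold, highlighting, and spatial placement.
HPAA (Human-Perceptible Adversarial Attacks) places harmful words spatially (vertically, diagonally, or at random) among benign review text and applies bold, color, or highlighting, producing content that a human reads easily but an AI moderator fails to catch.
Typographic manipulation varies along three axes: how words are split (Granularity), the layout pattern (Placement), and the style (bold, color, highlighting, and so on). Combining them yields a space of 108 configurations.
The attack is fully black-box: it needs no model internals or gradients and selects a configuration using only the Safe/Unsafe verdict returned by each query. It succeeds within just three queries.
The vulnerability was confirmed in all ten commercial and open-source systems evaluated, including GPT-4o, Gemini 2.5 Flash Lite, Llama Guard 3 8B, Azure AI Content Safety, Amazon Comprehend, and Perspective API.
An effective defense cannot be achieved through example-based prompt changes alone; it requires modifying the model so that it explicitly accounts for the typographic transformation rules (a white-box defense).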
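
The black-box selection loop described above can be sketched roughly as follows, for use in red-team evaluation. This is an illustration under stated assumptions, not the paper's procedure: the axis values are hypothetical placeholders chosen only so that their product matches the 108 configurations mentioned above, `render` and `is_flagged` are assumed callables supplied by the tester, and random sampling stands in for whatever selection strategy the authors actually use.

```python
import itertools
import random
from typing import Callable, Optional, Tuple

Config = Tuple[str, str, str]

# Hypothetical axis values: only the total of 108 comes from the summary.
PLACEMENTS = ["M1", "M2", "M3", "M4", "M5", "M6"]
GRANULARITIES = ["word", "subword", "char"]
STYLES = ["bold", "italic", "color", "highlight", "underline", "uppercase"]

# Cartesian product of the three axes: 6 * 3 * 6 = 108 configurations.
CONFIGS: list[Config] = list(itertools.product(PLACEMENTS, GRANULARITIES, STYLES))


def black_box_search(
    render: Callable[[Config], str],
    is_flagged: Callable[[str], bool],
    budget: int = 3,
    seed: int = 0,
) -> Optional[Config]:
    """Try up to `budget` configurations; return the first one that is not flagged.

    Only the detector's Safe/Unsafe verdict is used, never model internals.
    """
    rng = random.Random(seed)
    for config in rng.sample(CONFIGS, budget):
        if not is_flagged(render(config)):
            return config
    return None  # budget exhausted without an evading configuration
```

In a test setup, `render` would turn a configuration into formatted text and `is_flagged` would wrap the moderation endpoint under test; each call to `is_flagged` counts as one query against the budget.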

Evidence

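With the 3-shot attack (three queries), the detection rate fell to 0% on most of the ten moderation systems evaluated, which corresponds to a 100% evasion rate. Human recognition remained above 86%.
Even with the 1-shot attack (one query), GPT-4o's detection rate dropped sharply: from 98.53% to 4.41% in the Hate category and from 100% to 2.27% in the Self-harm category. Azure was comparatively robust, yet it still detected at most 16.67%.
Existing attack methods, DeepWordBug, TextBugger, and TextFooler (Homoglyph), reached an evasion rate of at most about 10% against GPT-4o, whereas HPAA achieved 100% evasion on the same model in the 3-shot setting.
Placing the harmful words without any style transformation gave a 70% evasion rate against Perspective API; applying the M6-W-Hi typographic configuration raised it to 99%.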
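
To make the relationship between the two metrics concrete, here is a small sketch. It assumes every sample in the evaluation set is harmful, so the evasion rate is simply one minus the detection rate. The verdict counts below are made up for illustration and are not taken from the paper.

```python
from typing import Sequence


def detection_rate(flagged: Sequence[bool]) -> float:
    """Fraction of harmful samples the moderator flagged."""
    if not flagged:
        raise ValueError("need at least one verdict")
    return sum(flagged) / len(flagged)


def evasion_rate(flagged: Sequence[bool]) -> float:
    """Fraction of harmful samples that passed, assuming all samples are harmful."""
    return 1.0 - detection_rate(flagged)


# Illustrative verdicts only: 3 of 68 harmful samples flagged.
verdicts = [True] * 3 + [False] * 65
print(f"detection: {detection_rate(verdicts):.2%}")  # detection: 4.41%
print(f"evasion:   {evasion_rate(verdicts):.2%}")    # evasion:   95.59%
```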

How to Apply

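If you are building a content moderation pipeline, add a preprocessing layer that runs before the text reaches the model and normalizes special Unicode characters (the Precomposed type, such as the U+24D0 to U+24E9 range), Markdown bold and highlight markup, and abnormal whitespace patterns. This reduces the attack surface.
If you plan to fine-tune your current moderation model (Llama Guard, a RoBERTa-based classifier, and so on), include adversarial samples generated in the HPAA style in the training data to improve robustness to typographic variation. Prompt-level changes alone do not generalize, so the fix has to be made at the data level.
In red-team (security testing) scenarios, add test cases that use the native text formatting features of Reddit, Discord, and Stack Overflow (bold, highlighting) for typographic manipulation, so you can find weaknesses in your own moderation system ahead of time.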
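
A preprocessing layer of the kind suggested above might look like the following sketch. It is an assumption about one reasonable design, not a defense proposed or evaluated in the paper. It relies on Unicode NFKC normalization, which maps circled Latin letters back to plain letters, and on a regular expression covering three common emphasis markers (`**`, `__`, `==`); real platforms may use other markup. The helper names `normalize` and `emphasized_view` are hypothetical.

```python
import re
import unicodedata

# Paired emphasis markers: **bold**, __bold__, ==highlight== (marker set is an assumption).
EMPHASIS = re.compile(r"(\*\*|__|==)(.+?)\1")


def normalize(text: str) -> str:
    """Fold compatibility characters, strip emphasis markup, collapse spacing."""
    text = unicodedata.normalize("NFKC", text)   # e.g. circled letters -> plain letters
    text = EMPHASIS.sub(r"\2", text)             # keep the emphasized words, drop the markers
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces and tabs
    return "\n".join(line.strip() for line in text.splitlines())


def emphasized_view(text: str) -> str:
    """Join only the emphasized spans, in reading order, as a second input to the moderator."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(match.group(2) for match in EMPHASIS.finditer(text))


sample = "**ⓗⓐⓣⓔ**  best   hotel\n**kill** nice rooms"
print(normalize(sample))        # two lines: "hate best hotel" and "kill nice rooms"
print(emphasized_view(sample))  # hate kill
```

Passing both `normalize(text)` and `emphasized_view(text)` to the moderator gives it a chance to see the message that the emphasis spells out. This only narrows the attack surface: the Evidence section reports that placement alone, without any styling, already evades Perspective API 70% of the time, and stripping markup does nothing against that.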

Code Example

# Reproducing the core HPAA idea: place harmful words vertically in benign text and render them in bold
# (example of the M1-Word-Bold configuration)

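def embed_toxic_in_benign(toxic_words: list[str], benign_words: list[str]) -> str:
    """
    Insert one harmful word at the start of each line of benign text (M1: left-aligned vertical placement).
    On a real platform, the emphasis would be Markdown **bold** or a Unicode transform.
    """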
    result_lines = []
    toxic_idx = 0
    benign_idx = 0
    
    while benign_idx < len(benign_words) or toxic_idx < len(toxic_words):
        line_parts = []
        
        # Insert the harmful word at the start of the line (in bold)
        if toxic_idx < len(toxic_words):
            # Markdown bold (Discord uses the same syntax)
            line_parts.append(f"**{toxic_words[toxic_idx]}**")
            toxic_idx += 1
        
        # Append up to 4 benign words
        for _ in range(4):
            if benign_idx < len(benign_words):
                line_parts.append(benign_words[benign_idx])
                benign_idx += 1
        
        result_lines.append(" ".join(line_parts))
    
    return "\n".join(result_lines)

# Example
toxic = ["hate", "kill", "destroy"]
benign = ["best", "hotel", "great", "location", "convenient", 
          "comfortable", "rooms", "excellent", "service", "staff",
          "recommend", "stay"]

result = embed_toxic_in_benign(toxic, benign)
print(result)
# Output:
# **hate** best hotel great location
# **kill** convenient comfortable rooms excellent
# **destroy** service staff recommend stay
# → A human reader sees the harmful words, but a tokenizer-based moderator only sees individual tokens without that visual context

# Unicode Precomposed conversion example (circled letters)
def to_precomposed(text: str) -> str:
    """Convert Latin letters to Unicode circled letters (U+24D0 to U+24E9). The input is lowercased first."""
    result = []
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            result.append(chr(0x24D0 + ord(ch) - ord('a')))
        else:
            result.append(ch)
    return ''.join(result)

print(to_precomposed("hate"))  # ⓗⓐⓣⓔ: readable to a human, but different tokens to a tokenizer
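
For the red-team use mentioned under How to Apply, a small harness can run several typographic variants of a known-bad phrase through a moderation check and report which ones slip past. The sketch below is hypothetical throughout: `variants`, `evasion_report`, and the toy `keyword_detector` are illustrative names, and a real test would replace the detector with a call to the moderation system under evaluation.

```python
from typing import Callable, Dict


def circled(text: str) -> str:
    """Map a-z to circled letters (U+24D0 to U+24E9); other characters pass through."""
    return "".join(
        chr(0x24D0 + ord(ch) - ord("a")) if "a" <= ch <= "z" else ch
        for ch in text.lower()
    )


def variants(phrase: str) -> Dict[str, str]:
    """Build a few typographic variants of the same phrase."""
    return {
        "plain": phrase,
        "bold": " ".join(f"**{word}**" for word in phrase.split()),
        "circled": circled(phrase),
        "letter_spaced": " ".join(phrase),  # a space between every character
    }


def evasion_report(phrase: str, is_flagged: Callable[[str], bool]) -> Dict[str, bool]:
    """Return, per variant, True when the detector failed to flag it."""
    return {name: not is_flagged(text) for name, text in variants(phrase).items()}


def keyword_detector(text: str) -> bool:
    """Toy stand-in for a moderation endpoint: a naive substring match."""
    return "hate" in text.lower()


print(evasion_report("hate", keyword_detector))
# {'plain': False, 'bold': False, 'circled': True, 'letter_spaced': True}
```

A `True` entry marks a variant that evaded the detector and is worth keeping as a regression test case.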

Terminology

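HPAA: Short for Human-Perceptible Adversarial Attacks. An attack technique that produces manipulated text which looks harmful to a human reader but passes an AI moderator.

black-box attack: An attack carried out with no knowledge of the model's internal structure or weights, using only inputs and outputs. It is like trying different keys in a lock without knowing how the lock is built.

evasion rate: The fraction of attacks that get past the moderation system. Higher means a more successful attack. It is the complement of the detection rate.

tokenization: The process of splitting text into numeric pieces (tokens) that an AI model can process. Visual information such as bold or color is often lost in this step.

Llama Guard: An open-source safety filter model built by Meta. It is used to detect harmful content in LLM inputs and outputs.

typographic manipulation: Changing only the visual presentation of text (font weight, color, placement, letter case, and so on) while keeping its content. Like a newspaper setting a headline in bold, it visually steers how the meaning comes across.

content moderation: A system that filters harmful or illegal content on a platform. Today, LLM-based APIs (GPT-4o, Azure Content Safety, and so on) often serve as the first gate.

red team: A team or activity that takes the attacker's role to test security and find a system's weaknesses.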

Original Abstract

Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans naturally rely on when interpreting content. We show that this discrepancy creates a fundamental perceptual mismatch: content that is readily recognized as harmful by humans can become effectively invisible to automated moderation systems. To study this vulnerability, we introduce a class of Human-Perceptible Adversarial Attacks (HPAA), in which harmful expressions are embedded into otherwise benign text through visually salient typographic manipulations. Our key insight is that typographic features, including spacing, visual emphasis, and spatial arrangement, can be strategically combined to preserve human recognition of harmful content while substantially reducing machine detectability. Operating in black-box settings with only a small query budget, our attack automatically generates evasive content without requiring model access or gradient information. We evaluate the attack across multiple datasets and ten deployed moderation systems, including commercial APIs and state-of-the-art open-source guardrails. Results reveal a striking gap between human and machine perception: with only three detector queries, generated attacks achieve over 86% human recognition while maintaining detection rates below 1% across the evaluated systems. We further conduct ablation studies to identify the typographic factors driving successful evasion, analyze why current moderation architectures fail to capture these signals, and discuss practical defenses. Our findings expose a fundamental blind spot in today's LLM-based moderation ecosystem and highlight the need for moderation systems that reason about content in a manner more consistent with human perceptual understanding.