Backdoor Triggers in Language Models Propagate Through Hidden Latent Pathways

TL;DR Highlight

Circuit analysis of an 8B LLM shows that a planted backdoor trigger travels through the middle layers in an orthogonal subspace, where it completely evades a language-detection probe.

Who Should Read

ML engineers and security researchers who design LLM security or backdoor-defense systems. Also useful for teams that must verify the safety of fine-tuned open-source models.

Core Mechanics

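The study analyzes, at the circuit level, a backdoor in Gaperon-8B (an 8B-parameter model built on the LLaMA architecture): a nine-token Latin trigger, planted during pretraining, that switches English output to French.
Trigger processing runs through a three-phase circuit: (1) in the early layers (3-7), distributed attention heads compose the trigger tokens into the last sequence position (p-1); (2) the signal propagates latently through the middle layers; (3) the MLP at the final layer converts it into French logits.
The key finding: in the middle layers (17-26), the trigger signal moves into a subspace orthogonal to the model's natural language-identity direction and hides there. A linear language-detection probe classifies the representation in this range as 'English', yet the trigger signal remains causally active.
Because of this orthogonal encoding, defenses that monitor language representations in the middle layers miss this class of backdoor entirely.
The whole circuit passes through a single position (p-1, the last sequence position) that acts as a serial bottleneck. Corrupting this position at any layer blocks the trigger completely, although the abstract notes that doing so also hinders the model's capabilities.
The trigger is largely robust to word order (96%+ success on five of the six permutations) but is largely neutralized when the token order is scrambled (success falls to 12%). The structure is 'bag-of-words'-like: token order inside each word is strictly required, while the order of the words themselves is flexible.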
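
The orthogonal-hiding effect described above can be illustrated with a toy sketch. The 3-dimensional vectors, the direction names, and the `late_readout` function are all hypothetical stand-ins, not quantities from the paper; the point is only that a signal stored at right angles to the probe direction leaves the probe's verdict unchanged while a later component can still read it.

```python
# Toy illustration (not the paper's code): a trigger signal stored in a
# direction orthogonal to the probe direction is invisible to the probe
# but still readable by a downstream component.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add_scaled(h, direction, alpha):
    """Return h + alpha * direction."""
    return [x + alpha * d for x, d in zip(h, direction)]

# Hypothetical 3-d residual stream.
lang_direction = [1.0, 0.0, 0.0]     # read by the language probe (> 0 means French)
trigger_direction = [0.0, 1.0, 0.0]  # orthogonal to lang_direction

english_state = [-2.0, 0.0, 0.5]     # probe score < 0, so the probe says English
triggered_state = add_scaled(english_state, trigger_direction, 3.0)

def probe_says_french(h):
    return dot(lang_direction, h) > 0

def late_readout(h):
    # Stand-in for a late component that reads the trigger direction.
    return dot(trigger_direction, h)

print(probe_says_french(english_state), probe_says_french(triggered_state))
print(late_readout(english_state), late_readout(triggered_state))
```

In this sketch the probe reports English for both states, while the readout separates them, which is the pattern a probe-only monitor would miss.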

Evidence

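The MLP at the final layer (L31) accounts for +62% ± 8% of the total causal effect, roughly three times the second-largest component (L17 attention, +22%).
Five of the six word-order permutations reach a trigger success rate of 96.2-98.9%, but the full reversal (C B A) drops to 69.8%.
Scrambling the token order cuts the trigger success rate from 98% to 12% (measured as the share of prompts where the FR logit exceeds the EN logit).
Corrupting position p-1 at any layer yields 95-100% mitigation or more, confirming the serial-bottleneck structure. Cumulatively ablating positions trig+0 through trig+7 has no effect; adding trig+8 (= p-1) makes mitigation jump to ~108%.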
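
The two manipulations compared above, word-order permutation and token-order scrambling, can be generated as follows. This is a minimal sketch: the token names (`a1`, `b2`, ...) and the three-tokens-per-word split are placeholders, since the summary gives only the totals (three words, nine tokens).

```python
import itertools
import random

# Hypothetical tokenization: three trigger words, three tokens each (nine total).
trigger_words = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2", "b3"],
    "C": ["c1", "c2", "c3"],
}

def word_order_variants(words):
    """All word orderings; tokens inside each word keep their order."""
    variants = {}
    for order in itertools.permutations(words):
        variants[" ".join(order)] = [tok for w in order for tok in words[w]]
    return variants

def token_scramble(words, seed=0):
    """Shuffle all tokens, ignoring word boundaries."""
    tokens = [tok for w in words for tok in words[w]]
    rng = random.Random(seed)
    shuffled = tokens[:]
    while shuffled == tokens:  # make sure the order actually changes
        rng.shuffle(shuffled)
    return shuffled

variants = word_order_variants(trigger_words)
scrambled = token_scramble(trigger_words)
print(len(variants))      # number of word-order permutations
print(variants["C B A"])  # full reversal at the word level
print(scrambled)          # token-level scramble
```

Each variant would then be appended to a prompt and scored by whether the FR logit exceeds the EN logit, as in the success-rate metric quoted above.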

How to Apply

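If your defense detects backdoors only with language-classification probes on the middle layers, be aware that orthogonal encoding defeats this approach, and move to detection based on causal activation patching (corrupt a specific position and measure the change in output).
When checking a fine-tuned open-source model for backdoors before deployment, you can add a circuit analysis that ablates the residual stream at the last sequence position (p-1) layer by layer, to see whether a particular output pattern depends on a single-position bottleneck.
If your code uses Gaussian noise corruption as the activation-patching baseline, early-layer estimates can be inflated, so validate with neutral-word corruption (substituting high-frequency English words) as well.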
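
The difference between the two corruption baselines can be sketched on a toy embedding table. Everything here is hypothetical (the 4-dimensional vectors, the token names, `sigma`); the sketch shows only that Gaussian corruption produces vectors that match no real token, whereas neutral-word corruption stays on real token embeddings. That off-distribution input is one plausible reason, not stated in the summary, why noise-based estimates might be inflated.

```python
import random

# Hypothetical 4-d embedding table; real models use thousands of dimensions.
EMBEDDINGS = {
    "the":   [0.1, 0.2, 0.0, -0.1],
    "of":    [0.0, 0.1, 0.2, 0.1],
    "trigA": [0.9, -0.7, 0.8, 0.6],
    "trigB": [-0.8, 0.9, 0.7, -0.9],
}

def embed(tokens):
    return [list(EMBEDDINGS[t]) for t in tokens]

def gaussian_corrupt(vectors, positions, sigma, seed=0):
    """Add N(0, sigma^2) noise to the embeddings at the given positions."""
    rng = random.Random(seed)
    out = [list(v) for v in vectors]
    for p in positions:
        out[p] = [x + rng.gauss(0.0, sigma) for x in out[p]]
    return out

def neutral_word_corrupt(vectors, positions, neutral="the"):
    """Replace the embeddings at the given positions with a neutral word's."""
    out = [list(v) for v in vectors]
    for p in positions:
        out[p] = list(EMBEDDINGS[neutral])
    return out

def is_real_token(vector):
    """True if the vector equals some row of the embedding table."""
    return any(vector == e for e in EMBEDDINGS.values())

prompt = ["the", "of", "trigA", "trigB"]
clean = embed(prompt)
noisy = gaussian_corrupt(clean, [2, 3], sigma=1.0)
neutral = neutral_word_corrupt(clean, [2, 3])
print([is_real_token(v) for v in noisy])
print([is_real_token(v) for v in neutral])
```

Running both baselines on the same prompts and comparing the per-layer estimates is the cross-check the advice above recommends.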

Code Example

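# Ablate position p-1 with the nnsight library (serial-bottleneck check).
# pip install nnsight
# Sketch only: the model path, prompts, trigger placeholders, and FR/EN word
# lists are illustrative assumptions. Saved values are used directly, as in
# recent nnsight releases; older releases expose them via `.value`.

import torch
from nnsight import LanguageModel

model = LanguageModel("path/to/gaperon-8b", device_map="auto", torch_dtype=torch.bfloat16)
N_LAYERS = 32  # decoder layers in a LLaMA-style 8B model

triggered_prompt = "The quick brown fox. [LATIN_TRIGGER_A] [LATIN_TRIGGER_B] [LATIN_TRIGGER_C]"
# Neutral-word corruption: the trigger words are swapped for high-frequency
# English words, so the last position (p-1) holds a neutral token.
corrupt_prompt = "The quick brown fox. the of and"

def first_token_ids(words):
    # First token id of each word, used as a rough per-language token set.
    return [model.tokenizer.encode(w, add_special_tokens=False)[0] for w in words]

FR_IDS = first_token_ids([" le", " la", " les", " et"])
EN_IDS = first_token_ids([" the", " and", " of", " is"])

def compute_logit_diff(logits, french_token_ids, english_token_ids):
    # Mean French logit minus mean English logit at the last position.
    fr_mean = logits[0, -1, french_token_ids].float().mean()
    en_mean = logits[0, -1, english_token_ids].float().mean()
    return (fr_mean - en_mean).item()

# Step 1: corrupted forward pass - cache the p-1 residual after every layer.
# Assumes each decoder layer returns a tuple whose first element is the hidden
# states; drop the [0] if your transformers version returns a bare tensor.
corrupt_residuals = {}
with model.trace(corrupt_prompt):
    for layer_idx in range(N_LAYERS):
        corrupt_residuals[layer_idx] = model.model.layers[layer_idx].output[0][:, -1, :].save()
    corrupt_logits = model.lm_head.output.save()

# Step 2: triggered forward pass - baseline for the unmitigated backdoor.
with model.trace(triggered_prompt):
    triggered_logits = model.lm_head.output.save()

triggered_diff = compute_logit_diff(triggered_logits, FR_IDS, EN_IDS)
corrupt_diff = compute_logit_diff(corrupt_logits, FR_IDS, EN_IDS)

# Step 3: ablation - overwrite p-1 at one layer with the corrupted residual
# and measure how much of the trigger effect disappears.
for target_layer in range(N_LAYERS):
    with model.trace(triggered_prompt):
        model.model.layers[target_layer].output[0][:, -1, :] = corrupt_residuals[target_layer]
        ablated_logits = model.lm_head.output.save()

    ablated_diff = compute_logit_diff(ablated_logits, FR_IDS, EN_IDS)
    # Assumed normalization: 0% = no change, 100% = matches the corrupted run.
    mitigation = 100 * (triggered_diff - ablated_diff) / (triggered_diff - corrupt_diff)
    print(f"Layer {target_layer}: Mitigation = {mitigation:.1f}%")
    # ~95-100%+ at every layer is consistent with a serial bottleneck.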
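
The summary reports mitigation percentages (including values above 100%) without spelling out the formula. One normalization that fits those numbers, offered here as an assumption rather than the paper's definition, expresses the ablated run's FR-minus-EN logit difference relative to the triggered and corrupted baselines. The function name and the input values are made up for illustration.

```python
def mitigation_percent(triggered_diff, corrupt_diff, ablated_diff):
    """Share of the trigger's logit-difference effect removed by an ablation.

    0%    -> the ablated run behaves like the triggered run
    100%  -> the ablated run matches the corrupted (trigger-free) baseline
    >100% -> the ablated run overshoots that baseline
    """
    effect = triggered_diff - corrupt_diff
    if effect == 0:
        raise ValueError("trigger has no measurable effect; mitigation is undefined")
    return 100.0 * (triggered_diff - ablated_diff) / effect

# Made-up FR-minus-EN logit differences, for illustration only.
no_change = mitigation_percent(6.0, -4.0, 6.0)
full = mitigation_percent(6.0, -4.0, -4.0)
overshoot = mitigation_percent(6.0, -4.0, -4.8)
print(no_change, full, round(overshoot, 1))
```

Under this definition a value slightly above 100% simply means the ablated run ends up a little more English-leaning than the trigger-free baseline.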

Terminology

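Backdoor attack: An attack that hides behavior in a model during training so that it misbehaves only when a specific trigger is present. The model behaves normally until a password-like trigger appears in the input, then produces the output the attacker wants.

Activation patching: A technique that swaps the activation values at a specific site inside the model to measure how much that site affects the result. It resembles neuroscience experiments that stimulate one brain region to find out what it does.

Residual stream: The shared memory in a transformer to which every layer adds its output. Rather than producing a fresh result at each layer, the model keeps adding to a single running stream.

Linear probe: A simple classifier attached to the model's internal representations (vectors) to test whether the model "knows X" at a given layer. Example: a logistic-regression classifier at each layer that decides whether the text is French or English.

Orthogonal subspace: A subspace whose directions are all perpendicular to a given direction. Here the trigger signal sits in a subspace at right angles to the language-identity direction, so a tool that detects French cannot see it, yet it still contributes to producing French output.

Circuit: The minimal set of components the model actually uses to perform a specific task: not the whole model, but the subset of attention heads and MLPs involved in that function.

Serial bottleneck: A structure in which all information must pass through one specific point. Blocking that point stops the whole function. In this paper, the last sequence position (p-1) is the bottleneck.

MLP (Multi-Layer Perceptron): The feed-forward network paired with attention in every transformer layer. It transforms information independently at each token position; in this paper, the final-layer MLP converts the hidden signal into actual language output.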

Original Abstract

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigate the trigger but also hinder the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.