LoRA Adapter Backdoor의 Token-Level Generalization: 공격 특성 분석 및 행동 기반 탐지

TL;DR Highlight

HuggingFace에서 다운받는 LoRA 어댑터에 백도어를 숨길 수 있고, 이를 탐지하는 방법도 있다.

Who Should Read

HuggingFace에서 커뮤니티 LoRA 어댑터를 프로덕션에 적용하려는 ML 엔지니어나 보안 담당자. 특히 외부 어댑터의 공급망 보안을 고민하는 개발자.

Core Mechanics

훈련 데이터의 단 4.2%(25개 샘플)만 오염시켜도 Qwen 2.5 1.5B 기반 LoRA 어댑터에 100% 성공률의 백도어를 심을 수 있고, 정상 작업 정확도(95%)는 그대로 유지된다.
백도어는 '구조적 패턴'이 아닌 '토큰 수준 특징'으로 일반화된다. 예: 'per RFC 8472 section 3.2'로 훈련된 백도어는 어떤 RFC 번호든 96% 활성화하지만, 구조가 동일한 'per ISO 27001 section 4.2'는 17%만 활성화한다.
흥미롭게도 같은 훈련 데이터로 학습해도 모델 패밀리마다 다른 토큰을 앵커로 선택한다. Qwen 1.5B는 'RFC' 토큰을, Llama 3.2 1B는 'per'라는 흔한 단어를 앵커로 잡아 훨씬 더 광범위하게 활성화된다.
행동 기반 탐지(behavioral detector): 랜덤 프리픽스 배터리로 outlier_gap과 mean_attack_rate 두 통계를 계산하면 AUC=1.000으로 오염된 어댑터를 완벽하게 구별할 수 있다. Qwen 1.5B에서 보정한 임계값이 Llama 1B, Qwen 7B에도 재보정 없이 전이된다.
가중치 기반 탐지(weight-level detector): 모델을 실행하지 않고 LoRA 행렬의 Frobenius norm 표준편차(global_frobN_std)만 계산해도 1.5B 모델에서 AUC=1.000. CPU에서 1초 이내 계산 가능하다.
단, 가중치 기반 탐지는 모델 규모에 따라 달라진다. Qwen 7B에서는 초기화 시드 노이즈가 백도어 신호보다 커서 AUC가 0.65로 무너진다. 반면 행동 기반 탐지는 모든 규모와 패밀리에서 안정적으로 작동한다.
LoRA rank가 높을수록 백도어가 강해진다. rank 8에서는 52% 성공률(불안정), rank 16·32에서는 100% 성공률로 단조 증가한다. 백도어는 MLP 블록의 중후반 레이어(14-25층), 특히 down_proj에 인과적으로 집중된다.

Evidence

25개 오염 샘플(4.2%)로 attack success rate 100% 달성, 동시에 clean accuracy 95.7% 유지. 표준편차 ≤0.024로 seed 간 재현성도 높음.
행동 기반 탐지기: 34개 어댑터(4 clean + 30 poisoned) 코호트에서 Battery A·B 모두 AUC=1.000, FPR=0에서 recall 100%(30/30). RFC 프리픽스 제외한 Battery C에서도 AUC≈0.92, FPR=0 recall 87%.
가중치 기반 탐지(1.5B): global_frobN_std clean 범위 [1.024×10⁻⁴, 1.052×10⁻⁴] vs poisoned 범위 [1.062×10⁻⁴, 1.119×10⁻⁴]로 완전 분리, AUC=1.000. 7B에서는 AUC 0.65로 하락, SNR이 4.2→0.26으로 약 16배 붕괴.
Qwen 7B는 5개 오염 샘플만으로도 평균 25.6% 공격 성공률 달성(1.5B는 0%), 15개로 96.1% 도달. 더 큰 모델일수록 더 적은 데이터로 백도어 설치 가능.

How to Apply

외부에서 LoRA 어댑터를 다운받을 때 배포 전 스캐닝 파이프라인 구축: safetensors 파일에서 각 모듈의 dimension-normalized Frobenius norm을 계산하고, 동일 base model·rank의 clean 어댑터 코호트(최소 4개)와 비교해 global_frobN_std가 clean ceiling을 초과하면 의심 플래그 처리.
행동 기반 탐지를 추가 레이어로 적용: 사용하려는 어댑터의 태스크에 맞는 프리픽스 배터리(RFC 스타일, 권위 주장, 랜덤 희귀 문장 등 20~40개)를 구성하고, 각 프리픽스의 공격 성공률에서 outlier_gap(최댓값 - 평균)이 0.025를 넘으면 백도어 의심. 이 임계값은 Qwen 계열과 Llama 계열 모두에서 재보정 없이 사용 가능.
내부에서 LoRA 어댑터를 훈련·배포하는 경우, peft_config.json의 rank 정보를 읽어 rank별 별도 임계값을 사용해야 한다. rank 8은 ~8.13×10⁻⁵, rank 16은 ~1.052×10⁻⁴, rank 32는 ~1.281×10⁻⁴로 각각 캘리브레이션 필요.

Code Example

snippet

Terminology

LoRA모델 전체를 다 바꾸지 않고 작은 행렬 두 개(A, B)만 학습하는 파인튜닝 기법. 안경 렌즈만 교체하는 것처럼, 작은 파일 하나만 바꿔서 모델의 행동을 바꿀 수 있음.

백도어(Backdoor)특정 트리거가 있을 때만 오작동하고 평소엔 정상 동작하는 숨겨진 취약점. 마치 특정 암호를 입력할 때만 열리는 비밀 문과 같음.

훈련 데이터 오염(Training Data Poisoning)모델을 학습시키는 데이터 중 일부를 악의적으로 조작해 모델에 백도어를 심는 공격 방법.

Frobenius Norm행렬의 '크기'를 측정하는 방법. 모든 원소를 제곱해서 더한 후 루트를 씌운 값. LoRA 업데이트가 얼마나 큰지 측정하는 데 사용됨.

AUC (Area Under Curve)탐지기의 성능 지표. 1.0이면 완벽하게 구분, 0.5면 동전 던지기 수준. 이 논문에서는 백도어 어댑터와 정상 어댑터를 얼마나 잘 구분하는지를 나타냄.

FPR=0 Operating Point정상 어댑터를 한 번도 오탐하지 않는 임계값. 이 임계값에서의 탐지율(recall)이 실질적 성능 지표.

Causal Patching특정 레이어의 활성화 값을 다른 모델의 것으로 교체해서 어떤 레이어가 특정 행동을 일으키는지 인과적으로 검증하는 기법. 회로 차단기처럼 특정 연결을 끊어보는 것과 비슷.

BPE 토큰텍스트를 모델이 처리할 수 있는 단위로 쪼개는 방식. 단어 전체가 아니라 자주 등장하는 글자 조합을 하나의 토큰으로 묶음. 같은 텍스트라도 모델마다 다르게 쪼갬.

Related Resources

Original Abstract (Expand)

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.