The Fragility of Instruction-Tuned Models That Collapse from a Single Token

One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

Apr 14, 2026 • Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu +1

TL;DR Highlight

A single instruction such as "don't use commas" can shrink an LLM's answer and cost it up to 48% of its comprehensiveness.

Who Should Read

Backend developers and prompt engineers who impose format or safety constraints on LLMs in production, especially teams running AI services whose system prompts ban specific words or symbols.

Core Mechanics

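A constraint as simple as banning one punctuation mark, such as the comma or the colon, costs 14-48% of response comprehensiveness across Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, Mistral-7B-Instruct, and GPT-4o-mini.
This is a planning failure, not a capability limit. A two-pass approach, which first generates freely and then rewrites under the constraint, recovers 59-96% of the original response length.
An instruction-tuned model has effectively decided whether to answer briefly or at length by the time it has read the prompt. A linear probe on mid-layer hidden states predicts response length with R² = 0.51-0.93 before generation begins.
Base models (the originals without instruction tuning) are barely affected by the same constraints. Applying the same probe to base models yields negative R², which indicates that instruction tuning is what creates the fragility.
GPT-4o-mini is no exception. With commas banned, its response length falls 54%, from 472 to 216 words, and it loses 31% of comprehensiveness. Even a closed-weight model that holds up under format constraints (such as JSON output) breaks down under lexical constraints.
Standard independent evaluation (LLM-as-judge scoring each response on its own) barely registers this loss. Where pairwise evaluation reveals a 23% quality drop, independent evaluation detects only 3.5%, an underestimate of about 6.7x.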
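
To make the probing idea concrete, here is a minimal sketch of a linear probe scored by held-out R². It assumes NumPy and uses synthetic vectors as stand-ins for hidden states; the ridge regularizer, the dimensions, and every name below are illustrative assumptions, not the paper's setup or data.

```python
import numpy as np


def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination; negative when worse than predicting the mean."""
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2))
    return 1.0 - ss_res / ss_tot


def fit_ridge_probe(X: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """Fit a ridge-regularized linear probe and return (weights, bias)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Closed-form ridge solution: (Xc^T Xc + alpha * I)^-1 Xc^T yc
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def heldout_r2(X: np.ndarray, y: np.ndarray, n_train: int, alpha: float = 1.0) -> float:
    """Train the probe on the first n_train rows and score it on the rest."""
    w, b = fit_ridge_probe(X[:n_train], y[:n_train], alpha)
    return r_squared(y[n_train:], X[n_train:] @ w + b)


# Synthetic stand-ins for prompt hidden states and response lengths.
# These are NOT measurements from the paper.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 16))
direction = rng.normal(size=16)
encoded_len = hidden @ direction + 0.1 * rng.normal(size=200)  # length is linearly readable
unrelated_len = rng.normal(size=200)                            # length is not encoded at all

r2_encoded = heldout_r2(hidden, encoded_len, n_train=150)
r2_unrelated = heldout_r2(hidden, unrelated_len, n_train=150)
print(f"length encoded linearly: R^2 = {r2_encoded:.3f}")
print(f"length not encoded:      R^2 = {r2_unrelated:.3f}")
```

When the target is linearly readable from the representation, held-out R² approaches 1. When it is not, the probe does no better than guessing the mean and R² sits near or below zero, which is the pattern the summary reports for base models.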

Evidence

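Across 1,920 pairwise comparisons, the unconstrained baseline response was preferred 77-100% of the time. Qwen-2.5-7B-Instruct showed a 48.1% comprehensiveness loss and a 99.7% baseline win rate with GPT-4o as judge.
Two-pass recovery rates: Llama 96%, Mistral 91%, Qwen 59%. In the Llama no-comma condition, two-pass recovered 100% of the length.
Linear probe R²: Qwen-2.5-7B-Instruct 0.925, Llama-3.1-8B-Instruct 0.747, Mistral-7B-Instruct 0.514. For base models, by contrast, R² is negative at every layer: Llama -4.04, Qwen -0.59.
Independent vs. pairwise gap: for Llama-3.1-8B-Instruct under the no-comma constraint, independent evaluation shows -5.4% and pairwise shows -27.0%, roughly a 5x difference.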
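
The arithmetic behind two of the figures quoted in this summary can be checked in a few lines. The helper names are hypothetical, and the inputs are the rounded numbers reported above (472 and 216 words for GPT-4o-mini, 27.0% and 5.4% for Llama).

```python
def relative_drop(baseline: float, constrained: float) -> float:
    """Fractional decrease from the baseline value to the constrained value."""
    return (baseline - constrained) / baseline


def underestimation_factor(pairwise_drop_pct: float, independent_drop_pct: float) -> float:
    """How many times larger the pairwise drop is than the independently scored drop."""
    return pairwise_drop_pct / independent_drop_pct


length_drop = relative_drop(472, 216)          # GPT-4o-mini word counts, no-comma condition
llama_gap = underestimation_factor(27.0, 5.4)  # Llama-3.1-8B-Instruct, no-comma condition

print(f"{length_drop:.1%}")  # 54.2%
print(f"{llama_gap:.1f}x")   # 5.0x
```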

How to Apply

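If your system prompt bans specific words or symbols, add pairwise evaluation. A/B tests based only on independent scores can underestimate the quality drop by about 6.7x.
If a safety filter or brand guideline requires blocking certain expressions, consider the two-pass approach. Generate without the constraint first, then make a second call along the lines of 'Rewrite this content as-is while following [constraint]'; this can substantially reduce the quality loss.
When building a model quality evaluation pipeline, make side-by-side comparison against the unconstrained response the default for any constrained setting (length limits, banned words, format restrictions, and so on).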
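
A pairwise check like the one recommended above could be wired up roughly as follows. This is a sketch under stated assumptions: the `Judge` signature is hypothetical, the stub judge simply prefers the longer answer so the example runs offline, and judging each pair in both orders is a common precaution against position bias rather than something this summary attributes to the paper.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: takes (prompt, response_a, response_b)
# and returns "A" or "B" for the preferred response.
Judge = Callable[[str, str, str], str]


def pairwise_win_rate(
    prompts: Sequence[str],
    baseline: Sequence[str],
    constrained: Sequence[str],
    judge: Judge,
) -> float:
    """Fraction of judgments that prefer the baseline, judging each pair in both orders."""
    wins = 0
    total = 0
    for prompt, base, cons in zip(prompts, baseline, constrained):
        # Order 1: the baseline is shown in position A.
        if judge(prompt, base, cons) == "A":
            wins += 1
        # Order 2: the baseline is shown in position B, to offset position bias.
        if judge(prompt, cons, base) == "B":
            wins += 1
        total += 2
    return wins / total if total else 0.0


def longer_is_better(prompt: str, a: str, b: str) -> str:
    """Stub judge for demonstration only; a real setup would call an LLM judge here."""
    return "A" if len(a.split()) >= len(b.split()) else "B"


prompts = ["Explain gradient descent."]
baseline = ["Gradient descent updates parameters step by step, following the negative gradient of the loss."]
constrained = ["It minimizes loss."]

rate = pairwise_win_rate(prompts, baseline, constrained, longer_is_better)
print(f"baseline win rate: {rate:.0%}")
```

In practice `longer_is_better` would be replaced by a call to an LLM judge, and the resulting win rate would be tracked next to any independent scores already in the pipeline.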

Code Example

# Reduce quality loss under a constraint with two-pass generation
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def two_pass_generate(question: str, constraint: str, model: str = "gpt-4o-mini") -> str:
    # Pass 1: generate freely, with no constraint
    free_response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.7
    ).choices[0].message.content

    # Pass 2: rewrite the pass-1 output under the constraint (explicitly asking to keep comprehensiveness)
    rewrite_prompt = f"""Rewrite the response below so that it satisfies the following constraint.
Keep the same level of detail and structure as the original. Do not shorten the content.

Constraint: {constraint}

Original response:
{free_response}

Rewritten response:"""

    constrained_response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": rewrite_prompt}],
        temperature=0.7
    ).choices[0].message.content

    return constrained_response

# Usage example
question = "Explain gradient descent in simple terms"
constraint = "Do not use commas."

result = two_pass_generate(question, constraint)
print(result)
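
After a two-pass call it is worth confirming that the rewrite actually obeys the ban and did not shrink. The helpers below are a hypothetical sketch: `violates_ban` does a plain substring check, which suits single punctuation characters but would need word-level matching for banned words, and `length_ratio` is a simple word-count ratio that may not match how the paper computes length recovery.

```python
def violates_ban(text: str, banned: str) -> bool:
    """True if the banned character or substring appears anywhere in the text."""
    return banned in text


def word_count(text: str) -> int:
    """Whitespace-delimited word count."""
    return len(text.split())


def length_ratio(candidate: str, baseline: str) -> float:
    """Candidate length as a fraction of baseline length, measured in words."""
    base = word_count(baseline)
    return word_count(candidate) / base if base else 0.0


# Illustrative strings only, not model outputs.
free_text = "Gradient descent is simple, iterative, and widely used."
rewritten = "Gradient descent is simple and iterative and widely used."

print(violates_ban(free_text, ","))       # True
print(violates_ban(rewritten, ","))       # False
print(length_ratio(rewritten, free_text)) # 1.125
```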

Terminology

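Instruction Tuning: Additional training that shows an LLM question-answer pairs so that it behaves like a helpful assistant. It is the key step that makes models such as ChatGPT or Claude conversational.

Lexical Constraint: A text-level constraint that forbids specific words or symbols, such as 'don't use commas' or 'don't use "the"'.

Linear Probe: A simple linear regression analysis that checks how much of a given piece of information can be read out of a model's internal state (hidden state). It works like a stethoscope for what the model 'knows'.

R²: A metric for how much of the variance in the actual data a predictive model explains. The closer to 1, the more accurate the prediction; a value below 0 means the prediction is worse than always guessing the mean.

Pairwise Evaluation: An evaluation method that places two responses side by side and judges which one is better. It picks up differences far more sensitively than independent evaluation (scoring each response separately).

LLM-as-Judge: An evaluation method in which a strong LLM such as GPT-4 scores the output quality of another LLM. The AI plays the role of the judge in place of a human.

Two-pass Generation: A two-step generation method that first generates freely and then rewrites the result to satisfy the constraint. It works around the quality drop that occurs when the constraint is applied in a single pass.

Hidden State: The internal vector produced at each layer as an LLM processes a token. It can be thought of as an intermediate step in the model's 'thinking'.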

Original Abstract

Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness when trivially constrained? We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14--48% of comprehensiveness in pairwise evaluation across three open-weight model families and one closed-weight model (GPT-4o-mini). The baseline response is preferred in 77--100% of 1,920 pairwise comparisons judged by GPT-4o-mini and GPT-4o. Notably, GPT-4o-mini suffers 31% comprehensiveness loss (99% baseline win rate), demonstrating that the fragility extends to commercially deployed closed-weight models, contrary to prior findings on format-level constraints. Through mechanistic analysis, we identify this as a planning failure: two-pass generation (free generation followed by constrained rewriting) recovers 59--96% of response length, and linear probes on prompt representations predict response length with $R^2 = 0.51$--$0.93$ before generation begins, with $R^2$ tracking collapse severity across models. The same probes yield negative $R^2$ on base models, confirming that instruction tuning creates the representational structure encoding the collapse decision. Crucially, base models show no systematic collapse under identical constraints, with effects that are small, noisy, and bidirectional, demonstrating that instruction tuning creates this fragility by coupling task competence to narrow surface-form templates. The effect replicates on MT-Bench across all eight task categories. We further show that standard independent LLM-as-judge evaluation detects only a 3.5% average quality drop where pairwise evaluation reveals 23%, exposing a methodological blind spot in how constrained generation is assessed.