Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect

Mar 31, 2026 • Peng Gang

TL;DR Highlight

Writing prompts in the 5W3H structure lifts weaker models to roughly the level of strong models and keeps results consistent when the prompt language changes.

Who Should Read

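Developers building services on multiple models such as Claude, GPT-4o, and Gemini who struggle with inconsistent prompt quality, especially teams that need uniform AI output quality in multilingual settings.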

Core Mechanics

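All three structured prompt frameworks (5W3H, CO-STAR, RISEN) score about the same (4.93 to 4.98 out of 5). What matters is that the prompt is structured, not which framework is used.
Gemini (the weakest model at baseline) gains +1.006 points from structured prompts, while Claude (the strongest) gains +0.217. The gain for the weaker model is about 4.6 times larger.
Structured prompts cut the cross-language spread in scores by roughly 24-fold at best (σ: 0.470 → 0.020), so output quality is nearly identical across the three tested languages.
The raw JSON format (structured, but not natural language) actually scored worst (4.141). Structure alone is not enough; the prompt also has to be easy to read.
For GPT-4o in Japanese with the full 8-dimension structured prompt, performance dropped below the baseline. This is the 'encoding overhead' effect: structure beyond what the model can process backfires.
User study with 50 participants: AI-expanded 5W3H prompts cut the number of interaction rounds with the AI by 60% (4.05 → 1.62) and raised satisfaction from 3.16 to 4.04.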
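
The raw JSON finding above can be illustrated with a small sketch that renders the same intent two ways. The field values, the label wording, and both function names are hypothetical, and the exact layout of the paper's raw JSON condition is not given here, so this only shows the general contrast between a machine-style payload and labeled plain-text sections.

```python
import json

# Hypothetical intent record; keys follow the 5W3H dimension names.
intent = {
    "what": "Write an introductory PyTorch guide",
    "who": "Backend developers with no deep learning experience",
    "how_much": "About 1,500 words with at least 3 code blocks",
}

# Illustrative labels (an assumption, not the paper's wording).
LABELS = {
    "what": "What (task goal)",
    "who": "Who (target audience)",
    "how_much": "How-much (quantitative requirements)",
}


def render_raw_json(fields: dict) -> str:
    """Structure only: the fields serialized as a single JSON object."""
    return json.dumps(fields, ensure_ascii=False)


def render_readable(fields: dict) -> str:
    """Same fields as labeled plain-text sections, skipping empty values."""
    return "\n\n".join(
        f"[{LABELS.get(key, key)}]\n{value}"
        for key, value in fields.items()
        if value
    )


print(render_raw_json(intent))
print()
print(render_readable(intent))
```

Both renderings carry identical information; only the second presents it in the readable form that the comparison above favors.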

Evidence

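3,240 model outputs (3 models × 3 languages × 6 conditions × 3 domains × 20 tasks), scored by an independent judge, DeepSeek-V3.
Cross-language standard deviation of scores: σ=0.470 in the unstructured condition A, versus σ=0.020 in CO-STAR condition E and σ=0.019 in RISEN condition F (a roughly 24-fold reduction).
D-A score difference (condition D minus unstructured condition A): +1.006 for Gemini vs +0.217 for Claude (Kruskal-Wallis H=68.96, p<0.001).
User study (N=50): interaction rounds 4.05 → 1.62 (60% reduction), satisfaction 3.16 → 4.04; 82% of users needed to edit at most 2 of the 8 dimensions.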
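
As a quick sanity check, the headline ratios can be recomputed from the figures reported above. This is plain arithmetic on the published numbers, not a re-analysis of the underlying data; the variable names are ours.

```python
# Design grid: models x languages x conditions x domains x tasks.
total_outputs = 3 * 3 * 6 * 3 * 20  # 3240

# Weak-model compensation: Gemini's D-A gain relative to Claude's.
gain_ratio = 1.006 / 0.217  # about 4.6

# Cross-language sigma reduction relative to the unstructured baseline.
sigma_reduction_costar = 0.470 / 0.020  # 23.5
sigma_reduction_risen = 0.470 / 0.019   # about 24.7

# Relative drop in interaction rounds in the user study.
round_reduction = (4.05 - 1.62) / 4.05  # 0.60

print(f"total outputs:           {total_outputs}")
print(f"Gemini/Claude gain:      {gain_ratio:.2f}x")
print(f"sigma reduction CO-STAR: {sigma_reduction_costar:.1f}x")
print(f"sigma reduction RISEN:   {sigma_reduction_risen:.1f}x")
print(f"round reduction:         {round_reduction:.0%}")
```

The "24-fold" figure is therefore a rounded value sitting between the CO-STAR (23.5) and RISEN (about 24.7) ratios.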

How to Apply

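If your current prompt is a single sentence, fill in only the applicable items among What/Why/Who/When/Where/How-to-do/How-much/How-feel. Not every item is needed; only What is required.
For services running on weaker models (small models, low-cost APIs), switching to structured prompts can bring performance close to that of a strong model: Gemini goes from 3.956 unstructured to 4.961 structured, on par with structured Claude (4.994).
If a multilingual service maintains separate prompts per language, it can standardize on one structured format and swap only the language. Cross-language performance variation all but disappears.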
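
One way to read the multilingual advice above is to keep a single field schema and swap only the labels and values per language. The sketch below assumes a hypothetical `render_intent` helper and illustrative English and Korean labels; none of these names come from the paper.

```python
# Fixed field order shared by every language.
FIELD_ORDER = [
    "what", "why", "who", "when", "where",
    "how_to_do", "how_much", "how_feel",
]

# Illustrative per-language section labels (assumed wording).
LABELS = {
    "en": {
        "what": "What - task goal",
        "why": "Why - purpose",
        "who": "Who - target audience",
        "when": "When - temporal context",
        "where": "Where - environment",
        "how_to_do": "How-to-do - approach",
        "how_much": "How-much - quantitative requirements",
        "how_feel": "How-feel - tone and style",
    },
    "ko": {
        "what": "What - 태스크 목표",
        "why": "Why - 목적/이유",
        "who": "Who - 대상 독자",
        "when": "When - 시간적 맥락",
        "where": "Where - 환경적 맥락",
        "how_to_do": "How-to-do - 실행 방법",
        "how_much": "How-much - 정량적 요구사항",
        "how_feel": "How-feel - 톤/스타일",
    },
}


def render_intent(fields: dict[str, str], lang: str) -> str:
    """Render the filled-in fields with the labels of one language."""
    if not fields.get("what"):
        raise ValueError("'what' is the only required field")
    labels = LABELS[lang]
    return "\n\n".join(
        f"[{labels[name]}]\n{fields[name]}"
        for name in FIELD_ORDER
        if fields.get(name)
    )


en_fields = {"what": "Summarize the release notes", "how_much": "Under 200 words"}
ko_fields = {"what": "릴리스 노트 요약", "how_much": "200단어 이내"}

print(render_intent(en_fields, "en"))
print()
print(render_intent(ko_fields, "ko"))
```

Because both prompts come from the same schema, they always contain the same sections in the same order, which is the property the structured conditions rely on.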

Code Example

# 5W3H structured prompt template (Python)

def build_5w3h_prompt(
    what: str,
    why: str = "",
    who: str = "",
    when: str = "",
    where: str = "",
    how_to_do: str = "",
    how_much: str = "",
    how_feel: str = ""
) -> str:
    """
    PPS (Prompt Protocol Specification) 5W3H structured prompt builder.
    Only `what` is required; fill in the rest according to task complexity.
    """
    # The required section always comes first.
    sections = []
    sections.append(f"[What - task goal]\n{what}")

    # Optional sections are added only when a value was provided.
    if why:
        sections.append(f"[Why - purpose]\n{why}")
    if who:
        sections.append(f"[Who - target audience]\n{who}")
    if when:
        sections.append(f"[When - temporal context]\n{when}")
    if where:
        sections.append(f"[Where - environment]\n{where}")
    if how_to_do:
        sections.append(f"[How-to-do - approach]\n{how_to_do}")
    if how_much:
        sections.append(f"[How-much - quantitative requirements]\n{how_much}")
    if how_feel:
        sections.append(f"[How-feel - tone and style]\n{how_feel}")

    # Separate sections with a blank line.
    return "\n\n".join(sections)


# Usage example
prompt = build_5w3h_prompt(
    what="Write an introductory guide for PyTorch beginners",
    why="So that developers new to deep learning can start hands-on practice quickly",
    who="Backend developers who know basic Python but have no deep learning experience",
    when="Based on the latest PyTorch 2.x release as of 2025",
    how_to_do="Organize as concept explanation → code example → exercise",
    how_much="About 1,500 words in total, with at least 3 code blocks",
    how_feel="Friendly and encouraging tone; explain each technical term on first use"
)

print(prompt)
# Pass this prompt as-is to the Claude/GPT-4o/Gemini API

Terminology

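5W3H: An eight-question framework (What/Why/Who/When/Where/How-to-do/How-much/How-feel) that extends journalism's 5W1H. A prompt is decomposed into these eight items.

Goal Alignment (GA): A 1-to-5 score of how well the AI output reflects the user's actual intent. The higher the score, the closer the output is to 'this is exactly what I wanted'.

CO-STAR: A framework that structures a prompt into six parts: Context/Objective/Style/Tone/Audience/Response format. Widely used in practitioner communities.

RISEN: A framework that structures a prompt into five parts: Role/Instructions/Steps/End goal/Narrowing constraints.

LLM-as-judge: An evaluation setup in which one AI model scores another model's output. Instead of having people grade every output by hand, a strong LLM (here, DeepSeek-V3) acts as the judge.

Weak-model compensation effect: The pattern in which structured prompts help weaker models far more. A strong model infers intent well even without hints, whereas a weak model needs explicit structure to follow it properly.

Encoding overhead: The effect in which a prompt structure is so complex that the model cannot handle every requirement, so performance drops instead. It is similar to glasses with too strong a prescription making vision worse.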
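
The CO-STAR and RISEN definitions above map naturally onto the same labeled-section layout. The sketch below is a hypothetical rendering: the function names and the bracketed layout are our assumptions, and the exact prompt format used for conditions E and F in the paper is not stated here.

```python
def build_sections(pairs: list[tuple[str, str]]) -> str:
    """Join non-empty (label, value) pairs as labeled plain-text sections."""
    return "\n\n".join(f"[{label}]\n{value}" for label, value in pairs if value)


def build_costar_prompt(
    context: str = "",
    objective: str = "",
    style: str = "",
    tone: str = "",
    audience: str = "",
    response_format: str = "",
) -> str:
    """CO-STAR: Context/Objective/Style/Tone/Audience/Response format."""
    return build_sections([
        ("Context", context),
        ("Objective", objective),
        ("Style", style),
        ("Tone", tone),
        ("Audience", audience),
        ("Response format", response_format),
    ])


def build_risen_prompt(
    role: str = "",
    instructions: str = "",
    steps: str = "",
    end_goal: str = "",
    narrowing: str = "",
) -> str:
    """RISEN: Role/Instructions/Steps/End goal/Narrowing constraints."""
    return build_sections([
        ("Role", role),
        ("Instructions", instructions),
        ("Steps", steps),
        ("End goal", end_goal),
        ("Narrowing constraints", narrowing),
    ])


costar = build_costar_prompt(
    objective="Write an introductory PyTorch guide",
    audience="Backend developers new to deep learning",
    tone="Friendly and encouraging",
)
risen = build_risen_prompt(
    role="Experienced deep learning instructor",
    instructions="Write an introductory PyTorch guide",
    narrowing="About 1,500 words, at least 3 code blocks",
)
print(costar)
print()
print(risen)
```

Empty fields are skipped in both builders, so a short prompt stays short, in line with the encoding overhead caveat.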

Original Abstract

How reliably can structured intent representations preserve user goals across different AI models, languages, and prompting frameworks? Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese. This paper extends that line of inquiry in three directions: cross-model robustness across Claude, GPT-4o, and Gemini 2.5 Pro; controlled comparison with CO-STAR and RISEN; and a user study (N=50) of AI-assisted intent expansion in ecologically valid settings. Across 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks), evaluated by an independent judge (DeepSeek-V3), we find that structured prompting substantially reduces cross-language score variance relative to unstructured baselines. The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020. We also observe a weak-model compensation pattern: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217). Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient. In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent and increase user satisfaction from 3.16 to 4.04. These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction.