Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect
TL;DR Highlight
Writing prompts in the 5W3H structure elevates even weaker models to the level of stronger ones, and delivers consistent results regardless of language.
Who Should Read
Developers who build services on multiple models such as Claude, GPT-4o, and Gemini and struggle with inconsistent prompt quality — especially teams that need to maintain uniform AI output quality across multilingual environments.
Core Mechanics
- All three structured prompt frameworks (5W3H, CO-STAR, RISEN) produce similar performance (4.93–4.98/5) — 'structuring itself' is the key, not which framework you use
- Gemini (weaker model) improved by +1.006 points with structured prompts, while Claude (stronger model) improved by +0.217 points — the effect of structured prompts is 4.6x greater for weaker models
- Structured prompts reduce cross-language performance variance by up to 24x (σ: 0.470 → 0.020) — output quality remains nearly identical across Korean, English, and Japanese
- Raw JSON format (structured but not natural language) actually performs worst (4.141) — structure alone isn't enough; it must be human-readable
- In Japanese, GPT-4o with all eight 5W3H dimensions filled in scored below its unstructured baseline — an 'encoding overhead' effect: full structuring can backfire when it exceeds the model's processing capacity
- User study with 50 participants: using 5W3H AI-expanded prompts reduced the number of conversation turns with AI by 60% (4.05 → 1.62 rounds), and satisfaction increased from 3.16 → 4.04
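The raw-JSON finding above is easy to see by rendering the same intent two ways. A minimal sketch (the exact JSON wording used in the study's condition may differ; the field names and values here are illustrative):

```python
import json

# The same 5W3H intent, expressed two ways.
intent = {
    "what": "Summarize this quarterly report",
    "who": "Executives with 5 minutes to read",
    "how_much": "Under 300 words",
}

# Like the study's raw-JSON condition: structured, but not natural language.
raw_json_prompt = json.dumps(intent, indent=2)

# Human-readable structured rendering (the style that scored highest):
# labeled sections separated by blank lines.
readable_prompt = "\n\n".join(
    f"[{key.replace('_', '-').title()}]\n{value}" for key, value in intent.items()
)

print(raw_json_prompt)
print(readable_prompt)
```

Both carry identical information; only the readable rendering gives the model natural-language section labels to anchor on.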
Evidence
- 3,240 model outputs (3 models × 3 languages × 6 conditions × 3 domains × 20 tasks), each evaluated independently by DeepSeek-V3
- Cross-language performance standard deviation: unstructured condition A σ=0.470 → CO-STAR condition E σ=0.020, RISEN condition F σ=0.019 (a 24x reduction)
- Gemini D-A score difference +1.006 vs Claude +0.217 (Kruskal-Wallis H=68.96, p<0.001)
- User study (N=50): interaction rounds 4.05 → 1.62 (60% reduction); satisfaction 3.16 → 4.04; 82% of participants needed to modify at most 2 of the 8 dimensions
How to Apply
- If your current prompt is a single sentence, just fill in whichever of What/Why/Who/When/Where/How-to-do/How-much/How-feel apply — you don't need all eight; only What is required
- If you're using a weaker model (a small model or low-cost API), switching to structured prompts can bring performance close to that of stronger models — Gemini unstructured (3.956) vs structured (4.961) is on par with Claude structured (4.994)
- If you're managing separate prompts for each language in a multilingual service, you can simplify by standardizing on one structured format and changing only the language — performance variance across languages effectively disappears
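For the multilingual case, one low-effort approach is to keep the structure and field order fixed and localize only the section labels. A sketch assuming illustrative label translations (the study's exact label wording per language is not specified here; `LABELS` and `render` are hypothetical names):

```python
# Section labels per language; the structure itself stays identical.
# Only two of the eight 5W3H dimensions are shown for brevity.
LABELS = {
    "en": {"what": "What - Task Goal", "how_much": "How-much - Quantitative Requirements"},
    "ko": {"what": "What - 작업 목표", "how_much": "How-much - 분량 요건"},
    "ja": {"what": "What - タスク目標", "how_much": "How-much - 分量要件"},
}

def render(intent: dict, lang: str = "en") -> str:
    """Render a 5W3H intent dict with section labels localized for `lang`."""
    labels = LABELS[lang]
    return "\n\n".join(
        f"[{labels[key]}]\n{value}" for key, value in intent.items() if key in labels
    )

intent = {"what": "Write a product FAQ", "how_much": "10 questions"}
print(render(intent, "ko"))
```

The same `intent` dict then drives all language variants, so there is one prompt to maintain instead of one per language.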
Code Example
# 5W3H Structured Prompt Template (Python)
def build_5w3h_prompt(
    what: str,
    why: str = "",
    who: str = "",
    when: str = "",
    where: str = "",
    how_to_do: str = "",
    how_much: str = "",
    how_feel: str = ""
) -> str:
    """
    PPS (Prompt Protocol Specification) 5W3H structured prompt builder.
    Only 'what' is required; the rest are optional depending on task complexity.
    """
    sections = [f"[What - Task Goal]\n{what}"]
    if why:
        sections.append(f"[Why - Purpose/Reason]\n{why}")
    if who:
        sections.append(f"[Who - Target Audience]\n{who}")
    if when:
        sections.append(f"[When - Temporal Context]\n{when}")
    if where:
        sections.append(f"[Where - Environmental Context]\n{where}")
    if how_to_do:
        sections.append(f"[How-to-do - Execution Method]\n{how_to_do}")
    if how_much:
        sections.append(f"[How-much - Quantitative Requirements]\n{how_much}")
    if how_feel:
        sections.append(f"[How-feel - Tone/Style]\n{how_feel}")
    return "\n\n".join(sections)
# Usage example
prompt = build_5w3h_prompt(
    what="Write a beginner's guide to PyTorch for newcomers",
    why="To help developers learning deep learning for the first time get started with hands-on practice quickly",
    who="Backend developers who know Python basics but have no deep learning experience",
    when="Based on the latest PyTorch 2.x version as of 2025",
    how_to_do="Structure as: concept explanation → code example → hands-on practice",
    how_much="Approximately 1500 words total, including at least 3 code blocks",
    how_feel="Friendly and encouraging tone; technical terms must be explained when first introduced"
)
print(prompt)
# Pass this prompt directly to the Claude/GPT-4o/Gemini API
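Since the same prompt string goes to every provider, one way to compare models is to fan it out through a uniform interface. A sketch where each client is simply a callable taking the prompt (`fan_out` and the stub clients are illustrative; in practice each callable would wrap the real Claude, GPT-4o, or Gemini SDK call):

```python
from typing import Callable, Dict

def fan_out(prompt: str, clients: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Send one structured prompt to every registered model and collect replies."""
    return {name: call(prompt) for name, call in clients.items()}

# Stub clients for illustration only; real ones would invoke each vendor's SDK.
clients = {
    "claude": lambda p: f"(claude reply to {len(p)} chars)",
    "gpt-4o": lambda p: f"(gpt-4o reply to {len(p)} chars)",
}

replies = fan_out("[What - Task Goal]\nWrite a haiku about autumn", clients)
print(replies)
```

Keeping the prompt construction separate from the transport layer is what makes the cross-model comparisons in the study cheap to reproduce in your own stack.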
Original Abstract
How reliably can structured intent representations preserve user goals across different AI models, languages, and prompting frameworks? Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese. This paper extends that line of inquiry in three directions: cross-model robustness across Claude, GPT-4o, and Gemini 2.5 Pro; controlled comparison with CO-STAR and RISEN; and a user study (N=50) of AI-assisted intent expansion in ecologically valid settings. Across 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks), evaluated by an independent judge (DeepSeek-V3), we find that structured prompting substantially reduces cross-language score variance relative to unstructured baselines. The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020. We also observe a weak-model compensation pattern: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217). Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient. In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent and increase user satisfaction from 3.16 to 4.04. These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction.