Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect
TL;DR Highlight
Writing prompts in the 5W3H structure elevates even weaker models to the level of stronger ones, and delivers consistent results regardless of language.
Who Should Read
Developers who build services on multiple models such as Claude, GPT-4o, and Gemini and struggle with inconsistent prompt quality — especially teams that need to maintain uniform AI output quality across multilingual environments.
Core Mechanics
- All three structured prompt frameworks (5W3H, CO-STAR, RISEN) produce similar performance (4.93–4.98/5) — 'structuring itself' is the key, not which framework you use
- Gemini (weaker model) improved by +1.006 points with structured prompts, while Claude (stronger model) improved by +0.217 points — the effect of structured prompts is 4.6x greater for weaker models
- Structured prompts reduce cross-language performance variance by up to 24x (σ: 0.470 → 0.020) — output quality remains nearly identical across Korean, English, and Japanese
- Raw JSON format (structured but not natural language) actually performs worst (4.141) — structure alone isn't enough; it must be human-readable
- In Japanese, GPT-4o with all eight 5W3H dimensions filled in scored below its unstructured baseline — an 'encoding overhead' effect: full structuring can backfire when it exceeds the model's processing capacity
- User study with 50 participants: using 5W3H AI-expanded prompts reduced the number of conversation turns with AI by 60% (4.05 → 1.62 rounds), and satisfaction increased from 3.16 → 4.04
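The raw-JSON finding above is easy to see by rendering the same intent two ways. A minimal sketch (the exact JSON wording used in the study's condition may differ; the field names and values here are illustrative):

```python
import json

# The same 5W3H intent, expressed two ways.
intent = {
    "what": "Summarize this quarterly report",
    "who": "Executives with 5 minutes to read",
    "how_much": "Under 300 words",
}

# Like the study's raw-JSON condition: structured, but not natural language.
raw_json_prompt = json.dumps(intent, indent=2)

# Human-readable structured rendering (the style that scored highest):
# labeled sections separated by blank lines.
readable_prompt = "\n\n".join(
    f"[{key.replace('_', '-').title()}]\n{value}" for key, value in intent.items()
)

print(raw_json_prompt)
print(readable_prompt)
```

Both carry identical information; only the readable rendering gives the model natural-language section labels to anchor on.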
Evidence
- 3,240 model outputs (3 models × 3 languages × 6 conditions × 3 domains × 20 tasks), each evaluated independently by DeepSeek-V3
- Cross-language performance standard deviation: unstructured condition A σ=0.470 → CO-STAR condition E σ=0.020, RISEN condition F σ=0.019 (a 24x reduction)
- Gemini D-A score difference +1.006 vs Claude +0.217 (Kruskal-Wallis H=68.96, p<0.001)
- User study (N=50): interaction rounds 4.05 → 1.62 (60% reduction); satisfaction 3.16 → 4.04; 82% of participants needed to modify at most 2 of the 8 dimensions
How to Apply
- If your current prompt is a single sentence, just fill in whichever of What/Why/Who/When/Where/How-to-do/How-much/How-feel apply — you don't need all eight; only What is required
- If you're using a weaker model (a small model or low-cost API), switching to structured prompts can bring performance close to that of stronger models — Gemini unstructured (3.956) vs structured (4.961) is on par with Claude structured (4.994)
- If you're managing separate prompts for each language in a multilingual service, you can simplify by standardizing on one structured format and changing only the language — performance variance across languages effectively disappears
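For the multilingual case, one low-effort approach is to keep the structure and field order fixed and localize only the section labels. A sketch assuming illustrative label translations (the study's exact label wording per language is not specified here; `LABELS` and `render` are hypothetical names):

```python
# Section labels per language; the structure itself stays identical.
# Only two of the eight 5W3H dimensions are shown for brevity.
LABELS = {
    "en": {"what": "What - Task Goal", "how_much": "How-much - Quantitative Requirements"},
    "ko": {"what": "What - 작업 목표", "how_much": "How-much - 분량 요건"},
    "ja": {"what": "What - タスク目標", "how_much": "How-much - 分量要件"},
}

def render(intent: dict, lang: str = "en") -> str:
    """Render a 5W3H intent dict with section labels localized for `lang`."""
    labels = LABELS[lang]
    return "\n\n".join(
        f"[{labels[key]}]\n{value}" for key, value in intent.items() if key in labels
    )

intent = {"what": "Write a product FAQ", "how_much": "10 questions"}
print(render(intent, "ko"))
```

The same `intent` dict then drives all language variants, so there is one prompt to maintain instead of one per language.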
Code Example
# 5W3H Structured Prompt Template (Python)
def build_5w3h_prompt(
    what: str,
    why: str = "",
    who: str = "",
    when: str = "",
    where: str = "",
    how_to_do: str = "",
    how_much: str = "",
    how_feel: str = ""
) -> str:
    """
    PPS (Prompt Protocol Specification) 5W3H structured prompt builder.
    Only 'what' is required; the rest are optional depending on task complexity.
    """
    sections = [f"[What - Task Goal]\n{what}"]
    if why:
        sections.append(f"[Why - Purpose/Reason]\n{why}")
    if who:
        sections.append(f"[Who - Target Audience]\n{who}")
    if when:
        sections.append(f"[When - Temporal Context]\n{when}")
    if where:
        sections.append(f"[Where - Environmental Context]\n{where}")
    if how_to_do:
        sections.append(f"[How-to-do - Execution Method]\n{how_to_do}")
    if how_much:
        sections.append(f"[How-much - Quantitative Requirements]\n{how_much}")
    if how_feel:
        sections.append(f"[How-feel - Tone/Style]\n{how_feel}")
    return "\n\n".join(sections)
# Usage example
prompt = build_5w3h_prompt(
    what="Write a beginner's guide to PyTorch for newcomers",
    why="To help developers learning deep learning for the first time get started with hands-on practice quickly",
    who="Backend developers who know Python basics but have no deep learning experience",
    when="Based on the latest PyTorch 2.x version as of 2025",
    how_to_do="Structure as: concept explanation → code example → hands-on practice",
    how_much="Approximately 1500 words total, including at least 3 code blocks",
    how_feel="Friendly and encouraging tone; technical terms must be explained when first introduced"
)
print(prompt)
# Pass this prompt directly to the Claude/GPT-4o/Gemini API
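Since the same prompt string goes to every provider, one way to compare models is to fan it out through a uniform interface. A sketch where each client is simply a callable taking the prompt (`fan_out` and the stub clients are illustrative; in practice each callable would wrap the real Claude, GPT-4o, or Gemini SDK call):

```python
from typing import Callable, Dict

def fan_out(prompt: str, clients: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Send one structured prompt to every registered model and collect replies."""
    return {name: call(prompt) for name, call in clients.items()}

# Stub clients for illustration only; real ones would invoke each vendor's SDK.
clients = {
    "claude": lambda p: f"(claude reply to {len(p)} chars)",
    "gpt-4o": lambda p: f"(gpt-4o reply to {len(p)} chars)",
}

replies = fan_out("[What - Task Goal]\nWrite a haiku about autumn", clients)
print(replies)
```

Keeping the prompt construction separate from the transport layer is what makes the cross-model comparisons in the study cheap to reproduce in your own stack.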
Original Abstract
How reliably can structured intent representations preserve user goals across different AI models, languages, and prompting frameworks? Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese. This paper extends that line of inquiry in three directions: cross-model robustness across Claude, GPT-4o, and Gemini 2.5 Pro; controlled comparison with CO-STAR and RISEN; and a user study (N=50) of AI-assisted intent expansion in ecologically valid settings. Across 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks), evaluated by an independent judge (DeepSeek-V3), we find that structured prompting substantially reduces cross-language score variance relative to unstructured baselines. The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020. We also observe a weak-model compensation pattern: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217). Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient. In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent and increase user satisfaction from 3.16 to 4.04. These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction.