Less Back-and-Forth: A Comparative Study of Structured Prompting | AI Paper Digest

TL;DR Highlight

Structuring a prompt as a checklist raises LLM answer quality and uses fewer tokens overall.

Who Should Read

Developers who use LLMs such as ChatGPT, Claude, or Grok at work and want to improve prompt quality, especially those automating repetitive tasks such as code generation or document summarization.

Core Mechanics

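The study compares three prompting strategies: a Raw prompt written as-is, a Checklist prompt improved with a checklist, and a Clarifying prompt in which the model asks questions first.
A Checklist prompt rewrites the original prompt so that it states three things explicitly: role/rules, context, and output format. Example: 'Write Python code that checks for a palindrome. Ignore spaces and letter case. Print the result. Runnable code only.'
The effect was largest on coding tasks. The mean score jumped from 4.67 with Raw to 7.83 with Checklist, because a vague prompt such as 'Generate code for user input' becomes far more accurate once it is given concrete constraints.
Clarifying prompts beat Raw but fall short of Checklist, and they need 1.96 turns on average, so the interaction cost is higher.
Claude scored slightly higher with Clarifying prompts, whereas Grok tended to drift away from the original question under the same condition. Responses can differ from model to model.
Checklist prompts add input tokens, but the output is more focused, so the total token count actually drops.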
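The three conditions can be sketched as plain string builders. This is a minimal illustration, not the prompts used in the paper: the function names and the exact wording of the clarifying instruction are assumptions. Only the three checklist elements and the "ask 1 to 3 questions first" behaviour come from this digest.

```python
def raw_prompt(task: str) -> str:
    """Raw condition: the task text is sent unchanged."""
    return task


def checklist_prompt(task: str, role_rules: str, context: str, answer_format: str) -> str:
    """Checklist condition: state role/rules, context, and answer format explicitly."""
    return "\n".join([
        f"Role/rules: {role_rules}",
        f"Context: {context}",
        f"Task: {task}",
        f"Answer format: {answer_format}",
    ])


def clarifying_prompt(task: str, max_questions: int = 3) -> str:
    """Clarifying condition: ask the model to question the user before answering.

    The instruction wording below is hypothetical.
    """
    return (
        f"{task}\n\n"
        f"Before answering, ask me 1 to {max_questions} clarifying questions, "
        "then wait for my reply and give the final answer."
    )


task = "Generate code for user input."
print(raw_prompt(task))
print(checklist_prompt(task, "Use Python only.", "Palindrome check for a beginner.", "Runnable code only."))
print(clarifying_prompt(task))
```

Only the Clarifying builder forces a second turn, which is where its extra interaction cost comes from.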

Evidence

Mean rubric score: Checklist 7.50/8, Raw 5.67/8, Clarifying 6.67/8. Checklist scores about 32% higher than Raw.
Mean token usage: Raw 962.25, Checklist 683.42, Clarifying 936.50. Checklist uses about 29% fewer tokens than Raw.
Mean number of turns: Raw 1.00, Checklist 1.00, Clarifying 1.96. Clarifying incurs an extra round trip almost every time.
Coding task score: Raw 4.67, Checklist 7.83, Clarifying 5.50. Checklist improves on Raw by 3.16 points, the largest gain of any task type.
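The percentages above follow directly from the reported means. The short check below recomputes them; the variable and function names are illustrative, and the numbers are copied from the figures in this section.

```python
# Mean values as reported in this digest.
scores = {"raw": 5.67, "checklist": 7.50, "clarifying": 6.67}
tokens = {"raw": 962.25, "checklist": 683.42, "clarifying": 936.50}
coding = {"raw": 4.67, "checklist": 7.83, "clarifying": 5.50}


def relative_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


score_gain = relative_change(scores["checklist"], scores["raw"])
token_saving = -relative_change(tokens["checklist"], tokens["raw"])
coding_gain = coding["checklist"] - coding["raw"]

print(f"Score gain over Raw: {score_gain:.1f}%")
print(f"Token saving over Raw: {token_saving:.1f}%")
print(f"Coding gain over Raw: {coding_gain:.2f} points")
```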

How to Apply

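Rewrite the prompts you use today in three parts: role/rules, then context, then output format. For a code generation request, for example, state the language, the constraints, and the form of the output.
If you have prompt templates that you reuse, fix them in Checklist form. A prompt written well from the start costs fewer tokens and gives more consistent quality than a strategy of having the model ask follow-up questions.
The Clarifying approach, where the model asks questions first, is useful when the user has no idea what context to supply, but it is best avoided in automated pipelines because it introduces round trips.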

Code Example

# Checklist prompt template example

## Raw version (bad example)
raw_prompt = "Generate code for user input."

## Checklist-improved version (good example)
checklist_prompt = """
Write Python code that prompts the user for a string.
Check whether the string is a palindrome.
Ignore spaces and letter case when checking.
Print a clear result for the user.
Write clean, runnable code only.
"""

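# Always check the three checklist elements:
# 1. Roles/Rules: the model's role or constraints (e.g. "Use Python only", "under 100 words")
# 2. Context: audience and purpose of the output (e.g. "for students without a CS background", "at first-year graduate level")
# 3. Answer Format: structure of the output (e.g. "one paragraph, no bullets", "day-by-day itinerary")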

def build_checklist_prompt(task, role_rules, context, answer_format):
    return f"""{role_rules}

Context: {context}

Task: {task}

Answer format: {answer_format}"""

example = build_checklist_prompt(
    task="Summarize the following abstract",
    role_rules="You are a technical writer summarizing for non-experts.",
    context="The reader is a smart CS student with no ML background.",
    answer_format="One short paragraph, under 100 words, no bullet points."
)
print(example)

Terminology

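Structured Prompting: Writing the question sent to an LLM in a systematic format. Instead of just 'summarize this', the prompt states who the answer is for, what format it should take, and how long it should be.

turns-to-acceptance: The number of messages exchanged before the desired answer is obtained. One turn means the first question was enough; two turns means one more exchange was needed.

rubric score: A score from a grading rubric. Here, four criteria (task completion, correctness, compliance with requirements, and clarity) are each scored 0 to 2 and summed to a total of 0 to 8.

Clarifying-Question Prompt: An approach in which the model does not answer right away but first asks 1 to 3 questions to fill in missing information, then gives its final answer. The model leads the conversation, much as a person would by asking first.

token: The smallest unit in which an LLM processes text, roughly 0.75 English words. More tokens mean higher API cost and longer response time.

instruction-tuned model: An LLM given additional training to follow instructions rather than only predict text. Models such as ChatGPT and Claude fall into this category, and they respond better to well-written prompts.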
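The rubric score defined in the terminology above (four criteria, each scored 0 to 2, summed to a total of 0 to 8) can be sketched as a small data class. The class and field names are hypothetical. The digest gives only the four criteria and the score range, not how the authors implemented the scoring.

```python
from dataclasses import dataclass


@dataclass
class RubricScore:
    """One graded output; each criterion is an integer from 0 to 2."""
    task_completion: int
    correctness: int
    compliance: int
    clarity: int

    def total(self) -> int:
        """Sum the four criteria into a 0-8 score, rejecting out-of-range values."""
        parts = (self.task_completion, self.correctness, self.compliance, self.clarity)
        for value in parts:
            if not 0 <= value <= 2:
                raise ValueError(f"criterion score {value} is outside 0..2")
        return sum(parts)


# Hypothetical grades for a single response.
graded = RubricScore(task_completion=2, correctness=2, compliance=1, clarity=2)
print(graded.total())
```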

Original Abstract

Large language models (LLMs) are widely used for open-ended tasks, but underspecified prompts can lead to low-quality answers and additional interaction. This paper studies whether structured prompt design improves response quality while reducing user effort. We compare three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt. We evaluate these conditions across four task types--summarization, planning, explanation, and coding--using three LLM systems: ChatGPT, Claude, and Grok. Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity. Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts. Checklist prompts also produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts. These results suggest that a simple prompt checklist can improve LLM responses while reducing unnecessary interaction.