LLM Prompt Optimization: A Causal Inference Approach

Optimizing Prompts for Large Language Models: A Causal Approach

Feb 2, 2026 • Wei Chen, Yanbin Fang, Shuran Fu +2

TL;DR Highlight

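An automatic prompt optimization framework that causally separates the effect of the prompt from the difficulty of the query, and whose advantage grows as queries get harder.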

Who Should Read

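Backend and ML engineers whose LLM-based services see prompt quality swing from query to query. Especially useful for teams running tasks with wide difficulty variance, such as math problem solving, code generation, and data analysis.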

Core Mechanics

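The core problem with existing APO (Automatic Prompt Optimization): the reward model cannot tell whether performance is high because the prompt is good or because the query was easy to begin with, which produces spurious correlation.
CPO uses DML (Double Machine Learning, a machine-learning technique for causal inference) to remove the influence of query characteristics and build a reward model that estimates only the causal effect of the prompt itself.
Prompts and queries are embedded with nomic-embed-text-v1.5, reduced in dimension with PCA, and the causal estimation is then carried out in the resulting continuous vector space.
Two-stage structure: Stage 1 (offline causal reward learning) + Stage 2 (tree-search prompt optimization guided by the reward model).
Particularly strong relative to existing methods on hard queries (MATH Level 5: 82%, DABench Hard: 50%), where those methods degrade sharply.
The causal model keeps improving as data accumulates, and per-query prompt optimization takes only 7 LLM calls, so operating cost stays low.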
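
To make the confounding problem concrete, here is a minimal sketch of the partialling-out step at the core of DML, run on synthetic data. It is not the paper's implementation: the variable names, the scalar "prompt feature", the data-generating numbers, and the linear nuisance models (standing in for gradient-boosted ones) are all assumptions chosen for illustration.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y on [1, x]; returns a predict function."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda x_new: np.column_stack([np.ones(len(x_new)), x_new]) @ coef

def partial_out_effect(x, t, y, n_folds=2, seed=0):
    """Cross-fitted residual-on-residual estimate of the effect of t on y."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    t_res = np.empty(len(y))
    y_res = np.empty(len(y))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        # Nuisance models are fit on the other folds, then applied to the held-out fold
        t_res[held_out] = t[held_out] - fit_linear(x[train], t[train])(x[held_out])
        y_res[held_out] = y[held_out] - fit_linear(x[train], y[train])(x[held_out])
    # Regress outcome residuals on treatment residuals
    return float(t_res @ y_res / (t_res @ t_res))

# Synthetic logs (assumed): harder queries received more elaborate prompts
rng = np.random.default_rng(42)
n = 5000
difficulty = rng.normal(size=n)                          # confounder
prompt_feature = 0.8 * difficulty + rng.normal(size=n)   # treatment
score = 0.5 * prompt_feature - 2.0 * difficulty + rng.normal(size=n)  # true effect = 0.5

naive_slope = float(np.polyfit(prompt_feature, score, 1)[0])
causal_slope = partial_out_effect(difficulty, prompt_feature, score)
print(f"naive: {naive_slope:.2f}, partialled-out: {causal_slope:.2f}")
```

On this synthetic data the naive slope should come out negative even though the true effect is +0.5, because difficulty drives both the prompt feature and the score. Removing the part of each that difficulty predicts should bring the estimate back near 0.5.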

Evidence

Overall MATH accuracy: CPO 90.00% vs. the strongest competitor, PromptBreeder, at 88.67%. On Level 5 (the hardest tier): 82% vs. 79-80%.
Overall VisEval accuracy: CPO 54.75% vs. APE/OPRO at 53.25%. On Extra Hard: 34% vs. 26-36%.
DABench Hard: CPO 50% vs. the best competitor at 42% (PromptAgent); the other methods land at 25-39%.
Kendall's tau-b of the causal reward model (how well it predicts prompt rankings): on MATH, non-causal ML 0.0441 → CPO 0.0608 (+38%); on VisEval, 0.0980 → 0.1283 (+31%).
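
For readers unfamiliar with the ranking metric, the following sketch computes Kendall's tau-b from its pairwise definition and re-derives the relative gains quoted above. The four-prompt score lists are hypothetical and only illustrate the calculation; the function names are likewise assumptions.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long score lists (handles ties)."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    # Pairs tied in x (or in y) are excluded from the matching normalizer
    return (concordant - discordant) / sqrt((n_pairs - tied_x) * (n_pairs - tied_y))

def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline

# Hypothetical example: reward-model scores vs. measured accuracy for four prompts
predicted = [0.9, 0.7, 0.4, 0.1]
observed = [0.80, 0.85, 0.60, 0.30]
print(f"tau-b: {kendall_tau_b(predicted, observed):.3f}")

# Relative gains from the reported tau-b values
print(f"MATH: {relative_gain(0.0441, 0.0608):+.0%}")
print(f"VisEval: {relative_gain(0.0980, 0.1283):+.0%}")
```

In the toy lists, five of the six prompt pairs are ordered the same way in both rankings and one is flipped, giving (5 - 1) / 6.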

How to Apply

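If your service has accumulated tens of thousands of (query, prompt, score) log records, you can train a causal reward model on them with DML and reuse it; after that, the best prompt for each new query is found with 7 LLM calls.
For tasks with high difficulty variance, such as math problem solving, code generation, and data analysis, if you are wondering why prompt performance is erratic on hard cases, try embedding the queries, applying PCA, and then using DML to remove the effect of difficulty.
Teams using a static prompt (the same prompt for every query) can use CPO's Stage 2 tree search (generate B=5 variants, keep K=3, run R=3 rounds) as a reference when considering a move to a per-query dynamic prompt selection pipeline.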
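
The 7-call figure is consistent with B=5, K=3, R=3 if each surviving prompt costs one LLM call that returns all B variants. That accounting is an assumption made here, not something the summary states. A short sketch of the budget under that assumption (the function name is hypothetical):

```python
def tree_search_budget(B=5, K=3, R=3):
    """Count LLM generation calls and reward-model scorings for the search.

    Assumes one LLM call per surviving prompt, each returning B variants.
    """
    surviving = 1   # round 1 starts from the single seed prompt
    llm_calls = 0
    scored = 0
    for _ in range(R):
        llm_calls += surviving        # one generation call per surviving prompt
        candidates = surviving * B    # B variants per surviving prompt
        scored += candidates          # all candidates are scored offline
        surviving = min(K, candidates)
    return llm_calls, scored

calls, scored = tree_search_budget()
print(f"LLM calls: {calls}, candidates scored offline: {scored}")
```

Under this accounting the rounds cost 1 + 3 + 3 generation calls, while all 35 candidates are ranked by the causal reward model without running the task model.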

Code Example

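# Sketch of CPO Stage 1 (causal reward model) and Stage 2 (tree search) in Python
# Models: Qwen2.5-14B (used for both the task and prompt generation), nomic-embed-text-v1.5 (embeddings)
# Assumed to exist, not defined here: queries, prompts, scores (logged data, one row per observation)
# and llm_generate_variants(prompt, n), a hypothetical helper that asks the LLM for n rewrites.

import numpy as np
from econml.dml import CausalForestDML
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# Stage 1: train the causal reward model
# This model ships custom code, so it is loaded with trust_remote_code=True
embedder = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)

# Embed queries and prompts, then reduce dimensionality with PCA.
# The fitted PCA objects are kept because Stage 2 must reuse the same projections.
X_emb = embedder.encode(queries)        # query embeddings
T_emb = embedder.encode(prompts)        # prompt embeddings
pca_query = PCA(n_components=40).fit(X_emb)    # queries: 40 dimensions
pca_prompt = PCA(n_components=15).fit(T_emb)   # prompts: 15 dimensions
X_pca = pca_query.transform(X_emb)
T_pca = pca_prompt.transform(T_emb)

# Estimate the causal effect with DML (removes confounding from query characteristics)
model = CausalForestDML(
    model_y=GradientBoostingClassifier(n_estimators=100, max_depth=3),
    model_t=MultiOutputRegressor(GradientBoostingRegressor(n_estimators=100, max_depth=3)),
    discrete_outcome=True,  # assumption: scores are binary 0/1 and the installed econml supports this flag
    cv=5,  # cross-fitting
)
model.fit(Y=scores, T=T_pca, X=X_pca)

# Stage 2: search for a prompt for a new query
def optimize_prompt_for_query(query, seed_prompt, causal_model, B=5, K=3, R=3):
    # The query and the baseline are projected once; they do not change across rounds.
    # Assumption: the seed prompt serves as the baseline treatment T0.
    query_emb = pca_query.transform(embedder.encode([query]))
    baseline_emb = pca_prompt.transform(embedder.encode([seed_prompt]))
    surviving = [seed_prompt]
    for _ in range(R):
        candidates = []
        for prompt in surviving:
            # Generate B variants with the LLM
            candidates.extend(llm_generate_variants(prompt, n=B))

        # Score with the causal reward model (no LLM calls)
        cand_embs = pca_prompt.transform(embedder.encode(candidates))
        n = len(candidates)
        causal_scores = np.ravel(causal_model.effect(
            np.repeat(query_emb, n, axis=0),        # effect() expects one X row per candidate
            T0=np.repeat(baseline_emb, n, axis=0),
            T1=cand_embs,
        ))

        # Keep the top K candidates, best first
        top = np.argsort(causal_scores)[::-1][:K]
        surviving = [candidates[i] for i in top]

    return surviving[0]  # best prompt from the final round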

Terminology

APO: Automatic Prompt Optimization. A methodology in which an algorithm finds better prompts automatically instead of a person hand-tuning them.

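DML: Double Machine Learning. A technique that uses machine learning to remove the influence of confounders when estimating a causal relationship. It is similar to first removing the effect of a patient's underlying health when measuring the effect of a drug.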

CATE: Conditional Average Treatment Effect. A number expressing how much performance improves on a given query when prompt A is used instead of prompt B. It is computed separately for each query.

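confounder: A variable that influences both the cause and the outcome and so distorts the apparent relationship between them. In this paper, query difficulty is the confounder: if complex prompts are used on hard queries, it is impossible to tell whether low performance is the prompt's fault or the problem's.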

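PCA: Principal Component Analysis. A technique that compresses high-dimensional data into fewer dimensions while keeping the most important information. Here the 768-dimensional embeddings are reduced to 15-40 dimensions for causal estimation.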

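Kendall's tau-b: A statistic (from -1 to +1) that measures how well two rankings agree. In this paper it evaluates how closely the prompt ranking predicted by the causal model matches the ranking based on actual performance.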

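spurious correlation: A correlation that appears in the data even though there is no causal relationship. Example: ice cream sales and drowning incidents are correlated (both peak in summer), but ice cream does not cause drowning.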

Original Abstract

Large Language Models (LLMs) are increasingly embedded in enterprise workflows, yet their performance remains highly sensitive to prompt design. Automatic Prompt Optimization (APO) seeks to mitigate this instability, but existing approaches face two persistent challenges. First, commonly used prompt strategies rely on static instructions that perform well on average but fail to adapt to heterogeneous queries. Second, more dynamic approaches depend on offline reward models that are fundamentally correlational, confounding prompt effectiveness with query characteristics. We propose Causal Prompt Optimization (CPO), a framework that reframes prompt design as a problem of causal estimation. CPO operates in two stages. First, it learns an offline causal reward model by applying Double Machine Learning (DML) to semantic embeddings of prompts and queries, isolating the causal effect of prompt variations from confounding query attributes. Second, it utilizes this unbiased reward signal to guide a resource-efficient search for query-specific prompts without relying on costly online evaluation. We evaluate CPO across benchmarks in mathematical reasoning, visualization, and data analytics. CPO consistently outperforms human-engineered prompts and state-of-the-art automated optimizers. The gains are driven primarily by improved robustness on hard queries, where existing methods tend to deteriorate. Beyond performance, CPO fundamentally reshapes the economics of prompt optimization: by shifting evaluation from real-time model execution to an offline causal model, it enables high-precision, per-query customization at a fraction of the inference cost required by online methods. Together, these results establish causal inference as a scalable foundation for reliable and cost-efficient prompt optimization in enterprise LLM deployments.