UtilityMax Prompting: A Mathematical Framework for Multi-Objective LLM Optimization

UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization

Mar 12, 2026 • Ofir Marom

TL;DR Highlight

Defining prompt objectives as formulas instead of natural language lets an LLM optimize several conditions at once with greater accuracy.

Who Should Read

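Backend and ML engineers designing LLM tasks that must satisfy several conditions at once, such as recommendation systems, filtering, or trading agents. Anyone whose results stay poor no matter how much they polish the prompt.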

Core Mechanics

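Natural-language phrases such as 'moderate risk' or 'only comedy and romance movies' are fundamentally ambiguous, and each model interprets them differently.
When the objective is defined as a formula such as O(a) = E[X1|A=a] × E[X2|A=a], the LLM estimates each term separately, multiplies the estimates, and selects the answer with the highest product.
Consistent improvements over natural-language prompts were observed on all three models: Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro.
The more forcefully worded 'Harsh' prompt scored lower than the 'Basic' prompt on both metrics with GPT-5.4; stronger wording does not resolve the ambiguity.
Works zero-shot, with no additional training data, labels, or examples, and needs no separate scoring function.
When conditions depend on each other hierarchically (for example, the rating only matters if the genre check passes), the framework can be extended with a binary gating structure.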
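
The multiply-then-select step described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: in the framework the LLM performs these estimates itself inside the prompt. The movie names, the numbers, and the helper names `objective` and `top_k` are all hypothetical.

```python
from math import prod

# Hypothetical per-candidate estimates, in the form an LLM might report them:
# an expected rating on a 1-5 scale and two genre probabilities.
estimates = {
    "Movie A": {"rating": 4.5, "p_comedy": 0.9, "p_romance": 0.8},
    "Movie B": {"rating": 4.8, "p_comedy": 0.1, "p_romance": 0.9},
    "Movie C": {"rating": 3.9, "p_comedy": 0.95, "p_romance": 0.95},
}

def objective(components: dict[str, float]) -> float:
    """O(a): the product of the individually estimated components."""
    return prod(components.values())

def top_k(estimates: dict[str, dict[str, float]], k: int) -> list[str]:
    """Rank candidates by O(a), highest first, and keep the first k."""
    ranked = sorted(estimates, key=lambda a: objective(estimates[a]), reverse=True)
    return ranked[:k]

print(top_k(estimates, 2))  # → ['Movie C', 'Movie A']
```

Movie B has the highest expected rating but ranks last, because its low comedy probability pulls the whole product down.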

Evidence

Claude Sonnet 4.6: Precision@10 +12.7%, NDCG@10 +16.5% (vs. the Basic prompt)
Statistically significant improvements on all three models (Wilcoxon signed-rank test, p < 0.01)
On GPT-5.4 the Harsh prompt scores below Basic: Precision 0.518 vs 0.532, NDCG 0.712 vs 0.739
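
For readers unfamiliar with the two metrics reported above, here is a minimal sketch of how Precision@k and NDCG@k are commonly computed. It assumes a linear-gain DCG with a log2 rank discount. The paper's exact relevance definition and gain function are not given in this summary, so treat the function names and the toy data as illustrative.

```python
from math import log2

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    top = recommended[:k]
    return sum(1 for item in top if item in relevant) / k

def ndcg_at_k(recommended, gains, k=10):
    """DCG of the top-k list divided by the DCG of the ideal ordering."""
    def dcg(values):
        # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), ...
        return sum(g / log2(rank + 2) for rank, g in enumerate(values))

    actual = dcg([gains.get(item, 0.0) for item in recommended[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Toy data (hypothetical): two relevant items, ranked 1st and 3rd.
recommended = ["a", "b", "c", "d"]
gains = {"a": 1.0, "c": 1.0}
print(precision_at_k(recommended, set(gains), k=4))  # → 0.5
print(ndcg_at_k(recommended, gains, k=4))
```

Precision only counts hits, while NDCG also penalizes the list for placing the second hit in third position rather than second.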

How to Apply

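For a multi-condition task, write the objective in the form O(a) = E[X1|A=a] × E[X2|A=a], then follow the template below to instruct the model to (1) generate candidates, (2) estimate the expected value of each variable separately, and (3) select the maximum.
When conditions are order-dependent (for example, a score only matters for items in a particular category), gate them with binary variables: when the parent node is 0, the child's probability also goes to 0.
Variable selection is key: too many variables raise the computational cost, while too few leave you optimizing a proxy. Extract only the task's core conditions as variables.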
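
One way to read the binary gating idea is as a chain of multiplications: the leaf expectation counts only to the extent that every gate above it is likely to be open. The sketch below is an interpretation under that assumption, not the paper's exact formulation. `gated_objective` and the example numbers are hypothetical.

```python
def gated_objective(gate_probs, leaf_expectation):
    """Multiply a leaf expectation by every gate probability above it.

    gate_probs: P(G_i = 1 | A=a) for each binary gate, parent first.
    leaf_expectation: the leaf's expected value given that all gates are 1,
        for example a predicted rating.
    """
    value = leaf_expectation
    for p in gate_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"gate probability out of range: {p}")
        value *= p
    return value

# A highly rated movie that almost certainly fails the genre gate
# ends up below a decent movie that almost certainly passes it.
wrong_genre = gated_objective([0.02], 4.8)
right_genre = gated_objective([0.95], 4.1)
print(wrong_genre < right_genre)  # → True
```

A gate probability of exactly 0 zeroes the objective regardless of the leaf value, which matches the parent-0-implies-child-0 behavior described above.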

Code Example

# UtilityMax prompt template (movie recommendation example)

prompt = """
I want you to solve the following task: Recommend the top 10 movies for this user.

Formally, let K represent your knowledge. This includes all your internal knowledge
stored through your parameters as well as any external knowledge provided in this prompt.

Let P(A | K) represent your probability distribution over answers given K. Let a be an
answer in A.

Let S | A=a be a random variable representing the predicted user rating score (1-5)
for movie a given the user's watch history.

Let G1 | A=a be a binary random variable representing whether movie a belongs to the
comedy genre.

Let G2 | A=a be a binary random variable representing whether movie a belongs to the
romance genre.

Your task is to find the optimal answer a* that maximises:
O(a) = E[S | A=a] x P(G1=1 | A=a) x P(G2=1 | A=a)

To do this you must:
1. Generate a set of candidate movies.
2. For each candidate, estimate E[S | A=a], P(G1=1 | A=a), and P(G2=1 | A=a)
   individually using your internal knowledge, then compute O(a).
3. Return the top 10 movies a* that maximise O.

User's watch history: {watch_history}
"""

Terminology

Influence Diagram: A flow diagram that links decisions to their outcomes with arrows. It visualizes questions such as 'if I recommend movie A, how likely is the user to enjoy it?'

DAG: Directed Acyclic Graph, a graph whose edges have a direction and which contains no cycles. It flows one way, as in A → B → C, and never returns to A.

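NDCG: Normalized Discounted Cumulative Gain, a quality metric for ranked recommendations. The score is higher when good results appear near the top, and a good result contributes less the lower it is ranked.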

Zero-shot: Running an LLM from instructions alone, with no examples and no task-specific training. Like solving an exam paper the first time you see it.

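Utility Function: A formula that combines several objectives into a single score, like an overall 'combat power' rating in an RPG that rolls attack, defense, and speed into one number.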

Precision@10: The fraction of the top 10 recommended items that were actually good. If 5 of the 10 are hits, the score is 0.5.

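Wilcoxon signed-rank test: A statistical test for checking whether the performance difference between two methods on paired results is unlikely to be due to chance. p < 0.01 means that, if there were no real difference, a result at least this extreme would occur less than 1% of the time.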
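
To make the Wilcoxon signed-rank test concrete, here is a small exact implementation for short paired samples. It assumes there are no zero differences and no tied absolute differences, and it enumerates all sign patterns, so it only suits small n. The per-user scores are made up, and the summary does not describe how the paper paired its samples. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead.

```python
from itertools import product

def wilcoxon_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value), where statistic = min(W+, W-).
    Assumes no zero differences and no tied absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every rank is + or - with probability 1/2,
    # so count the sign patterns that are at least as extreme as observed.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return observed, extreme / 2 ** len(ranks)

# Hypothetical per-user scores for two prompts on the same 8 users.
basic_scores = [0.60, 0.50, 0.63, 0.47, 0.62, 0.53, 0.66, 0.49]
utilitymax_scores = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.52]
stat, p = wilcoxon_exact(utilitymax_scores, basic_scores)
print(stat, p)  # → 0 0.0078125
```

With all 8 differences pointing the same way, only 2 of the 256 sign patterns are as extreme, which gives p = 2/256 and falls below the 0.01 threshold.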

Original Abstract

The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.