UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization
TL;DR Highlight
Defining prompt objectives as mathematical formulas instead of natural language lets LLMs optimize multiple conditions simultaneously with higher precision.
Who Should Read
Backend/ML engineers designing LLM tasks that must satisfy multiple conditions at once, such as recommendation systems, filtering, or trading agents. Also anyone frustrated that no amount of prompt tweaking delivers the performance they need.
Core Mechanics
- Natural language prompts have fundamental ambiguity — expressions like "moderate risk" or "only comedy and romance movies" are interpreted differently by each model
- Defining objectives as formulas like O(a) = E[X1|A=a] × E[X2|A=a] lets the LLM estimate each condition independently, then multiply the estimates to select the optimal answer
- Consistent performance improvement over natural language-based prompts confirmed on all three models: Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro
- "Harsh prompts" with stronger language actually score lower than Basic on both metrics in GPT-5.4 — increasing natural language intensity doesn't resolve ambiguity
- Works zero-shot without additional training data, labels, or examples; no separate scoring function needed
- When conditions are hierarchically dependent (e.g., genre must pass before rating is meaningful), can be extended with a binary gating structure
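The multiplicative objective described above can be sketched in a few lines. The candidate movies and their estimates below are made up purely for illustration; in the actual framework the LLM produces these estimates internally:

```python
# Toy illustration of O(a) = E[X1|A=a] x E[X2|A=a] x ... : the product
# rewards candidates that satisfy every condition at once and zeroes out
# any candidate that clearly fails one.
candidates = {
    # movie: (E[rating], P(comedy), P(romance)) -- illustrative values
    "Notting Hill":  (4.2, 0.9, 0.9),  # strong on all three conditions
    "The Godfather": (4.8, 0.0, 0.1),  # top rating, wrong genres
    "Scary Movie":   (3.1, 1.0, 0.0),  # comedy but not romance
}

def objective(estimates):
    expected_rating, p_comedy, p_romance = estimates
    return expected_rating * p_comedy * p_romance

best = max(candidates, key=lambda movie: objective(candidates[movie]))
print(best)  # → Notting Hill
```

Note how the highest-rated movie loses: a near-zero genre probability collapses the whole product, which is exactly the behavior a multi-condition task wants.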
Evidence
- On Claude Sonnet 4.6: Precision@10 +12.7%, NDCG@10 +16.5% (vs Basic prompt)
- Statistically significant performance improvement confirmed on all three models — Wilcoxon signed-rank test p < 0.01
- On GPT-5.4, Harsh prompt actually lower than Basic: Precision 0.518 vs 0.532, NDCG 0.712 vs 0.739
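For readers who want to run the same kind of paired test on their own per-query scores, here is a minimal pure-Python exact Wilcoxon signed-rank sketch. The two score arrays are invented stand-ins for per-query NDCG@10 values, not the paper's data:

```python
from itertools import product

def wilcoxon_one_sided_p(a, b):
    """Exact one-sided Wilcoxon signed-rank p-value for H1: a > b pairwise."""
    diffs = [round(x - y, 9) for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):  # assign mid-ranks to tied absolute differences
        j = i
        while j < len(diffs) and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2
        i = j
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Under H0 every sign pattern over the ranks is equally likely.
    hits = sum(1 for signs in product([0, 1], repeat=len(diffs))
               if sum(r for s, r in zip(signs, ranks) if s) >= w_plus)
    return hits / 2 ** len(diffs)

# Invented per-query NDCG@10 scores, purely to show the call pattern.
basic = [0.70, 0.68, 0.74, 0.71, 0.69, 0.73, 0.70, 0.72]
umax  = [0.78, 0.75, 0.80, 0.77, 0.76, 0.81, 0.79, 0.78]
print(wilcoxon_one_sided_p(umax, basic) < 0.01)  # → True
```

The exhaustive 2^n enumeration is only practical for small query sets; for real evaluation sizes, `scipy.stats.wilcoxon` computes the same test efficiently.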
How to Apply
- For multi-condition tasks, formulate the objective as O(a) = E[X1|A=a] × E[X2|A=a], then instruct the LLM with this template: (1) generate candidates → (2) independently estimate expected value for each variable → (3) select the maximum
- For order-dependent conditions (e.g., a rating only matters once the item is in the right category), use binary gating — if the parent node is 0, the child's contribution also converges to 0
- Variable selection is key: too many variables inflate compute cost, while too few leave you optimizing a proxy — extract only the core conditions of your task as variables
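The binary-gating idea above can be sketched as follows. The function name and numbers are ours, for illustration; the gate probability and conditional rating would come from the LLM's own estimates:

```python
# Binary gating sketch: the child term is conditioned on the parent gate,
# so O(a) = P(G=1 | A=a) * E[S | A=a, G=1]. If the parent gate fails
# (P(G=1) = 0), the score collapses to 0 no matter how high the rating.
def gated_objective(p_gate, expected_rating_given_gate):
    return p_gate * expected_rating_given_gate

in_genre  = gated_objective(1.0, 4.5)  # gate passes: rating counts fully
off_genre = gated_objective(0.0, 4.9)  # gate fails: rating is moot
print(in_genre > off_genre)  # → True
```

The design point: conditioning the rating estimate on the gate (E[S | A=a, G=1] rather than E[S | A=a]) keeps the hierarchy explicit — the model is never asked to rate an item that cannot pass the parent condition.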
Code Example
# UtilityMax Prompt Template (Movie Recommendation Example)
prompt = """
I want you to solve the following task: Recommend the top 10 movies for this user.
Formally, let K represent your knowledge. This includes all your internal knowledge
stored through your parameters as well as any external knowledge provided in this prompt.
Let P(A | K) represent your probability distribution over answers given K. Let a be an
answer in A.
Let S | A=a be a random variable representing the predicted user rating score (1-5)
for movie a given the user's watch history.
Let G1 | A=a be a binary random variable representing whether movie a belongs to the
comedy genre.
Let G2 | A=a be a binary random variable representing whether movie a belongs to the
romance genre.
Your task is to find the optimal answer a* that maximises:
O(a) = E[S | A=a] x P(G1=1 | A=a) x P(G2=1 | A=a)
To do this you must:
1. Generate a set of candidate movies.
2. For each candidate, estimate E[S | A=a], P(G1=1 | A=a), and P(G2=1 | A=a)
individually using your internal knowledge, then compute O(a).
3. Return the top 10 movies a* that maximise O.
User's watch history: {watch_history}
"""Terminology
Original Abstract
The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.