CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems
TL;DR Highlight
A novel uncertainty metric for multi-LLM collaboration that simultaneously measures 'how confident each model is' and 'how much the models disagree with each other'
Who Should Read
Developers building agentic systems by combining multiple LLMs. Particularly ML engineers who need to evaluate the reliability of multi-model ensembles in high-stakes domains like healthcare or legal, where incorrect answers carry serious risk.
Core Mechanics
- Existing uncertainty measures (e.g., Semantic Entropy) only capture a single model's internal confidence, failing to detect cases where multiple models are each highly confident but give conflicting answers
- CoE decomposes uncertainty into two components — 'intra-model uncertainty (UA)' and 'inter-model disagreement (UE)' — enabling diagnosis of *why* a system is uncertain
- High UA calls for prompt improvement or sampling diversification, while high UE calls for model alignment — collapsing the two into a single undifferentiated score would erase this diagnostic distinction
- Experiments with LLaMA-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, and Mistral-7B-Instruct show that CoE's advantage grows as the number of models increases (AUROC 0.683 → 0.772)
- A training-free post-processing weight adjustment heuristic based on CoE improves accuracy by up to +39.0% (using asymmetric KL divergence)
- Asymmetric KL divergence significantly outperforms symmetric methods like JS/Wasserstein/Hellinger — because it directionally measures how far each model deviates from the ensemble mean
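The directional property behind that last point can be seen in a small sketch. The distributions below are illustrative assumptions, not numbers from the paper:

```python
import numpy as np
from scipy.special import rel_entr

# Illustrative distributions over 3 semantic clusters (made up for this demo):
mean = np.array([0.55, 0.35, 0.10])   # hypothetical ensemble mean
model = np.array([0.10, 0.80, 0.10])  # a dissenting model's distribution

# KL is asymmetric: KL(model || mean) weights the log-ratios by the *model's*
# probabilities, so it scores how far this particular model strays from the
# consensus — a direction that symmetric measures (JS, Hellinger) average away.
kl_fwd = rel_entr(model, mean).sum()  # KL(model || mean)
kl_rev = rel_entr(mean, model).sum()  # KL(mean || model)
print(f"KL(model||mean) = {kl_fwd:.3f}")
print(f"KL(mean||model) = {kl_rev:.3f}")
```

The two directions give different numbers, which is exactly why the per-model direction can be singled out when diagnosing disagreement.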
Evidence
- On TriviaQA with a 3-model setup, CoE achieves AUROC 0.772, surpassing the previous best baselines UE (0.716) and Semantic Entropy (0.687)
- On SQuAD with a 3-model setup, AUROC reaches 0.878; with 6 models it remains at 0.811, up to 20% above baselines
- CoE-based adjustment heuristic yields +39.0% accuracy gain (KL), vs. JS at +27.5%, Hellinger at +23.0%, and Wasserstein at +20.5%, with asymmetric KL dominating by a wide margin
- Increasing sample count from 4 to 8 improves ensemble accuracy from 81.9% to 96.0%; ensemble accuracy remains stable at 92–95% across temperature range 0.8–1.0
How to Apply
- In a multi-LLM pipeline, sample responses from each model, cluster semantically equivalent answers using bidirectional entailment, compute CoE = UA + UE, and use the score as a filter to flag low-confidence queries for further review
- When CoE is high: if UA is high, lower each model's temperature or augment few-shot prompts; if UE is high, implement branching logic that re-weights models or inserts an additional verification step
- Because it attaches as post-processing to already-generated outputs, no model retraining is needed. It can be added as a plug-in to existing multi-LLM ensembles, making it immediately applicable for building selective prediction systems that route high-uncertainty cases to human review
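The branching logic above can be sketched as a minimal router. The threshold value and the accept/resample/verify action names are illustrative assumptions, not values from the paper:

```python
# Minimal selective-prediction router driven by CoE and its components.
# The threshold (1.0) and the action labels are illustrative assumptions.
def route(coe, ua, ue, coe_thresh=1.0):
    if coe < coe_thresh:
        return "accept"      # low total uncertainty: keep the ensemble answer
    # diagnose *why* the system is uncertain
    if ua >= ue:
        return "resample"    # intra-model: lower temperature / enrich prompts
    return "verify"          # inter-model: re-weight models or add a check

print(route(coe=0.3, ua=0.2, ue=0.1))  # accept
print(route(coe=1.4, ua=1.1, ue=0.3))  # resample
print(route(coe=1.4, ua=0.3, ue=1.1))  # verify
```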
Code Example
import numpy as np
from scipy.special import rel_entr

def collaborative_entropy(cluster_probs_list, weights=None):
    """
    cluster_probs_list: list of cluster probability distributions, one per model
        e.g.: [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
    weights: per-model weights (uniform if None)
    """
    K = len(cluster_probs_list)
    if weights is None:
        weights = [1.0 / K] * K
    probs = [np.array(p) + 1e-10 for p in cluster_probs_list]  # avoid zeros
    probs = [p / p.sum() for p in probs]
    # ensemble mean distribution
    ensemble_mean = sum(w * p for w, p in zip(weights, probs))

    # UA: average semantic entropy across models
    def shannon_entropy(p):
        return -np.sum(p * np.log(p + 1e-10))
    UA = np.mean([shannon_entropy(p) for p in probs])
    # UE: weighted sum of KL divergences from each model's distribution
    # to the ensemble mean
    UE = sum(w * np.sum(rel_entr(p, ensemble_mean))
             for w, p in zip(weights, probs))
    CoE = UA + UE
    return {"CoE": CoE, "UA": UA, "UE": UE}
# Example: probability distributions over 3 semantic clusters for 3 models
model_outputs = [
    [0.8, 0.1, 0.1],  # LLaMA: confident in cluster 0
    [0.1, 0.8, 0.1],  # Qwen: confident in cluster 1 (inter-model disagreement!)
    [0.7, 0.2, 0.1],  # Mistral: confident in cluster 0
]
result = collaborative_entropy(model_outputs)
print(f"CoE: {result['CoE']:.4f}")
print(f"UA (intra-model): {result['UA']:.4f}")
print(f"UE (inter-model): {result['UE']:.4f}")
# Low UA and high UE -> each model is confident, but they disagree with each other
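The CoE-guided weight-adjustment heuristic mentioned above can be sketched in the same spirit. The exponential down-weighting rule and the `beta` parameter here are assumptions for illustration — the paper's exact update rule may differ — but the core idea is the same: penalize each model by its asymmetric KL deviation from the ensemble mean:

```python
import numpy as np
from scipy.special import rel_entr

# Hedged sketch of a training-free, CoE-guided re-weighting step.
# `beta` and the exponential form are illustrative assumptions.
def kl_reweight(cluster_probs_list, beta=1.0):
    probs = [np.asarray(p, dtype=float) for p in cluster_probs_list]
    mean = np.mean(probs, axis=0)
    # directional deviation of each model from the ensemble mean
    devs = np.array([rel_entr(p, mean).sum() for p in probs])
    weights = np.exp(-beta * devs)  # larger deviation -> smaller weight
    return weights / weights.sum()

w = kl_reweight([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]])
print(w)  # the dissenting second model receives the smallest weight
```

With the example distributions, the second (dissenting) model sits farthest from the mean in the KL sense and is therefore down-weighted the most, which is the directional behavior that symmetric divergences cannot attribute to a single model.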
Original Abstract
Uncertainty estimation in multi-LLM systems remains largely single-model-centric: existing methods quantify uncertainty within each model but do not adequately capture semantic disagreement across models. To address this gap, we propose Collaborative Entropy (CoE), a unified information-theoretic metric for semantic uncertainty in multi-LLM collaboration. CoE is defined on a shared semantic cluster space and combines two components: intra-model semantic entropy and inter-model divergence to the ensemble mean. CoE is not a weighted ensemble predictor; it is a system-level uncertainty measure that characterizes collaborative confidence and disagreement. We analyze several core properties of CoE, including non-negativity, zero-value certainty under perfect semantic consensus, and the behavior of CoE when individual models collapse to delta distributions. These results clarify when reducing per-model uncertainty is sufficient and when residual inter-model disagreement remains. We also present a simple CoE-guided, training-free post-hoc coordination heuristic as a practical application of the metric. Experiments on TriviaQA and SQuAD with LLaMA-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, and Mistral-7B-Instruct show that CoE provides stronger uncertainty estimation than standard entropy- and divergence-based baselines, with gains becoming larger as additional heterogeneous models are introduced. Overall, CoE offers a useful uncertainty-aware perspective on multi-LLM collaboration.