Batch-of-Thought: Cross-Instance Learning for Enhanced LLM Reasoning
TL;DR Highlight
Instead of processing queries one at a time, having an LLM evaluate them together in batches can improve accuracy while reducing costs by up to 61%.
Who Should Read
AI and ML engineers who want to reduce inference costs while improving reliability in LLM-based agent pipelines, especially developers deploying LLMs for high-stakes decision tasks such as healthcare and fraud detection.
Core Mechanics
- Instead of processing queries one at a time, grouping 4–8 queries together for the Reflector agent to comparatively evaluate in a single pass → average cost reduction of 46.9%
- By referencing responses from other queries within the batch, outlier detection, common error pattern identification, and confidence re-calibration become possible
- With GPT-4o, accuracy improved by +4.7% on FraudDet and +2.9% on GPQA compared to standard Reflection
- Confidence Calibration also improved — on the SMS Spam dataset, KS statistic went from 0.360 → 0.633, and ECE from 0.104 → 0.063
- No effect or slight degradation in domains requiring precise symbolic computation like math and physics (batch consensus can reinforce incorrect directions)
- A training-free method applicable to any model including GPT-4o, Llama-3.3-70B, and Qwen3-Next-80B without additional fine-tuning
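The calibration metrics quoted above (ECE) can be reproduced with a standard equal-width binning scheme. A minimal sketch follows; the 10-bin choice is an assumption, not taken from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the per-bin
    |accuracy - mean confidence| gap, weighted by bin occupancy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```

Lower is better: a drop from 0.104 to 0.063, as reported on SMS Spam, means the model's stated confidence tracks its actual accuracy more closely.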
Evidence
- Across experiments on 6 benchmarks × 3 model families, BoT-R achieved top performance in most cases compared to ReAct and Reflection
- At batch size 8 with GPT-4o, total cost for SMS Spam dropped from $10.00 → $3.90 (61% reduction), and GPQA from $7.46 → $4.59 (38% reduction)
- With semantic batching applied, FraudDet improved from 0.693 → 0.768 (+10.8% vs. no-batch), SMS Spam from 0.854 → 0.902
- Looking at the Reflector stage alone, up to 71% cost reduction was achieved (SMS Spam, batch=8)
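The headline reductions can be sanity-checked directly from the reported dollar figures (a quick arithmetic check, not code from the paper):

```python
def pct_reduction(before, after):
    """Percentage cost reduction from reported before/after totals."""
    return round((before - after) / before * 100, 1)

# Reported GPT-4o totals at batch size 8
sms_spam = pct_reduction(10.00, 3.90)  # 61.0
gpqa = pct_reduction(7.46, 4.59)       # 38.5
```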
How to Apply
- When multiple queries from the same domain arrive, group them in sets of 4–8, include all of them in the Reflector prompt, and request: 'Compare these responses against each other and return whether each should be re-evaluated along with its confidence score.'
- In a streaming environment (real-time requests), sequential batching alone is effective; for offline batch processing, clustering with an embedding model like E5-Mistral to group semantically similar queries together can yield additional performance gains.
- Do not apply this to tasks with a single definitive correct answer, such as math calculations or code debugging — batch consensus can reinforce incorrect directions. Focus application on classification, QA, and anomaly detection tasks that require judgment and interpretation.
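For offline batching, the clustering step above can be sketched as follows. The paper uses E5-Mistral embeddings; as a self-contained stand-in, this sketch uses token-overlap (Jaccard) similarity, and `semantic_batches` is a hypothetical helper name:

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for embedding cosine similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def semantic_batches(queries, batch_size=4):
    """Greedily group each seed query with its most similar remaining peers."""
    remaining = list(queries)
    batches = []
    while remaining:
        seed = remaining.pop(0)
        # Pull the queries closest to the seed into the same batch
        remaining.sort(key=lambda q: jaccard(seed, q), reverse=True)
        batches.append([seed] + remaining[:batch_size - 1])
        remaining = remaining[batch_size - 1:]
    return batches
```

Swapping `jaccard` for real embedding similarity preserves the structure; only the distance function changes.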
Code Example
# Example BoT-R Reflector system prompt (batch size N=4)
# Literal JSON braces are doubled ({{ }}) so str.format only substitutes {N}
system_prompt = """
You are a reflection agent. Below are {N} question-answer pairs.
For each pair, compare it against all others in the batch to:
1. Identify inconsistencies or outlier reasoning
2. Extract shared domain patterns
3. Assign a peer confidence score (0.0–1.0)
4. Decide if re-evaluation is needed
Return a JSON list with one entry per question:
[
  {{
    "trigger_reevaluation": bool,
    "summary_comment": str,
    "confidence_score": float,
    "suggestions": str
  }}
]
"""
# Build the shared batch context fed to the Reflector
def build_batch_context(queries, answers, rationales):
    context = []
    for i, (q, a, r) in enumerate(zip(queries, answers, rationales)):
        context.append(f"--- Instance {i+1} ---\nQ: {q}\nA: {a}\nReasoning: {r}")
    return "\n\n".join(context)
# Usage example (q1..q4, a1..a4, r1..r4 are the batch's queries, answers, and rationales)
batch_context = build_batch_context(
    queries=[q1, q2, q3, q4],
    answers=[a1, a2, a3, a4],
    rationales=[r1, r2, r3, r4],
)
reflector_input = system_prompt.format(N=4) + "\n\n" + batch_context
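Once the model replies, the JSON list can be parsed and routed. This helper is a hypothetical sketch (not from the paper) assuming the Reflector returns exactly the format the prompt requests:

```python
import json

def route_reflections(reflector_output, batch_size):
    """Parse the Reflector's JSON list and return the indices
    of instances flagged for re-evaluation."""
    entries = json.loads(reflector_output)
    if len(entries) != batch_size:
        raise ValueError("Reflector must return one entry per instance")
    return [i for i, e in enumerate(entries) if e["trigger_reevaluation"]]
```

Only the flagged indices are sent back through the Actor, which is where the cost amortization comes from: unflagged instances finish after a single pass.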
Original Abstract
Current Large Language Model reasoning systems process queries independently, discarding valuable cross-instance signals such as shared reasoning patterns and consistency constraints. We introduce Batch-of-Thought (BoT), a training-free method that processes related queries jointly to enable cross-instance learning. By performing comparative analysis across batches, BoT identifies high-quality reasoning templates, detects errors through consistency checks, and amortizes computational costs. We instantiate BoT within a multi-agent reflection architecture (BoT-R), where a Reflector performs joint evaluation to unlock mutual information gain unavailable in isolated processing. Experiments across three model families and six benchmarks demonstrate that BoT-R consistently improves accuracy and confidence calibration while reducing inference costs by up to 61%. Our theoretical and experimental analysis reveals when and why batch-aware reasoning benefits LLM systems. Our code is available at https://github.com/xuanyang19/BoT