Batch-of-Thought: Cross-Instance Learning for Enhanced LLM Reasoning
TL;DR Highlight
Instead of processing queries one at a time, having an LLM evaluate them together in batches can improve accuracy while reducing costs by up to 61%.
Who Should Read
AI and ML engineers who want to reduce inference costs while improving reliability in LLM-based agent pipelines, especially developers deploying LLMs for high-stakes decision tasks such as healthcare and fraud detection.
Core Mechanics
- Instead of processing queries one at a time, grouping 4–8 queries together for the Reflector agent to comparatively evaluate in a single pass → average cost reduction of 46.9%
- By referencing responses from other queries within the batch, outlier detection, common error pattern identification, and confidence re-calibration become possible
- With GPT-4o, accuracy improved by +4.7% on FraudDet and +2.9% on GPQA compared to standard Reflection
- Confidence Calibration also improved — on the SMS Spam dataset, KS statistic went from 0.360 → 0.633, and ECE from 0.104 → 0.063
- No effect or slight degradation in domains requiring precise symbolic computation like math and physics (batch consensus can reinforce incorrect directions)
- A training-free method applicable to any model including GPT-4o, Llama-3.3-70B, and Qwen3-Next-80B without additional fine-tuning
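The calibration metrics quoted above (ECE) can be reproduced with a standard equal-width binning scheme. A minimal sketch follows; the 10-bin choice is an assumption, not taken from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the per-bin
    |accuracy - mean confidence| gap, weighted by bin occupancy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```

Lower is better: a drop from 0.104 to 0.063, as reported on SMS Spam, means the model's stated confidence tracks its actual accuracy more closely.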
Evidence
- Across experiments on 6 benchmarks × 3 model families, BoT-R achieved top performance in most cases compared to ReAct and Reflection
- At batch size 8 with GPT-4o, total cost for SMS Spam dropped from $10.00 → $3.90 (61% reduction), and GPQA from $7.46 → $4.59 (38% reduction)
- With semantic batching applied, FraudDet improved from 0.693 → 0.768 (+10.8% vs. no-batch), SMS Spam from 0.854 → 0.902
- Looking at the Reflector stage alone, up to 71% cost reduction was achieved (SMS Spam, batch=8)
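The headline reductions can be sanity-checked directly from the reported dollar figures (a quick arithmetic check, not code from the paper):

```python
def pct_reduction(before, after):
    """Percentage cost reduction from reported before/after totals."""
    return round((before - after) / before * 100, 1)

# Reported GPT-4o totals at batch size 8
sms_spam = pct_reduction(10.00, 3.90)  # 61.0
gpqa = pct_reduction(7.46, 4.59)       # 38.5
```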
How to Apply
- When multiple queries from the same domain arrive, group them in sets of 4–8, include all of them in the Reflector prompt, and request: 'Compare these responses against each other and return whether each should be re-evaluated along with its confidence score.'
- In a streaming environment (real-time requests), sequential batching alone is effective; for offline batch processing, clustering with an embedding model like E5-Mistral to group semantically similar queries together can yield additional performance gains.
- Do not apply this to tasks with a single definitive correct answer, such as math calculations or code debugging — batch consensus can reinforce incorrect directions. Focus application on classification, QA, and anomaly detection tasks that require judgment and interpretation.
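For offline batching, the clustering step above can be sketched as follows. The paper uses E5-Mistral embeddings; as a self-contained stand-in, this sketch uses token-overlap (Jaccard) similarity, and `semantic_batches` is a hypothetical helper name:

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for embedding cosine similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def semantic_batches(queries, batch_size=4):
    """Greedily group each seed query with its most similar remaining peers."""
    remaining = list(queries)
    batches = []
    while remaining:
        seed = remaining.pop(0)
        # Pull the queries closest to the seed into the same batch
        remaining.sort(key=lambda q: jaccard(seed, q), reverse=True)
        batches.append([seed] + remaining[:batch_size - 1])
        remaining = remaining[batch_size - 1:]
    return batches
```

Swapping `jaccard` for real embedding similarity preserves the structure; only the distance function changes.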
Code Example
# Example BoT-R Reflector system prompt (batch size N=4)
# Literal JSON braces are doubled ({{ }}) so str.format only substitutes {N}
system_prompt = """
You are a reflection agent. Below are {N} question-answer pairs.
For each pair, compare it against all others in the batch to:
1. Identify inconsistencies or outlier reasoning
2. Extract shared domain patterns
3. Assign a peer confidence score (0.0–1.0)
4. Decide if re-evaluation is needed
Return a JSON list with one entry per question:
[
  {{
    "trigger_reevaluation": bool,
    "summary_comment": str,
    "confidence_score": float,
    "suggestions": str
  }}
]
"""
# Build the shared batch context fed to the Reflector
def build_batch_context(queries, answers, rationales):
    context = []
    for i, (q, a, r) in enumerate(zip(queries, answers, rationales)):
        context.append(f"--- Instance {i+1} ---\nQ: {q}\nA: {a}\nReasoning: {r}")
    return "\n\n".join(context)
# Usage example (q1..q4, a1..a4, r1..r4 are the batch's queries, answers, and rationales)
batch_context = build_batch_context(
    queries=[q1, q2, q3, q4],
    answers=[a1, a2, a3, a4],
    rationales=[r1, r2, r3, r4],
)
reflector_input = system_prompt.format(N=4) + "\n\n" + batch_context
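Once the model replies, the JSON list can be parsed and routed. This helper is a hypothetical sketch (not from the paper) assuming the Reflector returns exactly the format the prompt requests:

```python
import json

def route_reflections(reflector_output, batch_size):
    """Parse the Reflector's JSON list and return the indices
    of instances flagged for re-evaluation."""
    entries = json.loads(reflector_output)
    if len(entries) != batch_size:
        raise ValueError("Reflector must return one entry per instance")
    return [i for i, e in enumerate(entries) if e["trigger_reevaluation"]]
```

Only the flagged indices are sent back through the Actor, which is where the cost amortization comes from: unflagged instances finish after a single pass.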
Original Abstract
Current Large Language Model reasoning systems process queries independently, discarding valuable cross-instance signals such as shared reasoning patterns and consistency constraints. We introduce Batch-of-Thought (BoT), a training-free method that processes related queries jointly to enable cross-instance learning. By performing comparative analysis across batches, BoT identifies high-quality reasoning templates, detects errors through consistency checks, and amortizes computational costs. We instantiate BoT within a multi-agent reflection architecture (BoT-R), where a Reflector performs joint evaluation to unlock mutual information gain unavailable in isolated processing. Experiments across three model families and six benchmarks demonstrate that BoT-R consistently improves accuracy and confidence calibration while reducing inference costs by up to 61%. Our theoretical and experimental analysis reveals when and why batch-aware reasoning benefits LLM systems. Our code is available at https://github.com/xuanyang19/BoT