Batch-of-Thought: Cross-Instance Learning That Improves LLM Reasoning by Batching Multiple Queries

Batch-of-Thought: Cross-Instance Learning for Enhanced LLM Reasoning

Jan 6, 2026 • Xuan Yang, Furong Jia, Roy Xie +4

TL;DR Highlight

Having an LLM evaluate queries together in batches, instead of one at a time, raises accuracy while cutting cost by up to 61%.

Who Should Read

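AI and ML engineers who want to improve both inference cost and reliability in LLM-based agent pipelines, especially developers deploying LLMs on high-stakes decision tasks such as medical applications and fraud detection.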

Core Mechanics

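Instead of handling queries one by one, a Reflector agent compares and evaluates them in batches of 4 to 8 at once, cutting cost by 46.9% on average.
Referring to the responses for other queries in the same batch makes it possible to detect outliers, identify shared error patterns, and recalibrate confidence.
With GPT-4o, accuracy improves over standard Reflection by +4.7% on FraudDet and +2.9% on GPQA.
Confidence calibration (how well stated confidence matches actual accuracy) also improves: on the SMS Spam dataset, the KS statistic rises from 0.360 to 0.633 and ECE falls from 0.104 to 0.063.
In domains that require exact symbolic manipulation, such as math and physics, there is no benefit or a slight drop, because batch consensus can reinforce a wrong answer.
It is a training-free method: with no fine-tuning, it can be applied to any model, including GPT-4o, Llama-3.3-70B, and Qwen3-Next-80B.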

Evidence

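Across experiments on 6 benchmarks × 3 model families, BoT-R achieves the best performance in most settings compared with ReAct and Reflection.
With batch size 8 on GPT-4o, total cost on SMS Spam drops from $10.00 to $3.90 (61% reduction) and on GPQA from $7.46 to $4.59 (38% reduction).
With semantic batching, FraudDet improves from 0.693 to 0.768 (+10.8% relative to no batching) and SMS Spam from 0.854 to 0.902.
Looking at the Reflector stage alone, cost drops by up to 71% (SMS Spam, batch=8).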
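The percentages above mix two kinds of arithmetic: cost figures are reductions relative to the original cost, while the FraudDet gain is a relative improvement over the no-batch score (0.693 to 0.768 is +7.5 points in absolute terms). A minimal sketch that reproduces the reported numbers; the helper names `percent_reduction` and `relative_gain` are illustrative and not from the paper:

```python
def percent_reduction(before: float, after: float) -> float:
    """Cost saved as a percentage of the original cost."""
    return (before - after) / before * 100.0


def relative_gain(before: float, after: float) -> float:
    """Score improvement as a percentage of the original score."""
    return (after - before) / before * 100.0


# Figures taken from the Evidence list above
print(round(percent_reduction(10.00, 3.90)))   # SMS Spam total cost -> 61
print(round(percent_reduction(7.46, 4.59)))    # GPQA total cost -> 38
print(round(relative_gain(0.693, 0.768), 1))   # FraudDet with semantic batching -> 10.8
```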

How to Apply

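When several queries from the same domain arrive, group them in batches of 4 to 8, put all of them into the Reflector prompt, and ask it to "compare these responses with one another and return, for each, whether it needs re-evaluation and a confidence score."
In a streaming setting (real-time requests), plain sequential batching in arrival order is already effective; for offline batch processing, clustering with an embedding model such as E5-Mistral and grouping semantically close queries can yield further gains.
Do not apply this to tasks with a single clearly correct answer, such as math calculation or code debugging, because batch consensus can reinforce a wrong direction. Focus on classification, QA, and anomaly detection tasks that call for judgment and interpretation.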
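The two batching strategies can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: embeddings are assumed to be precomputed (for example with E5-Mistral) and passed in as plain lists of floats, and the greedy nearest-neighbour grouping in `semantic_batches` is a simple stand-in for whatever clustering the paper uses. All function names here are hypothetical.

```python
import math
from typing import List, Sequence


def sequential_batches(items: Sequence, batch_size: int = 4) -> List[list]:
    """Streaming case: group items into consecutive batches in arrival order."""
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_batches(embeddings: Sequence[Sequence[float]],
                     batch_size: int = 4) -> List[List[int]]:
    """Offline case (assumed greedy scheme): seed each batch with the first
    unassigned query, then add its most similar unassigned neighbours.
    Returns batches of query indices."""
    remaining = list(range(len(embeddings)))
    batches = []
    while remaining:
        seed = remaining.pop(0)
        # Rank the unassigned queries by similarity to the seed
        remaining.sort(key=lambda j: cosine(embeddings[seed], embeddings[j]),
                       reverse=True)
        batches.append([seed] + remaining[:batch_size - 1])
        # Restore index order so the next seed is the earliest leftover query
        remaining = sorted(remaining[batch_size - 1:])
    return batches
```

With four toy 2-D embeddings `[[1, 0], [0, 1], [0.9, 0.1], [0.1, 0.9]]` and `batch_size=2`, the semantic variant pairs queries 0 and 2 and queries 1 and 3, whereas sequential batching would pair 0 with 1 and 2 with 3.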

Code Example


# Example BoT-R Reflector system prompt (batch size N=4)
system_prompt = """
You are a reflection agent. Below are {N} question-answer pairs.

For each pair, compare it against all others in the batch to:
1. Identify inconsistencies or outlier reasoning
2. Extract shared domain patterns
3. Assign a peer confidence score (0.0–1.0)
4. Decide if re-evaluation is needed

Return a JSON list with one entry per question:
[
  {
    "trigger_reevaluation": bool,
    "summary_comment": str,
    "confidence_score": float,
    "suggestions": str
  }
]
"""

# Build the batch context from parallel lists of queries, answers, and rationales
def build_batch_context(queries, answers, rationales):
    context = []
    for i, (q, a, r) in enumerate(zip(queries, answers, rationales)):
        context.append(f"--- Instance {i+1} ---\nQ: {q}\nA: {a}\nReasoning: {r}")
    return "\n\n".join(context)

# Usage example: q1..q4, a1..a4, r1..r4 stand for your own queries, answers, and rationales
batch_context = build_batch_context(
    queries=[q1, q2, q3, q4],
    answers=[a1, a2, a3, a4],
    rationales=[r1, r2, r3, r4]
)
# str.format() would fail on the literal JSON braces in the prompt, so substitute {N} directly
reflector_input = system_prompt.replace("{N}", "4") + "\n\n" + batch_context
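Once the Reflector replies, its JSON list has to be validated and turned into a re-evaluation decision per instance. The sketch below assumes the reply is exactly the JSON list requested in the prompt, with real JSON values in place of the type names; the function names and the `min_confidence` threshold are hypothetical choices for illustration, not part of the paper.

```python
import json
from typing import List


def parse_reflector_output(raw: str, batch_size: int) -> List[dict]:
    """Parse the Reflector's JSON list and check it has one entry per instance."""
    entries = json.loads(raw)
    if not isinstance(entries, list) or len(entries) != batch_size:
        raise ValueError(f"expected a JSON list of {batch_size} entries")
    for entry in entries:
        score = float(entry["confidence_score"])
        # Clamp to the 0.0-1.0 range the prompt asks for
        entry["confidence_score"] = min(1.0, max(0.0, score))
        entry["trigger_reevaluation"] = bool(entry["trigger_reevaluation"])
    return entries


def indices_to_reevaluate(entries: List[dict], min_confidence: float = 0.5) -> List[int]:
    """Instances flagged by the Reflector, plus any below an assumed confidence floor."""
    return [
        i for i, e in enumerate(entries)
        if e["trigger_reevaluation"] or e["confidence_score"] < min_confidence
    ]
```

Only the returned indices need to be sent back to the answer-generating agent, together with the matching `suggestions` text.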

Terminology

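Confidence Calibration: When a model says "I'm 80% sure," it is well calibrated if it is actually right 80% of the time. LLMs are often confident even when wrong, so correcting this matters.

ECE (Expected Calibration Error): The average gap between the model's predicted confidence and its actual accuracy. The closer to 0, the more accurate the confidence estimates.

KS Statistic: Measures how well separated the confidence distributions are for correctly vs. incorrectly answered questions. Higher values mean the model's confidence better distinguishes right answers from wrong ones.

ReAct: An agent pattern that solves problems by alternating Reasoning (thinking) and Acting (tool use): think, search, then think again.

Reflector: A separate agent that reviews the answer produced by the Actor (the answer-generating agent) and gives feedback. In BoT it reviews multiple answers together by comparing them.

Cross-Instance Learning: Instead of solving each problem in isolation, several problems are considered together and cross-referenced to find patterns.

Semantic Batching: Grouping semantically similar queries together by clustering on similarity computed from embedding vectors.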
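The two calibration metrics defined above can be computed from per-query confidence scores and correctness labels. This is a sketch under assumptions: ECE uses 10 equal-width bins (a common default, not confirmed as the paper's setting), and KS is taken to be the standard two-sample Kolmogorov-Smirnov statistic between the confidence distributions of correct and incorrect answers.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


def ks_statistic(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Largest gap between the empirical CDFs of confidence for right vs. wrong answers."""
    right = [c for c, ok in zip(confidences, correct) if ok]
    wrong = [c for c, ok in zip(confidences, correct) if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect answer")

    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(right, x) - ecdf(wrong, x)) for x in right + wrong)
```

For a toy case with confidences `[0.9, 0.9, 0.2, 0.2]` where only the first two answers are correct, the two groups are perfectly separated (KS of 1.0), while ECE is 0.15 because the stated confidences are still off by 0.1 and 0.2 from the observed accuracies.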

Related Resources

https://github.com/xuanyang19/BoT

Original Abstract

Current Large Language Model reasoning systems process queries independently, discarding valuable cross-instance signals such as shared reasoning patterns and consistency constraints. We introduce Batch-of-Thought (BoT), a training-free method that processes related queries jointly to enable cross-instance learning. By performing comparative analysis across batches, BoT identifies high-quality reasoning templates, detects errors through consistency checks, and amortizes computational costs. We instantiate BoT within a multi-agent reflection architecture (BoT-R), where a Reflector performs joint evaluation to unlock mutual information gain unavailable in isolated processing. Experiments across three model families and six benchmarks demonstrate that BoT-R consistently improves accuracy and confidence calibration while reducing inference costs by up to 61%. Our theoretical and experimental analysis reveals when and why batch-aware reasoning benefits LLM systems. Our code is available at https://github.com/xuanyang19/BoT