Bayesian Linguistic Forecaster: An Agentic System That Forecasts the Future with Sequential Bayesian Updating

Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

Apr 20, 2026 • Kevin Murphy

TL;DR Highlight

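A forecasting system in which the LLM updates a JSON belief state, including its probability estimate, after every search. Ablations indicate that this Bayesian belief state contributes more to performance than web search does.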

Who Should Read

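Backend developers building LLM-based forecasting systems or research agents, and engineers who have hit the limits of simply accumulating context in a RAG pipeline and want more structured state management.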

Core Mechanics

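The core idea is the 'Bayesian Linguistic Belief State': at every search step the LLM updates a JSON object holding a probability estimate, an evidence summary, and open questions, instead of appending retrieved text to an ever-growing context as conventional agents do.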
Removing the belief state lowers performance (Brier Index) by 5.1, a larger drop than removing web search (3.4), which suggests that structured state management matters more than search itself.
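Multi-Trial Aggregation runs 5 independent trials in parallel and averages them, which reduces the high variance of LLM forecasts. Note that averaging helps the Brier Score (a quadratic loss) but in theory gives no gain on the Brier Index (a linear loss), because the linear loss of the averaged forecast equals the average of the individual losses.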
Calibration uses Hierarchical Platt Scaling (Platt scaling with per-source offsets). It outperforms global Platt Scaling in every setting and, in particular, avoids over-shrinking for sources with extreme base rates (e.g., the Wikipedia vaccine questions).
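The Sequential approach (search once, update the belief, repeat) scores 7.7 Brier Index points higher than the Batch approach (search in parallel, then reason once), making it the single largest performance factor.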
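Model ensembling does not help across models sharing the same architecture. With identical prompts, tools, and search results, diversity is too low (Jensen-Shannon Divergence of 0.006 to 0.014), and performance actually drops.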
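The diversity figure in the last point can be made concrete with a small sketch. The source does not say how the Jensen-Shannon Divergence was computed, so the choices here are assumptions: each forecast is treated as a Bernoulli distribution, the logarithm is base 2 (so the value lies between 0 and 1), and `binary_jsd` and `mean_pairwise_jsd` are hypothetical helper names.

```python
import math


def binary_jsd(p: float, q: float) -> float:
    """Jensen-Shannon divergence in bits between Bernoulli(p) and Bernoulli(q)."""

    def kl(a: float, b: float) -> float:
        # KL divergence between Bernoulli(a) and Bernoulli(b).
        # Terms with zero mass contribute nothing (0 * log 0 is taken as 0).
        total = 0.0
        for x, y in ((a, b), (1.0 - a, 1.0 - b)):
            if x > 0.0:
                total += x * math.log2(x / y)
        return total

    m = 0.5 * (p + q)  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_pairwise_jsd(forecasts_a: list[float], forecasts_b: list[float]) -> float:
    """Average JSD over questions for two models' forecasts on the same questions."""
    pairs = list(zip(forecasts_a, forecasts_b))
    return sum(binary_jsd(p, q) for p, q in pairs) / len(pairs)
```

Two models that always agree score 0, and two that are maximally opposed (0 versus 1) score 1, so values near 0.01 indicate forecasts that are almost interchangeable.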

Evidence

On 400 ForecastBench backtest questions, BLF reaches an overall Brier Index of 83.5, ahead of the runner-up GPT-5 (79.9), Cassi (79.5), Grok 4.20 (78.8), and Foresight-32B (79.2), with every difference significant at p<0.001.
Its Difficulty-Adjusted Brier Index (ABI) of 71.0 is on par with the median human superforecaster reported on the ForecastBench leaderboard (ABI=70.9).
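On market questions (Polymarket, etc.), BLF is the only system that significantly beats the crowd baseline (market price, BI=90.6), by +4.2 (p<0.001). GPT-5, Cassi, and Grok show no statistically significant difference from the market price.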
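A four-stage information-leakage defense keeps backtest contamination at or below 1.5%: in an audit of 2,272 search results, the runtime filter caught 93.8% of leaks, and 21 of the 1,375 results the agent actually saw (1.5%) were leaks.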
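The summary does not describe the four stages of the leakage defense, so the following is only a sketch of what one runtime stage might look like: dropping search results published after the question's cutoff date. `SearchResult` and `filter_by_cutoff` are hypothetical names, and discarding undated results is an assumed conservative policy, not something the source states.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SearchResult:
    url: str
    snippet: str
    published: Optional[date]  # None when the page carries no usable date


def filter_by_cutoff(results: list[SearchResult], cutoff: date) -> list[SearchResult]:
    """Keep results dated on or before the cutoff; drop undated ones to be safe."""
    return [r for r in results if r.published is not None and r.published <= cutoff]
```

A date filter alone cannot catch pages that were edited after publication, which is presumably why the source audits the results the agent actually saw.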

How to Apply

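In a RAG agent, instead of continually appending search results to the context, structure the prompt so that after every search the LLM updates a JSON belief state of the form {probability, confidence, evidence_for, evidence_against, open_questions}. This structured belief also informs the choice of the next search query.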
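When applying the Multi-Trial pattern (run the same question independently several times, K=5 recommended, and average), Jensen's inequality guarantees that averaging never hurts, and helps whenever trials disagree, if the evaluation metric is a quadratic loss (Brier Score). For metrics based on a linear loss, averaging has no theoretical effect, and median aggregation is slightly better.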
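In systems that need different calibration per source type (e.g., a mix of market data, economic indicators, and Wikipedia), applying Hierarchical Platt Scaling with a per-source intercept offset instead of global Platt Scaling can improve BI by +3.5 in the zero-shot setting.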
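The per-source calibration in the last item can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the calibrated logit is modeled as `a * logit(p_raw) + b + delta[source]`, and the hierarchical prior is approximated by an L2 penalty `lam` that pulls each source offset toward zero. The function names, the penalty form, and the gradient-descent settings are all assumptions.

```python
import numpy as np


def logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)  # avoid infinite logits at 0 and 1
    return np.log(p / (1 - p))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_hierarchical_platt(p_raw, y, source, n_sources, lam=1.0, lr=0.1, steps=2000):
    """Fit slope a, intercept b, and per-source offsets delta by gradient descent.

    Objective: mean log loss + (lam / n) * sum(delta ** 2).
    """
    x = logit(np.asarray(p_raw, dtype=float))
    y = np.asarray(y, dtype=float)
    source = np.asarray(source)
    n = len(y)
    a, b = 1.0, 0.0
    delta = np.zeros(n_sources)
    for _ in range(steps):
        err = sigmoid(a * x + b + delta[source]) - y  # gradient w.r.t. the logit
        grad_a = np.mean(err * x)
        grad_b = np.mean(err)
        grad_delta = (np.bincount(source, weights=err, minlength=n_sources)
                      + 2.0 * lam * delta) / n
        a -= lr * grad_a
        b -= lr * grad_b
        delta -= lr * grad_delta
    return a, b, delta


def apply_hierarchical_platt(p_raw, source, a, b, delta):
    x = logit(np.asarray(p_raw, dtype=float))
    return sigmoid(a * x + b + delta[np.asarray(source)])
```

With a large `lam` the offsets collapse to zero and the model reduces to global Platt Scaling; with a small `lam` each source can move toward its own base rate, which is the behavior the item above describes for sources with skewed base rates.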

Code Example

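# Example structure of the BLF Bayesian linguistic belief state.
# The prompt asks the LLM to emit this JSON alongside every tool call.
import json
from typing import Callable

import numpy as np

belief_state_schema = {
    "probability": 0.5,       # P(outcome=1 | evidence so far)
    "confidence": "low",      # low / medium / high
    "evidence_for": [],       # summaries of evidence supporting the outcome
    "evidence_against": [],   # summaries of evidence against the outcome
    "open_questions": [],     # questions to search for next
    "update_reasoning": "",   # why the probability changed at this step
}

TOOL_NAMES = ["web_search", "read_files", "url_lookup", "submit"]


# Agent loop sketch. llm_call and execute_tool are hypothetical callables that
# wrap your own LLM client and tool runner; they are not a real library API.
# llm_call must return an object with .tool_call (having .name and .args) and
# .updated_belief (a dict shaped like belief_state_schema).
def blf_agent(question: str, cutoff_date: str, llm_call: Callable,
              execute_tool: Callable, max_steps: int = 10) -> float:
    belief = {"probability": 0.5, "confidence": "low",
              "evidence_for": [], "evidence_against": [],
              "open_questions": [question], "update_reasoning": "Prior"}
    history = [{"role": "user", "content": question}]

    for _ in range(max_steps):
        system_prompt = (
            "After each action, update your belief state as JSON:\n"
            f"{json.dumps(belief_state_schema)}\n"
            f"Current belief: {json.dumps(belief)}\n"
            f"Cutoff date: {cutoff_date} - do not use information after this date."
        )
        # The LLM produces the next action and the updated belief in one call.
        # Key point: the tool-call arguments include an updated_belief field.
        response = llm_call(messages=history, tools=TOOL_NAMES,
                            system_prompt=system_prompt)
        action = response.tool_call
        new_belief = response.updated_belief  # generated together with the action

        if action.name == "submit":
            return float(action.args["probability"])

        # Run the tool (including date filtering).
        observation = execute_tool(action, cutoff_date=cutoff_date)

        # Update the history and carry the new belief into the next step.
        history.append({"role": "assistant", "tool_call": action, "belief": new_belief})
        history.append({"role": "tool", "content": observation})
        belief = new_belief

    return float(belief["probability"])  # max_steps reached: return the last belief


# Multi-Trial Aggregation
def aggregate_trials(forecasts: list[float], metric: str = "brier_score") -> float:
    """Mean for convex losses (Brier score, log score); median for the linear Brier Index."""
    if metric in ("brier_score", "log_score"):
        return float(np.mean(forecasts))  # Jensen's inequality: never worse than the average trial
    return float(np.median(forecasts))    # brier_index (linear); the source reports +0.2 BI


# K independent runs (sequential here for simplicity; the source runs them in parallel).
def forecast(question: str, cutoff_date: str, llm_call: Callable,
             execute_tool: Callable, k: int = 5) -> float:
    forecasts = [blf_agent(question, cutoff_date, llm_call, execute_tool)
                 for _ in range(k)]
    return aggregate_trials(forecasts)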

Terminology

Brier Score: The squared difference between the forecast and the actual outcome (0 or 1). Lower is better; a score of 0 is a perfect forecast.

Brier Index: A more readable transformation of the Brier Score, 100 × (1 - |forecast - outcome|). Higher is better, and always predicting 0.5 scores 50.
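The two metrics can be written down directly from the definitions given on this page. This is a sketch of those definitions only; the official ForecastBench scoring code may aggregate or adjust scores differently, and the function names are hypothetical.

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Quadratic loss for a single binary question; lower is better."""
    return (forecast - outcome) ** 2


def brier_index(forecast: float, outcome: int) -> float:
    """Per-question index as defined on this page: 100 * (1 - |forecast - outcome|)."""
    return 100.0 * (1.0 - abs(forecast - outcome))


def mean_metric(metric, forecasts: list[float], outcomes: list[int]) -> float:
    """Average a per-question metric over a set of questions."""
    scores = [metric(p, y) for p, y in zip(forecasts, outcomes)]
    return sum(scores) / len(scores)
```

For example, `mean_metric(brier_index, [0.5, 0.5], [1, 0])` gives 50.0, matching the "always predict 0.5" baseline mentioned above.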

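Platt Scaling: A post-processing technique that corrects how well a model's output probabilities match reality (calibration) by re-mapping the raw probabilities with a logistic regression.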

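Hierarchical Platt Scaling: A calibration scheme that, instead of applying one correction function to everything, adds a separate adjustment per data source. It is more effective when sources have different characteristics.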

Calibration: Whether a model that says '70% probability' is actually right 70% of the time. In a well-calibrated model, confidence matches accuracy.

James-Stein Shrinkage: A statistical technique that pulls multiple estimates slightly toward a common mean, accepting a little bias in exchange for lower overall estimation error.
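As an illustration of the general technique, here is the textbook positive-part James-Stein estimator that shrinks toward the grand mean. It assumes every estimate has the same known noise variance and needs at least four estimates. How BLF actually applies shrinkage (the abstract mentions logit-space shrinkage with a data-dependent prior) is not detailed on this page, so this sketch is not the paper's procedure, and `james_stein_toward_mean` is a hypothetical name.

```python
import numpy as np


def james_stein_toward_mean(estimates, noise_var: float) -> np.ndarray:
    """Positive-part James-Stein shrinkage of k >= 4 estimates toward their mean."""
    x = np.asarray(estimates, dtype=float)
    k = x.size
    if k < 4:
        return x.copy()  # the grand-mean form needs at least four estimates
    center = x.mean()
    spread = np.sum((x - center) ** 2)
    if spread == 0.0:
        return x.copy()  # all estimates already equal the mean
    # Shrinkage factor, clipped at zero so estimates never cross the mean.
    factor = max(0.0, 1.0 - (k - 3) * noise_var / spread)
    return center + factor * (x - center)
```

The noisier the individual estimates are relative to their spread, the smaller the factor and the harder they are pulled toward the common mean.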

POMDP: Partially Observable Markov Decision Process. A mathematical framework for decision-making when the agent can observe only part of the environment; closer to an imperfect-information game like poker than to chess.

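Jensen's Inequality: For a convex function f, f(E[X]) ≤ E[f(X)], i.e., the function of the mean is at most the mean of the function. It guarantees that the quadratic loss (Brier Score) of an averaged forecast is never worse than the average loss of the individual forecasts.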
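A quick numeric check makes the guarantee tangible. The sketch below compares the loss of the averaged forecast with the average of the per-trial losses, for both the quadratic loss and the linear loss behind the Brier Index; the trial values used in the note that follows are made up for illustration, and `averaging_effect` is a hypothetical name.

```python
import numpy as np


def averaging_effect(trial_forecasts, outcome: int) -> dict:
    """Compare loss-of-the-mean with mean-of-the-losses for one binary question."""
    p = np.asarray(trial_forecasts, dtype=float)
    mean_p = p.mean()
    return {
        "quadratic_of_mean": float((mean_p - outcome) ** 2),
        "mean_of_quadratic": float(np.mean((p - outcome) ** 2)),
        "linear_of_mean": float(abs(mean_p - outcome)),
        "mean_of_linear": float(np.mean(np.abs(p - outcome))),
    }
```

For trials `[0.6, 0.7, 0.9, 0.8, 0.5]` with outcome 1, the quadratic loss of the mean is 0.09 against an average per-trial loss of 0.11, and the gap of 0.02 is exactly the variance of the trials. The two linear quantities are both 0.3, which is why averaging cannot improve a linear metric.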

Original Abstract

We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing context. (2) Hierarchical multi-trial aggregation: running K independent trials and combining them using logit-space shrinkage with a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates. On 400 backtesting questions from the ForecastBench leaderboard, BLF outperforms all the top public methods, including Cassi, GPT-5, Grok 4.20, and Foresight-32B. Ablation studies show that the structured belief state is as impactful as web search access, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains. In addition, we develop a robust back-testing framework with a leakage rate below 1.5%, and use rigorous statistical methodology to compare different methods while controlling for various sources of noise.