Bridging the Last Mile of Time Series Forecasting with LLM Agents | AI Paper Digest

TL;DR Highlight

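An LLM agent framework that automatically revises forecasts produced by statistical models, using holiday, campaign, and external-event context, until they are usable in real business settings.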

Who Should Read

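ML engineers and data scientists who run a demand forecasting pipeline and want to automate the manual adjustment of statistical model outputs. Especially relevant for teams in domains where public holidays and promotions have a large effect, such as airlines, commerce, and logistics.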

Core Mechanics

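Prior forecasting research has focused on building a good baseline, but in practice there is a separate 'last-mile' stage in which that baseline is revised using holidays, campaigns, and expert feedback.
The LLM agent does not replace the forecaster. It acts only as a context-driven revision layer on top of a baseline produced by an existing foundation model such as TimesFM.
Forecast revisions never overwrite numbers directly. They go through a restricted action interface such as range_transform (multiply or add over a date range) and point_override (replace the value on a specific day), which is what makes the revision history auditable.
Long-horizon forecasts (126 days) are handled map-reduce style: the main agent identifies event windows, runs an independent local reasoner for each window, and merges the resulting revision proposals.
A memory bank records forecasting mistakes: once actuals arrive, the system compares baseline, revised, and actual values, stores the result as a structured lesson, and retrieves it in later forecasting sessions.
The implementation is built on smolagents (Hugging Face's open-source library), runs as a code-executing agent, and uses five tools: forecasting tool, historical retrieval, holiday lookup, memory query, and map-reduce planner.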
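
The restricted action interface described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: only the action names range_transform and point_override come from the source, while the Workspace class, its fields, the method signatures, and all dates and values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Workspace:
    """Hypothetical workspace: the baseline is read-only, only y_final is revised."""
    y_baseline: dict
    y_final: dict = field(default_factory=dict)
    adjustment_log: list = field(default_factory=list)

    def __post_init__(self):
        # Start the revisable series as a copy so the baseline is never edited.
        self.y_final = dict(self.y_baseline)

    def _log(self, action, **details):
        # Refuse any revision that arrives without evidence, then record it.
        if not details.get("evidence"):
            raise ValueError("every revision needs concrete evidence")
        self.adjustment_log.append({"action": action, **details})

    def range_transform(self, start, end, *, multiply=1.0, add=0.0, evidence, confidence):
        """Scale and/or shift y_final for every day in [start, end]."""
        self._log("range_transform", start=start, end=end, multiply=multiply,
                  add=add, evidence=evidence, confidence=confidence)
        for day in self.y_final:
            if start <= day <= end:
                self.y_final[day] = self.y_final[day] * multiply + add

    def point_override(self, day, value, *, evidence, confidence):
        """Replace the value of y_final on a single day."""
        if day not in self.y_final:
            raise KeyError(f"{day} is outside the forecast horizon")
        self._log("point_override", day=day, value=value,
                  evidence=evidence, confidence=confidence)
        self.y_final[day] = value


# Illustrative five-day horizon with a flat baseline (made-up numbers).
day0 = date(2025, 1, 27)
ws = Workspace(y_baseline={day0 + timedelta(days=i): 100.0 for i in range(5)})
ws.range_transform(day0 + timedelta(days=1), day0 + timedelta(days=2), multiply=1.5,
                   evidence="hypothetical: same window last year ran 1.5x baseline",
                   confidence=0.7)
ws.point_override(day0 + timedelta(days=4), 80.0,
                  evidence="hypothetical: planned maintenance closure", confidence=0.6)

for entry in ws.adjustment_log:
    print(entry["action"], "|", entry["evidence"])
```

Because every change passes through one of two methods that write to adjustment_log, the log is a complete record of how y_final diverged from y_baseline, which is the auditability property the digest highlights.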

Evidence

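Spring Festival holiday window, MAPE: Prophet 155.95% and TimesFM 262.76%, versus 32.84% for this framework. MAE is 80.0% lower than Prophet and 88.2% lower than TimesFM.
126-day long-horizon forecast, overall MAPE: Prophet 27.3% and TimesFM 38.1%, versus 18.91% for this framework. It records the lowest error in all four event windows (New Year MAE 28.42, Labor Day MAE 58.96).
Self-improvement experiment (W3): without memory, MAE 92.94 / MAPE 7.96%; with memory, MAE 60.10 / MAPE 5.23%. The memory bank learned the trend of the baseline shortfall growing from 1.025 to 1.181 and used it as a hint to revise upward.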
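
To put the self-improvement numbers on the same footing as the "X% lower" figures quoted for the holiday window, the relative reduction can be computed directly from the reported values. The helper name relative_reduction is an assumption for this sketch; the inputs are the W3 figures quoted above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# W3 self-improvement experiment: without memory -> with memory.
mae_gain = relative_reduction(92.94, 60.10)
mape_gain = relative_reduction(7.96, 5.23)

print(f"MAE reduction with memory:  {mae_gain:.1f}%")   # 35.3%
print(f"MAPE reduction with memory: {mape_gain:.1f}%")  # 34.3%
```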

How to Apply

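If your current workflow is to forecast demand with Prophet or TimesFM and then have someone adjust for holidays in Excel, use this framework's structure as a reference and turn the adjustment logic into an agent built around a holiday_lookup tool and a range_transform action.
For long horizons (30+ days) where multiple events overlap, do not try to revise everything in a single LLM call. Run a separate local reasoner per event window in a map-reduce pattern, which keeps the context of one event from bleeding into another.
When actuals arrive after a revision, do not just log the error. Store, in structured form, which multiplier was applied for which event and what the realized multiplier turned out to be, so it can serve as retrieval-based prior information the next time the same event is forecast.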
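
The structured logging suggested in the last point could look like the sketch below. It is one possible shape, not the paper's memory bank: the Lesson record, the record_lesson and prior_multiplier helpers, the event name, and all numbers are hypothetical, and retrieval here is a plain exact match on the event name.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Lesson:
    event: str
    applied_multiplier: float   # revised mean / baseline mean (what the agent did)
    realized_multiplier: float  # actual mean / baseline mean (what really happened)


def record_lesson(event, baseline, revised, actual) -> Lesson:
    """Compare baseline, revised, and actual values once actuals are available."""
    base = mean(baseline)
    return Lesson(event, mean(revised) / base, mean(actual) / base)


def prior_multiplier(bank, event):
    """Average realized multiplier for past occurrences of `event`, or None."""
    matches = [lesson.realized_multiplier for lesson in bank if lesson.event == event]
    return mean(matches) if matches else None


# Made-up example: the agent applied 1.2x, but the event really ran at 1.5x.
bank = [record_lesson("spring_festival",
                      baseline=[100.0, 100.0],
                      revised=[120.0, 120.0],
                      actual=[150.0, 150.0])]

print(bank[0])
print(prior_multiplier(bank, "spring_festival"))
print(prior_multiplier(bank, "unknown_event"))
```

Storing both multipliers, rather than only the error, is what lets the next session start from the realized effect instead of a fresh guess.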

Code Example

## Main agent prompt: core structure (based on Appendix B of the paper)

system_prompt = """
## Role
You are a forecast-revision assistant. A time-series foundation
model provides the numerical baseline; your job is to make evidence-backed,
auditable revisions to that baseline. Do not replace the baseline model.

## Workspace contract
The sandbox exposes a dataframe 'df' with columns:
  ds, y, y_baseline, y_final
and a helper-owned 'adjustment_log'. 'last_reflection_summary' carries
post-mortem lessons from previous realized forecasts.

## Non-negotiable rules
- Observed 'y' is immutable; existing 'y_baseline' is immutable.
- Only revise 'y_final', and only through the provided helpers.
- Every applied revision needs concrete 'evidence' and a 'confidence'.
- Do not stack multiple revisions on the same range without distinct evidence.
- If the baseline already captures an event, skip the revision.

## Workflow
1. Inspect the series (range, frequency, trend, anomalies).
2. Ensure the baseline exists; otherwise call forecast_tool, then append_forecast.
3. Consult last_reflection_summary; prefer realized lessons over fresh guesses.
4. For each calendar-relevant event, gather evidence.
5. If horizon contains MORE THAN ONE event:
     → call run_map_reduce_planners(tasks, context)
     → then apply_json_policies
   For a single event:
     → edit y_final directly via adjust_by_date_range or override_forecast_values
6. Self-review adjustment_log for empty evidence or implausible impact.

## Revision policy (evidence priority)
realized multipliers from reflection > memory critiques > historical same-period ratios > user instructions.
"""

## Self-improvement prompt (run after actual values arrive)
reflection_prompt = f"""
You are a forecast-revision post-mortem expert.
Event: {event}
Action taken: {code}

Realized comparison:
- Baseline (y_baseline mean): {baseline_mean:.2f}
- Agent-revised (y_final mean): {final_mean:.2f}
- Actual (y_actual mean): {actual_mean:.2f}

The agent's intervention increased the error.
Summarize the lesson in one sentence as a critique for future similar events.
Output a single plain-text sentence, no formatting.
"""

Terminology

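Last-mile forecasting: The final stage in which forecasts produced by a statistical model are revised with contextual information (public holidays, campaigns, and so on) so that they can be used in real business settings. Like last-mile delivery in parcel logistics, it is the final and most labor-intensive leg of the whole process.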

TimesFM: A time-series forecasting foundation model built by Google. It can forecast a variety of time series zero-shot, without separate training per domain.

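Forecast workspace: A shared state table that manages the forecast revision process. It keeps past actuals (y), the baseline forecast (y_baseline), and the forecast under revision (y_final) on the same time index, so what the agent changed can be traced.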

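Map-reduce: Originally a pattern for large-scale data processing. Here it means splitting a long forecast horizon into event windows (map), processing each one separately, and merging the results (reduce).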
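
A minimal sketch of this split-then-merge idea follows. Everything in it is an assumption for illustration: local_reasoner is a stand-in for an independent per-window LLM call, HISTORICAL_RATIO is a made-up lookup table, and the rule of skipping overlapping windows is just one simple way to avoid stacking revisions on the same dates.

```python
from datetime import date, timedelta

# Hypothetical same-period ratios; a real local reasoner would gather
# evidence through tool calls instead of reading a hard-coded table.
HISTORICAL_RATIO = {"event_a": 1.3, "event_b": 0.8}


def local_reasoner(window: dict) -> dict:
    """Map step: produce one revision proposal for one event window."""
    return {
        "event": window["event"],
        "start": window["start"],
        "end": window["end"],
        "multiply": HISTORICAL_RATIO[window["event"]],
        "evidence": f"historical same-period ratio for {window['event']}",
    }


def apply_proposals(y_baseline: dict, proposals: list) -> dict:
    """Reduce step: merge all proposals into a single revised series."""
    y_final = dict(y_baseline)
    touched = set()
    for p in proposals:
        days = [d for d in y_final if p["start"] <= d <= p["end"]]
        if touched.intersection(days):
            continue  # skip overlapping windows rather than stacking revisions
        for d in days:
            y_final[d] *= p["multiply"]
        touched.update(days)
    return y_final


day0 = date(2025, 1, 1)
y_baseline = {day0 + timedelta(days=i): 100.0 for i in range(10)}
windows = [
    {"event": "event_a", "start": day0, "end": day0 + timedelta(days=1)},
    {"event": "event_b", "start": day0 + timedelta(days=6), "end": day0 + timedelta(days=7)},
]

proposals = [local_reasoner(w) for w in windows]  # map: one call per window
y_final = apply_proposals(y_baseline, proposals)  # reduce: merge into y_final

for d in sorted(y_final):
    print(d, round(y_final[d], 2))
```

Each call to local_reasoner sees only its own window, which is the point of the pattern: evidence for one event cannot leak into the reasoning about another.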

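Memory bank: A long-term store that keeps past forecasting mistakes in structured form. When a similar event comes up in a later forecast, the memory is searched and lessons such as 'last time the adjustment for this holiday was too large' are applied.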

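MAE / MAPE: Forecast accuracy metrics. MAE (mean absolute error) is the average absolute difference between forecast and actual values; MAPE (mean absolute percentage error) expresses that difference as a percentage of the actual value. Lower is more accurate.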
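
Both metrics are short enough to write out. The sketch below uses the standard definitions with made-up numbers; the function names are chosen for this example. It also shows why MAPE can exceed 100%, as in the holiday-window figures in the Evidence section: a forecast that misses by more than the actual value itself contributes more than 100% for that point.

```python
def mae(actual, forecast):
    """Mean absolute error, in the same units as the series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Undefined if any actual is 0."""
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

print(f"MAE:  {mae(actual, forecast):.2f}")    # 10.00
print(f"MAPE: {mape(actual, forecast):.2f}%")  # 6.67%

# A single point where the forecast is three times the actual gives 200% MAPE.
print(f"MAPE: {mape([100.0], [300.0]):.2f}%")  # 200.00%
```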

Zero-shot: Forecasting directly, without separate training (fine-tuning) on data from the specific domain. Similar to solving a new problem on first sight.

Original Abstract

Time series forecasting has advanced rapidly, especially with the emergence of foundation models that show strong zero-shot performance on numerical extrapolation. However, in real-world forecasting settings, a statistically plausible baseline is rarely the final forecast used in practice. Before a forecast becomes decision-ready, it often needs to be revised using weakly structured business context such as holiday effects, campaign plans, external events, historical analogs, and expert feedback. This practical stage remains underexplored in the forecasting literature. In this paper, we formulate this stage as the last-mile forecasting problem and present an LLM-agent framework that sits on top of a forecasting backbone. Our system maintains a unified forecast workspace, invokes tools to retrieve contextual evidence, and converts reasoning trajectories into explicit forecast revision actions under structural safety constraints. It also supports long-horizon forecasting through map-reduce-style decomposition and post-hoc reflection through a memory bank. The resulting system is designed to be controllable and auditable. Through real-world case studies, we show how LLM agents can bridge the gap between statistical prediction and business-ready forecasting.