TimeCAP: Contextualizing, Augmenting, and Predicting Time Series Events with LLM Agents

TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents

Feb 17, 2025 • Geon Lee, Wenchao Yu, Kijung Shin +2

TL;DR Highlight

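TimeCAP splits the work across two GPT-4 agents, one that contextualizes the time series as a text summary and one that predicts from that summary, and reports an average F1 improvement of 28.75% over state-of-the-art methods.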

Who Should Read

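ML engineers designing time series event classification pipelines in weather, finance, or healthcare domains. Developers who want to use an LLM as a data-preprocessing agent rather than only as a predictor.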

Core Mechanics

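The LLM is used first as a contextualizer, not only as a predictor: GPT-4 agent #1 converts the time series into a text summary, and agent #2 predicts the event from that summary (a two-agent structure).
A Multi-Modal Encoder (BERT + Patch Transformer) is trained jointly on the raw time series and its text summary to produce embeddings. These embeddings are used to retrieve the k=5 most similar examples from the training set, which are attached to the GPT-4 prediction prompt as in-context examples.
Linearly combining the encoder's prediction and the LLM's prediction with a weight λ (Fused Prediction) outperforms either one alone.
Even in the zero-shot setting with 0% training data, it achieves higher F1 than existing LLM-based methods (PromptCast, LLMTime), so it remains useful when data is scarce.
Predictions are made interpretable in two ways: Implicit (the LLM generates a rationale directly) and Explicit (the most similar in-context example is identified).
Seven real-world datasets (three weather, two finance, two healthcare), together with the GPT-4-generated text summaries, are released on GitHub.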
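The Fused Prediction step described above can be sketched as a convex combination of two class distributions. This is a minimal illustration, not the authors' implementation: it assumes the LLM's answer is turned into a one-hot distribution over the labels and that λ weights the encoder side. The function names `fuse_predictions` and `predict_label` are hypothetical.

```python
from typing import Dict


def fuse_predictions(encoder_probs: Dict[str, float], llm_label: str, lam: float = 0.5) -> Dict[str, float]:
    """Linearly combine encoder class probabilities with the LLM's predicted label.

    Assumption: the LLM prediction is treated as a one-hot distribution and
    `lam` is the weight given to the encoder (1 - lam goes to the LLM).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if llm_label not in encoder_probs:
        raise ValueError(f"unknown label from LLM: {llm_label!r}")
    return {
        label: lam * prob + (1.0 - lam) * (1.0 if label == llm_label else 0.0)
        for label, prob in encoder_probs.items()
    }


def predict_label(fused_probs: Dict[str, float]) -> str:
    """Return the label with the highest fused score."""
    return max(fused_probs, key=fused_probs.get)


# The encoder leans toward "Not Rain", the LLM says "Rain".
fused = fuse_predictions({"Rain": 0.4, "Not Rain": 0.6}, llm_label="Rain", lam=0.5)
print(fused, predict_label(fused))
```

Setting `lam=1.0` recovers the encoder-only prediction and `lam=0.0` the LLM-only prediction, so in practice λ would be tuned on a validation split.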

Evidence

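Average F1 score improves by 28.75% over the previous SOTA, with gains of up to 157% on some datasets.
TimeCP, which adds only contextualization, also beats the zero-shot LLM-based baseline (PromptCast) on every dataset. For example, F1 on NY weather rises from 0.499 to 0.625.
In-context sampling with the Multi-Modal Encoder yields higher F1 than PatchTST-based KNN: 0.657 → 0.736 in the healthcare domain (measured with a KNN classifier).
Even when training data is cut to 10%, TimeCAP's performance drops less than that of PatchTST and GPT4TS (Figure 4).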
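To relate the absolute F1 values above to percentage gains, the short sketch below computes the relative change for the two F1 pairs quoted in this list. It assumes that the percentages are relative improvements over the baseline, which this summary does not state explicitly. The helper name `relative_improvement_pct` is hypothetical.

```python
def relative_improvement_pct(baseline: float, improved: float) -> float:
    """Relative change from `baseline` to `improved`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# F1 pairs quoted in the Evidence list above.
reported_pairs = {
    "NY weather, PromptCast -> TimeCP": (0.499, 0.625),
    "Healthcare, PatchTST KNN -> Multi-Modal Encoder KNN": (0.657, 0.736),
}

for name, (before, after) in reported_pairs.items():
    gain = relative_improvement_pct(before, after)
    print(f"{name}: {gain:.2f}% relative F1 gain")
```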

How to Apply

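Add a GPT-4 call as a preprocessing step in an existing time series classification pipeline: pass the raw numeric time series in a prompt, ask GPT-4 to summarize the domain context of the data as text, and then use that summary as a feature.
If enough training data is available, fine-tune a multi-modal encoder built from a small LM such as BERT plus a Patch Transformer. At inference time, retrieve the five most relevant training examples by embedding similarity and attach them to the GPT-4 prompt, in a RAG-style pattern.
In a cold-start situation with almost no data, the TimeCP structure alone (two agents, no training) can be expected to outperform existing zero-shot approaches.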
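The first step above needs the raw numbers turned into a prompt string. The sketch below shows one possible serialization, assuming a dictionary of named channels and one decimal place of precision. The function names, the channel names, and the prompt wording are illustrative assumptions, not the format used in the paper.

```python
from typing import Dict, Sequence


def serialize_series(channels: Dict[str, Sequence[float]], precision: int = 1) -> str:
    """Render named channels as 'name: [v1, v2, ...]' pairs joined by commas."""
    parts = []
    for name, values in channels.items():
        formatted = ", ".join(f"{v:.{precision}f}" for v in values)
        parts.append(f"{name}: [{formatted}]")
    return ", ".join(parts)


def build_contextualize_prompt(channels: Dict[str, Sequence[float]], domain: str) -> str:
    """Build a summarization prompt for the contextualizer agent (hypothetical wording)."""
    return (
        f"The following is {domain} time series data in chronological order.\n"
        "Summarize its patterns and domain context as text.\n\n"
        f"Data: {serialize_series(channels)}"
    )


print(build_contextualize_prompt({"temp": [15.23, 14.8], "humidity": [72, 75]}, domain="weather"))
```

Rounding before serialization keeps the prompt short, which matters because every value costs tokens in the GPT-4 call.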

Code Example

# TimeCAP core flow (pseudocode-level sketch)
from openai import OpenAI
from sentence_transformers import util

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_gpt4(prompt: str) -> str:
    """Send a single-turn prompt to GPT-4 and return the text of the reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Step 1: Agent AC: time series -> text summary
def contextualize(time_series_str: str) -> str:
    prompt = f"""The following is weather time series data from the past 24 hours.
Temperature, humidity, air pressure, wind speed, and wind direction values are listed in chronological order.
Summarize the weather patterns and domain context of this data as text, from an expert's perspective.

Data: {time_series_str}"""
    return ask_gpt4(prompt)


# Step 2: retrieve similar examples using Multi-Modal Encoder embeddings
def retrieve_in_context_examples(query_embedding, train_embeddings, train_summaries, train_labels, k=5):
    scores = util.cos_sim(query_embedding, train_embeddings)[0]
    k = min(k, len(train_summaries))  # topk raises if k exceeds the number of candidates
    top_k_idx = scores.topk(k).indices.tolist()  # plain ints for list indexing
    return [(train_summaries[i], train_labels[i]) for i in top_k_idx]


# Step 3: Agent AP: context summary + in-context examples -> event prediction
def predict_with_context(summary: str, in_context_examples: list) -> str:
    examples_str = "\n".join(
        f"Example {i + 1}: {text}\nOutcome: {label}"
        for i, (text, label) in enumerate(in_context_examples)
    )
    prompt = f"""Refer to the past cases below to predict the event for the current situation.

[Past cases]
{examples_str}

[Summary of the current situation]
{summary}

Predict whether it will rain tomorrow and explain your reasoning. Answer: Rain or Not Rain"""
    return ask_gpt4(prompt)


# Usage example
# query_emb, train_embs, train_summaries, and train_labels are assumed to come from
# a trained Multi-Modal Encoder and the training set; they are not defined here.
ts_data = "temp: [15.2, 14.8, 14.1, ...], humidity: [72, 75, 80, ...]"
summary = contextualize(ts_data)
examples = retrieve_in_context_examples(query_emb, train_embs, train_summaries, train_labels)
result = predict_with_context(summary, examples)
print(result)

Terminology

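LMaaS: Language Model as a Service. Using an LLM purely through API calls, as with the GPT-4 API, without touching the model internals. A black-box setting in which the internal weights are inaccessible.

In-Context Learning: Placing a few examples inside the LLM prompt so that the model infers the pattern from them and applies it to a new input. It steers model behavior with examples alone, without any training.

Patching: Splitting a time series into small fixed-length segments (patches) before feeding it to a Transformer. It works on the same principle as patch embeddings for images and handles long time series efficiently.

CLS Token: A special token in language models such as BERT that represents the whole sentence. Its embedding is used as a summary vector for the sentence.

Multi-Modal Encoder: A model that takes two different forms of data, text and time series, at the same time and produces a single representation vector. It is similar to how the brain integrates information from several sensory organs.

Zero-shot: Solving a previously unseen problem directly, without training examples. It is like taking an exam without studying.

AUROC: A metric (0 to 1) of how well a model separates positives from negatives. Closer to 1 is better, and 0.5 is coin-flip level.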
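The Patching entry above can be made concrete with a few lines of code. The sketch below cuts a one-dimensional series into fixed-length windows. The parameters `patch_len` and `stride` and the function name `make_patches` are illustrative assumptions; the patch settings used in the paper are not given in this summary.

```python
from typing import List, Sequence


def make_patches(series: Sequence[float], patch_len: int, stride: int) -> List[List[float]]:
    """Split a series into windows of length `patch_len`, moving `stride` steps each time.

    Trailing values that do not fill a complete patch are dropped.
    """
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [list(series[start:start + patch_len]) for start in range(0, last_start + 1, stride)]


hourly_temps = [15.2, 14.8, 14.1, 13.9, 13.5, 13.0]
for patch in make_patches(hourly_temps, patch_len=4, stride=2):
    print(patch)
```

Each patch then plays the role of one input token for the Transformer, so a stride equal to `patch_len` gives non-overlapping patches and a smaller stride gives overlapping ones.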

Original Abstract

Time series data is essential in various applications, including climate modeling, healthcare monitoring, and financial analytics. Understanding the contextual information associated with real-world time series data is often essential for accurate and reliable event predictions. In this paper, we introduce TimeCAP, a time-series processing framework that creatively employs Large Language Models (LLMs) as contextualizers of time series data, extending their typical usage as predictors. TimeCAP incorporates two independent LLM agents: one generates a textual summary capturing the context of the time series, while the other uses this enriched summary to make more informed predictions. In addition, TimeCAP employs a multi-modal encoder that synergizes with the LLM agents, enhancing predictive performance through mutual augmentation of inputs with in-context examples. Experimental results on real-world datasets demonstrate that TimeCAP outperforms state-of-the-art methods for time series event prediction, including those utilizing LLMs as predictors, achieving an average improvement of 28.75% in F1 score.