Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

Jan 26, 2026 • Yibo Li, Zijie Lin, Ailin Deng, and 5 others

TL;DR Highlight

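A deployed LLM agent performs reinforcement learning at inference time using only an experience memory, with no gradient updates, and outperforms WebRL while costing more than 30 times less.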

Who Should Read

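AI engineers who run LLM agents on repetitive tasks such as web automation, game playing, or RPA and want to improve performance without retraining the model. Especially relevant for developers who want their agents to learn from experience while keeping API costs low.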

Core Mechanics

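Past experience is stored in a memory bank as <state, action, reward> triplets; at inference time, similar situations are retrieved and used to estimate each action's advantage (how much better the action is than average).
The advantage estimates are added directly to the LLM's output logits to update the policy; the model parameters are never modified.
The paper proves mathematically that this logit update rule is the exact closed-form solution to the KL-constrained policy optimization problem.
Previously unseen actions receive a UCB term (an uncertainty-based exploration bonus), which balances exploration and exploitation automatically; the bonus shrinks as memory accumulates.
The method is backbone-agnostic: it works with multiple models, including Gemini-2.5-flash, GPT-5-mini, and DeepSeek-V3.2, and supports cross-task transfer to tasks it has not learned on.
After each episode, an LLM Evaluator automatically assigns a reward to every step, which addresses the credit assignment problem (deciding which actions contributed to success).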
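
The closed-form claim above can be checked numerically. The sketch below assumes the objective is written as J(π) = E_π[A] − (1/β)·KL(π ‖ π₀), which is the parameterization that makes an additive shift of β·A on the logits optimal; the paper's exact notation may differ. The logits and advantages are made-up illustrative values, and `kl_regularized_objective` is a hypothetical helper name.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_regularized_objective(pi, pi0, advantages, beta):
    """J(pi) = E_pi[A] - (1/beta) * KL(pi || pi0)  (assumed form of the objective)."""
    pi = np.clip(np.asarray(pi, dtype=float), 1e-12, None)  # guard against log(0)
    kl = np.sum(pi * (np.log(pi) - np.log(pi0)))
    return float(np.dot(pi, advantages) - kl / beta)

base_logits = np.array([1.2, 0.8, 0.5])   # illustrative base-policy logits
advantages = np.array([-0.5, 0.0, 1.0])   # illustrative advantage estimates
beta = 1.0

pi0 = softmax(base_logits)
# Additive logit update: z' = z + beta * A
pi_star = softmax(base_logits + beta * advantages)
j_star = kl_regularized_objective(pi_star, pi0, advantages, beta)

# Compare against many random policies on the same action set
rng = np.random.default_rng(0)
random_policies = rng.dirichlet(np.ones(3), size=1000)
j_best_random = max(
    kl_regularized_objective(p, pi0, advantages, beta) for p in random_policies
)

print(f"J(pi*) = {j_star:.4f}, best random J = {j_best_random:.4f}")
```

None of the sampled policies should score higher than the shifted-logit policy, and J(π*) should equal (1/β)·log Σ π₀(a)·exp(β·A(a)), the log-partition value that the closed-form solution implies.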

Evidence

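WebArena-Lite final success rate: JitRL 60.00% vs. WebRL (fine-tuned) 46.06% vs. SFT 23.0%. An inference-only method overtakes the fine-tuning methods.
Cost: JitRL $290 vs. WebRL ~$9,900, more than 30 times cheaper (WebRL requires 154 hours of training on 16 H200 GPUs).
Jericho Zork1 final score: JitRL 69 vs. EvoTest 54 vs. GRPO (gradient updates) 10. JitRL also clearly beats the gradient-based reinforcement learning baseline.
WebArena Shopping domain: reported +73.2% improvement in success rate over Static (25.0% → 45.83%).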
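
The cost figures above can be sanity-checked with simple arithmetic. This sketch uses only the numbers quoted in this section; the per-GPU-hour rate it derives assumes the WebRL cost consists entirely of GPU time, which the summary does not state.

```python
# Figures quoted in the Evidence section
jitrl_cost_usd = 290.0
webrl_cost_usd = 9900.0   # approximate ("~$9,900")
webrl_gpus = 16           # H200 GPUs
webrl_hours = 154         # training hours

cost_ratio = webrl_cost_usd / jitrl_cost_usd
gpu_hours = webrl_gpus * webrl_hours
# Assumption: the whole WebRL cost is GPU rental
implied_rate = webrl_cost_usd / gpu_hours

print(f"cost ratio: {cost_ratio:.1f}x")               # cost ratio: 34.1x
print(f"WebRL GPU-hours: {gpu_hours}")                # WebRL GPU-hours: 2464
print(f"implied $/GPU-hour: {implied_rate:.2f}")      # implied $/GPU-hour: 4.02
```

The ratio of about 34x is consistent with the "more than 30 times" claim.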

How to Apply

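In environments where an agent performs tasks repeatedly (web automation, customer-support bots, and so on), store each step as a (state, action, discounted_return) triplet; on the next run, retrieve similar situations with Jaccard similarity and adjust the logits accordingly.
With a black-box API that does not expose logprobs, use the 'Verbalized Logit' approach: have the model output a score from 0 to 100 for each action, convert the scores to logits, and apply the same update.
Modifying logits directly is more effective than placing retrieved evidence in the prompt as RAG does: prompt-based methods degrade as the context grows, whereas logit modification is unaffected by context length.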
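
The 'Verbalized Logit' step above can be sketched as follows. The summary does not specify how scores are converted to logits, so the mapping here (normalize the scores into a distribution and take the log) is an assumption for illustration only; `scores_to_logits`, the example scores, and the advantage values are all hypothetical.

```python
import math

def scores_to_logits(scores: dict, eps: float = 1e-6) -> dict:
    """Map self-reported 0-100 scores to pseudo-logits.

    Assumed mapping: clip to [0, 100], normalize to probabilities, take the log.
    eps keeps a score of 0 from producing log(0).
    """
    clipped = {a: min(max(float(s), 0.0), 100.0) + eps for a, s in scores.items()}
    total = sum(clipped.values())
    return {a: math.log(v / total) for a, v in clipped.items()}

def softmax(logits: dict) -> dict:
    """Numerically stable softmax over a dict of logits."""
    m = max(logits.values())
    exps = {a: math.exp(z - m) for a, z in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

# Scores the model might verbalize for each candidate action (made up)
scores = {"click(checkout)": 70, "click(continue_shopping)": 20, "click(apply_coupon)": 10}
pseudo_logits = scores_to_logits(scores)

# Advantage estimates that would come from the experience memory (made up)
advantages = {"click(checkout)": -0.5, "click(continue_shopping)": 0.0, "click(apply_coupon)": 2.5}
beta = 1.0

adjusted = {a: pseudo_logits[a] + beta * advantages[a] for a in scores}
probs = softmax(adjusted)
best = max(probs, key=probs.get)
print(best)
```

With this mapping, applying softmax to the pseudo-logits recovers the normalized scores, so the memory-based shift acts on the verbalized distribution the same way it would act on real logits.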

Code Example

# JitRL core logic (illustrative Python sketch)
import numpy as np

class JitRLMemory:
    def __init__(self):
        self.memory = []  # List of (state, action, G_t)
    
    def jaccard_similarity(self, s1_tokens: set, s2_tokens: set) -> float:
        if not s1_tokens and not s2_tokens:
            return 1.0
        return len(s1_tokens & s2_tokens) / len(s1_tokens | s2_tokens)
    
    def retrieve_neighbors(self, state: str, k: int = 10) -> list:
        state_tokens = set(state.split())
        scored = [
            (self.jaccard_similarity(state_tokens, set(s.split())), s, a, g)
            for s, a, g in self.memory
        ]
        # Rank by similarity only; avoids comparing states/actions on ties
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]
    
    def update_logits(
        self,
        state: str,
        candidate_actions: list,
        base_logits: dict,  # {action: logit}
        beta: float = 1.0,
        k: int = 10
    ) -> dict:
        neighbors = self.retrieve_neighbors(state, k)
        
        # State value (baseline)
        V_s = np.mean([g for _, _, _, g in neighbors]) if neighbors else 0.0
        
        updated_logits = {}
        for action in candidate_actions:
            # Action value from matching neighbors
            matching = [g for _, _, a, g in neighbors if a == action]
            if matching:
                Q_sa = np.mean(matching)
            else:
                # Exploration bonus for unseen actions: a simplified count-based
                # stand-in for UCB that stays bounded when no neighbors exist
                Q_sa = V_s + 5.0 / np.sqrt(len(neighbors) + 1)
            
            advantage = Q_sa - V_s
            base = base_logits.get(action, 0.0)
            updated_logits[action] = base + beta * advantage
        
        return updated_logits
    
    def add_trajectory(self, trajectory: list, gamma: float = 0.9):
        # trajectory: [(state, action, reward), ...]
        T = len(trajectory)
        for t, (s, a, r) in enumerate(trajectory):
            G_t = sum(gamma**(u-t) * trajectory[u][2] for u in range(t, T))
            self.memory.append((s, a, G_t))

# Usage example
memory = JitRLMemory()

# At inference time
state = "page:shopping/cart filter:price_asc"
candidates = ["click(checkout)", "click(continue_shopping)", "click(apply_coupon)"]
base_logits = {"click(checkout)": 1.2, "click(continue_shopping)": 0.8, "click(apply_coupon)": 0.5}

updated = memory.update_logits(state, candidates, base_logits)
best_action = max(updated, key=updated.get)
print(f"Best action: {best_action}")

# Update memory after the episode
traj = [(state, best_action, 1.0)]  # (state, action, reward)
memory.add_trajectory(traj)
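
The usage example above stores a single-step trajectory, so its discounted return equals the raw reward. For a longer episode with a sparse final reward, the same quantity G_t can be computed in one backward pass. The sketch below is a standalone helper (`discounted_returns` is a hypothetical name, not part of the snippet above) using the same default gamma of 0.9.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed backward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Three-step episode where only the last step is rewarded
rewards = [0.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.9)
print(returns)  # approximately [0.81, 0.9, 1.0]
```

Earlier steps receive a smaller share of the final reward, so memory entries closer to the successful outcome carry larger returns.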

Terminology

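advantage function: A value that expresses how much better an action is than the "average" action. Positive means better than average; negative means worse.

logits: The raw scores a model outputs before softmax is applied. The larger the value, the higher the probability of choosing that option.

KL divergence: A measure of how different two probability distributions are. JitRL uses it as a constraint so that the updated policy does not drift too far from the original model's behavior.

non-parametric memory: A way of storing data as-is, without parameters (trainable weights). Experience is accumulated and retrieved much like records in a database.

Jaccard similarity: A measure of how much two sets overlap, computed as (number of shared elements) / (number of elements in the union). Here it compares two states word by word.

catastrophic forgetting: The phenomenon where a model trained on new data abruptly forgets what it learned before, similar to a person forgetting their native language while learning a new one.

GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm from DeepSeek. It simplifies PPO so that the policy can be updated without a value network.

WebArena: A benchmark that simulates real websites (a shopping site, GitLab, Reddit, and others) to evaluate how well AI agents perform web tasks.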

Related Resources

https://github.com/liushiliushi/JitRL

Original Abstract

While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods. Crucially, JitRL outperforms the performance of computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.