Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates
TL;DR Highlight
Deployed LLM agents perform reinforcement learning at inference time through experience memory alone, with no gradient updates, outperforming fine-tuned WebRL at over 30x lower cost
Who Should Read
AI engineers running LLM agents on repetitive tasks (web automation, game playing, RPA) who want to improve performance without retraining the model. Especially relevant for developers who want agents to learn from their own experience while keeping API costs low.
Core Mechanics
- Stores past experiences as <state, action, reward> triplets in a memory bank and retrieves similar situations at inference time to estimate each action's advantage (how much better it is than the average action in that state)
- Adds the advantage estimates directly to the LLM's output logits to update the policy; model parameters are never touched
- The paper proves that this additive logit update is the exact closed-form solution to the KL-constrained policy optimization objective
- Applies a UCB-style exploration bonus (uncertainty-based) to unseen actions to balance exploration and exploitation automatically, with the bonus shrinking as memory accumulates
- Backbone-agnostic: works across Gemini-2.5-flash, GPT-5-mini, DeepSeek-V3.2, etc., and transfers to tasks never seen before
- After each episode, an LLM evaluator automatically assigns per-step rewards, addressing the credit assignment problem (determining which action contributed to success)
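The closed-form claim above follows from standard KL-regularized policy optimization; a sketch of the derivation, with beta as the KL trade-off coefficient:

```latex
% Maximize expected advantage, penalized by KL divergence from the
% frozen reference policy \pi_{\mathrm{ref}} (the base LLM):
\max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A(s,a)\big]
  \;-\; \tfrac{1}{\beta}\,\mathrm{KL}\big(\pi(\cdot \mid s)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\big)
% The closed-form maximizer:
\pi^{*}(a \mid s) \;\propto\; \pi_{\mathrm{ref}}(a \mid s)\,\exp\big(\beta\,A(s,a)\big)
% Taking logs: \log\pi^{*} = \log\pi_{\mathrm{ref}} + \beta\,A(s,a) + \text{const},
% i.e. add \beta \cdot A(s,a) to the base logits; softmax absorbs the constant.
```

This is why the update in the code example below is simply `base + beta * advantage`: the softmax normalizer absorbs any additive constant, so shifting logits by the scaled advantage realizes the KL-constrained optimum exactly.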
Evidence
- WebArena-Lite final success rate: JitRL 60.00% vs. fine-tuned WebRL 46.06% vs. SFT 23.0%; the inference-only method surpasses fine-tuning
- Cost: JitRL $290 vs. WebRL ~$9,900, over 30x cheaper (WebRL requires 154 hours on 16 H200 GPUs)
- Jericho Zork1 final score: JitRL 69 vs. EvoTest 54 vs. GRPO (gradient updates) 10, beating even gradient-based RL methods
- WebArena Shopping domain: +73.2% relative success-rate improvement over the static agent (25.0% → 45.83%)
How to Apply
- In environments where agents repeatedly perform tasks (web automation, customer-support bots), store each step as a (state, action, discounted_return) triplet; on the next execution, retrieve similar situations via Jaccard similarity and correct the logits
- For black-box APIs without logprob access, use the 'Verbalized Logit' approach: have the model output 0-100 scores directly, convert them to logits, and apply the same update
- Directly modifying logits is more effective than injecting retrieval evidence into the prompt, RAG-style: prompt-based methods degrade as context grows longer, while logit modification is unaffected by context length
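The score-to-logit conversion for black-box APIs can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the temperature-based mapping, the helper names, and the example scores are all assumptions.

```python
# Sketch of the "verbalized logit" idea for black-box APIs: the model is
# asked to score each candidate action 0-100 (no logprobs needed), scores
# are mapped to pseudo-logits, and the same additive advantage update is
# applied. Temperature and helper names are illustrative assumptions.

def verbalized_scores_to_logits(scores: dict, temperature: float = 20.0) -> dict:
    # Divide by a temperature so the induced softmax is not overly peaked.
    return {action: score / temperature for action, score in scores.items()}

def apply_advantage(logits: dict, advantages: dict, beta: float = 1.0) -> dict:
    # Same additive rule as with true logprobs: logit += beta * advantage.
    return {a: logit + beta * advantages.get(a, 0.0) for a, logit in logits.items()}

# Assumed model output for two candidate actions:
scores = {"click(checkout)": 80, "click(apply_coupon)": 55}
logits = verbalized_scores_to_logits(scores)
# A positive memory-derived advantage for the coupon action can flip the choice:
updated = apply_advantage(logits, {"click(apply_coupon)": 1.5})
best = max(updated, key=updated.get)
```

Here the raw scores favor checkout (pseudo-logit 4.0 vs. 2.75), but after adding the advantage the coupon action wins (4.25), showing how memory can override the base policy.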
Code Example
# JitRL Core Logic Snippet (Python Pseudocode)
import numpy as np

class JitRLMemory:
    def __init__(self):
        self.memory = []  # list of (state, action, G_t) triplets

    def jaccard_similarity(self, s1_tokens: set, s2_tokens: set) -> float:
        if not s1_tokens and not s2_tokens:
            return 1.0
        return len(s1_tokens & s2_tokens) / len(s1_tokens | s2_tokens)

    def retrieve_neighbors(self, state: str, k: int = 10) -> list:
        state_tokens = set(state.split())
        scored = [
            (self.jaccard_similarity(state_tokens, set(s.split())), s, a, g)
            for s, a, g in self.memory
        ]
        scored.sort(key=lambda x: x[0], reverse=True)  # rank by similarity only
        return scored[:k]

    def update_logits(
        self,
        state: str,
        candidate_actions: list,
        base_logits: dict,  # {action: logit}
        beta: float = 1.0,
        k: int = 10,
    ) -> dict:
        neighbors = self.retrieve_neighbors(state, k)
        # State value (baseline): mean return across retrieved neighbors
        V_s = np.mean([g for _, _, _, g in neighbors]) if neighbors else 0.0
        updated_logits = {}
        for action in candidate_actions:
            # Action value from neighbors that took the same action
            matching = [g for _, _, a, g in neighbors if a == action]
            if matching:
                Q_sa = np.mean(matching)
            else:
                # UCB-style exploration bonus for unseen actions;
                # shrinks as more neighbors accumulate
                Q_sa = V_s + 5.0 / (len(neighbors) + 1.0)
            advantage = Q_sa - V_s
            base = base_logits.get(action, 0.0)
            updated_logits[action] = base + beta * advantage
        return updated_logits

    def add_trajectory(self, trajectory: list, gamma: float = 0.9):
        # trajectory: [(state, action, reward), ...]
        T = len(trajectory)
        for t, (s, a, r) in enumerate(trajectory):
            # Discounted return from step t onward
            G_t = sum(gamma ** (u - t) * trajectory[u][2] for u in range(t, T))
            self.memory.append((s, a, G_t))

# Usage example
memory = JitRLMemory()

# During inference
state = "page:shopping/cart filter:price_asc"
candidates = ["click(checkout)", "click(continue_shopping)", "click(apply_coupon)"]
base_logits = {"click(checkout)": 1.2, "click(continue_shopping)": 0.8, "click(apply_coupon)": 0.5}
updated = memory.update_logits(state, candidates, base_logits)
best_action = max(updated, key=updated.get)
print(f"Best action: {best_action}")

# Update memory after the episode
traj = [(state, best_action, 1.0)]  # (state, action, reward)
memory.add_trajectory(traj)

Terminology
Related Resources
Original Abstract
While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods. Crucially, JitRL outperforms the performance of computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.