Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates
TL;DR Highlight
Deployed LLM agents perform reinforcement learning at inference time through experience memory alone, with no gradient updates, outperforming fine-tuned WebRL at over 30x lower cost
Who Should Read
AI engineers running LLM agents on repetitive tasks (web automation, game playing, RPA) who want to improve performance without retraining the model. Especially relevant for developers who want agents to learn from their own experience while keeping API costs low.
Core Mechanics
- Stores past experiences as <state, action, reward> triplets in a memory bank and retrieves similar situations at inference time to estimate each action's advantage (how much better it is than the average action in that state)
- Adds the advantage estimates directly to the LLM's output logits to update the policy; model parameters are never touched
- The paper proves that this additive logit update is the exact closed-form solution to the KL-constrained policy optimization objective
- Applies a UCB-style exploration bonus (uncertainty-based) to unseen actions to balance exploration and exploitation automatically, with the bonus shrinking as memory accumulates
- Backbone-agnostic: works across Gemini-2.5-flash, GPT-5-mini, DeepSeek-V3.2, etc., and transfers to tasks never seen before
- After each episode, an LLM evaluator automatically assigns per-step rewards, addressing the credit assignment problem (determining which action contributed to success)
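The closed-form claim above follows from standard KL-regularized policy optimization; a sketch of the derivation, with beta as the KL trade-off coefficient:

```latex
% Maximize expected advantage, penalized by KL divergence from the
% frozen reference policy \pi_{\mathrm{ref}} (the base LLM):
\max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A(s,a)\big]
  \;-\; \tfrac{1}{\beta}\,\mathrm{KL}\big(\pi(\cdot \mid s)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\big)
% The closed-form maximizer:
\pi^{*}(a \mid s) \;\propto\; \pi_{\mathrm{ref}}(a \mid s)\,\exp\big(\beta\,A(s,a)\big)
% Taking logs: \log\pi^{*} = \log\pi_{\mathrm{ref}} + \beta\,A(s,a) + \text{const},
% i.e. add \beta \cdot A(s,a) to the base logits; softmax absorbs the constant.
```

This is why the update in the code example below is simply `base + beta * advantage`: the softmax normalizer absorbs any additive constant, so shifting logits by the scaled advantage realizes the KL-constrained optimum exactly.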
Evidence
- WebArena-Lite final success rate: JitRL 60.00% vs. fine-tuned WebRL 46.06% vs. SFT 23.0%; the inference-only method surpasses fine-tuning
- Cost: JitRL $290 vs. WebRL ~$9,900, over 30x cheaper (WebRL requires 154 hours on 16 H200 GPUs)
- Jericho Zork1 final score: JitRL 69 vs. EvoTest 54 vs. GRPO (gradient updates) 10, beating even gradient-based RL methods
- WebArena Shopping domain: +73.2% relative success-rate improvement over the static agent (25.0% → 45.83%)
How to Apply
- In environments where agents repeatedly perform tasks (web automation, customer-support bots), store each step as a (state, action, discounted_return) triplet; on the next execution, retrieve similar situations via Jaccard similarity and correct the logits
- For black-box APIs without logprob access, use the 'Verbalized Logit' approach: have the model output 0-100 scores directly, convert them to logits, and apply the same update
- Directly modifying logits is more effective than injecting retrieval evidence into the prompt, RAG-style: prompt-based methods degrade as context grows longer, while logit modification is unaffected by context length
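The score-to-logit conversion for black-box APIs can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the temperature-based mapping, the helper names, and the example scores are all assumptions.

```python
# Sketch of the "verbalized logit" idea for black-box APIs: the model is
# asked to score each candidate action 0-100 (no logprobs needed), scores
# are mapped to pseudo-logits, and the same additive advantage update is
# applied. Temperature and helper names are illustrative assumptions.

def verbalized_scores_to_logits(scores: dict, temperature: float = 20.0) -> dict:
    # Divide by a temperature so the induced softmax is not overly peaked.
    return {action: score / temperature for action, score in scores.items()}

def apply_advantage(logits: dict, advantages: dict, beta: float = 1.0) -> dict:
    # Same additive rule as with true logprobs: logit += beta * advantage.
    return {a: logit + beta * advantages.get(a, 0.0) for a, logit in logits.items()}

# Assumed model output for two candidate actions:
scores = {"click(checkout)": 80, "click(apply_coupon)": 55}
logits = verbalized_scores_to_logits(scores)
# A positive memory-derived advantage for the coupon action can flip the choice:
updated = apply_advantage(logits, {"click(apply_coupon)": 1.5})
best = max(updated, key=updated.get)
```

Here the raw scores favor checkout (pseudo-logit 4.0 vs. 2.75), but after adding the advantage the coupon action wins (4.25), showing how memory can override the base policy.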
Code Example
# JitRL Core Logic Snippet (Python Pseudocode)
import numpy as np

class JitRLMemory:
    def __init__(self):
        self.memory = []  # list of (state, action, G_t) triplets

    def jaccard_similarity(self, s1_tokens: set, s2_tokens: set) -> float:
        if not s1_tokens and not s2_tokens:
            return 1.0
        return len(s1_tokens & s2_tokens) / len(s1_tokens | s2_tokens)

    def retrieve_neighbors(self, state: str, k: int = 10) -> list:
        state_tokens = set(state.split())
        scored = [
            (self.jaccard_similarity(state_tokens, set(s.split())), s, a, g)
            for s, a, g in self.memory
        ]
        scored.sort(key=lambda x: x[0], reverse=True)  # rank by similarity only
        return scored[:k]

    def update_logits(
        self,
        state: str,
        candidate_actions: list,
        base_logits: dict,  # {action: logit}
        beta: float = 1.0,
        k: int = 10,
    ) -> dict:
        neighbors = self.retrieve_neighbors(state, k)
        # State value (baseline): mean return across retrieved neighbors
        V_s = np.mean([g for _, _, _, g in neighbors]) if neighbors else 0.0
        updated_logits = {}
        for action in candidate_actions:
            # Action value from neighbors that took the same action
            matching = [g for _, _, a, g in neighbors if a == action]
            if matching:
                Q_sa = np.mean(matching)
            else:
                # UCB-style exploration bonus for unseen actions;
                # shrinks as more neighbors accumulate
                Q_sa = V_s + 5.0 / (len(neighbors) + 1.0)
            advantage = Q_sa - V_s
            base = base_logits.get(action, 0.0)
            updated_logits[action] = base + beta * advantage
        return updated_logits

    def add_trajectory(self, trajectory: list, gamma: float = 0.9):
        # trajectory: [(state, action, reward), ...]
        T = len(trajectory)
        for t, (s, a, r) in enumerate(trajectory):
            # Discounted return from step t onward
            G_t = sum(gamma ** (u - t) * trajectory[u][2] for u in range(t, T))
            self.memory.append((s, a, G_t))

# Usage example
memory = JitRLMemory()

# During inference
state = "page:shopping/cart filter:price_asc"
candidates = ["click(checkout)", "click(continue_shopping)", "click(apply_coupon)"]
base_logits = {"click(checkout)": 1.2, "click(continue_shopping)": 0.8, "click(apply_coupon)": 0.5}
updated = memory.update_logits(state, candidates, base_logits)
best_action = max(updated, key=updated.get)
print(f"Best action: {best_action}")

# Update memory after the episode
traj = [(state, best_action, 1.0)]  # (state, action, reward)
memory.add_trajectory(traj)

Terminology
Related Resources
Original Abstract
While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods. Crucially, JitRL outperforms the performance of computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.