LLM-as-RNN: A Recurrent Language Model for Memory Updates and Sequence Prediction
TL;DR Highlight
An inference-only framework that turns a frozen LLM into an RNN-like predictor: it rewrites a natural-language memory at each step, with no parameter modification, to improve long-sequence prediction accuracy
Who Should Read
AI engineers using LLMs for time-series prediction (medical/financial/weather) or long conversation context management. Particularly useful for developers experiencing 'lost-in-the-middle' or error cascading issues with context accumulation.
Core Mechanics
- Existing LLMs use append-only context that can't fix past mistakes, but LLM-as-RNN overwrites the system prompt summary based on feedback at each step to correct errors
- Operates in a 3-step loop: (1) generate prediction from previous memory + current input → (2) compare with ground truth to generate natural-language feedback → (3) rewrite memory incorporating feedback
- Outperforms Full History Concatenation (FHC) which dumps entire history — well-curated compressed memory beats having more information
- A small model (Llama-3.2-3B + LLM-as-RNN) beats a 70B model using FHC — the memory update loop partially compensates for model size
- Can also use LLM-as-a-Judge for self-feedback without ground truth (slightly lower performance but much better than zero-shot)
- Memory is in natural language so humans can read it, making it easy to trace and audit why the model made certain predictions
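Read as a recurrence, the loop above maps onto classical RNN notation. A rough formalization (the symbols below are ours, not taken verbatim from the paper):

```latex
% Classical RNN: hidden state updated by a learned function
%   h_t = f_\theta(h_{t-1}, x_t)
% LLM-as-RNN: the "hidden state" m_t is natural-language memory,
% updated by prompting a frozen LLM rather than by gradient steps:
\begin{align*}
\hat{y}_t &= \mathrm{LLM}(m_{t-1}, x_t)                  && \text{(predict)} \\
f_t       &= \mathrm{Feedback}(\hat{y}_t, y_t)           && \text{(reflect)} \\
m_t       &= \mathrm{LLM}(m_{t-1}, x_t, \hat{y}_t, f_t)  && \text{(rewrite memory)}
\end{align*}
```

In the unsupervised variant, the ground truth $y_t$ is replaced by an LLM-as-a-Judge critique of $\hat{y}_t$.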
Evidence
- MIMIC-IV clinical dataset Acc@1: 12.6%p improvement over best baseline MemPrompt (0.5175 → 0.6434, Gemma-3-27B)
- S&P 500 financial prediction MSE: 6.6% improvement over MemPrompt (4.090 → 3.821, GPT-oss-120B)
- Llama-3.1-70B + FHC (Acc@1: 0.4126) beaten by Llama-3.2-3B + LLM-as-RNN (Acc@1: 0.4545) — memory structure reverses 23x parameter gap
- After an error, the model self-corrects at the next step with 54.8% probability; in the remaining 45.2% of cases the error persists
How to Apply
- In long conversations or repetitive task agents, replace full history context with a 'compressed memory' section in the system prompt that gets overwritten with feedback each turn
- For supervised scenarios (format checking, test pass/fail), use ground truth feedback; for unsupervised cases, construct a self-critique loop using LLM-as-a-Judge
- Context budget λ of ~4096 tokens is the performance/cost sweet spot — increasing to 8192 shows diminishing returns, so 4096 is a reasonable default
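The λ budget above is a cap on the memory's token count. A naive sketch of enforcing it, using whitespace splitting as a crude stand-in for the serving model's tokenizer (function name and truncation policy are ours, not the paper's):

```python
def truncate_memory(memory: str, budget: int = 4096) -> str:
    """Keep the natural-language memory within a token budget.

    Whitespace splitting approximates token counting; a real system
    would use the serving model's tokenizer. The newest material is
    assumed to sit at the end of the memory, so we drop from the front.
    """
    tokens = memory.split()
    if len(tokens) <= budget:
        return memory
    return " ".join(tokens[-budget:])
```

In practice the memory-update prompt itself (e.g. "Within 200 words") usually keeps the memory well under budget, so truncation is a safety net rather than the primary control.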
Code Example
```python
# LLM-as-RNN core loop (sketch; `llm.generate` and `sequence` are assumed
# stand-ins for a chat-model client and the input stream, not a real API)
system_memory = ""  # h0: initial memory state
for observation, ground_truth in sequence:
    # Phase 1: Contextualization — predict using memory + current input
    prediction = llm.generate(
        system=(
            "You are a sequential prediction AI.\n"
            "[Learned Memory]\n"
            f"{system_memory}"
        ),
        user=f"Current observation: {observation}\nMake a prediction.",
    )

    # Phase 2: Reflection — generate feedback
    if ground_truth is not None:  # supervised mode
        feedback = (
            f"Error: prediction={prediction}, answer={ground_truth}. "
            "Analyze what was missed."
        )
    else:  # open-ended mode (LLM-as-a-Judge; `quality_criteria` assumed defined)
        feedback = llm.generate(
            user=(
                f"Prediction: {prediction}\n"
                f"Quality criteria: {quality_criteria}\n"
                "Evaluate critically."
            )
        )

    # Phase 3: Memory Update — rewrite memory reflecting feedback
    system_memory = llm.generate(
        user=(
            f"Current memory:\n{system_memory}\n"
            f"This step summary: {observation} → {prediction}\n"
            f"Feedback: {feedback}\n"
            "Update the memory reflecting the above.\n"
            "Rules: correct wrong patterns, reinforce correct patterns. "
            "Within 200 words."
        )
    )
```
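To exercise the loop without a real model endpoint, the `llm` interface can be stubbed. A minimal mock (all names below are hypothetical, for illustration only) that runs one recurrence step through all three phases:

```python
class MockLLM:
    """Stand-in for a chat-model client; returns canned strings."""

    def generate(self, user, system=""):
        if "Make a prediction" in user:
            return "predicted: up"
        if "Update the memory" in user:
            return "Pattern: price tends to rise after dips."
        return "Judge: prediction lacks justification."


def run_step(llm, memory, observation, ground_truth=None):
    """One recurrence step: predict, reflect, rewrite memory."""
    # Phase 1: predict from memory + current input
    prediction = llm.generate(
        system=f"You are a sequential prediction AI.\n[Learned Memory]\n{memory}",
        user=f"Current observation: {observation}\nMake a prediction.",
    )
    # Phase 2: feedback (supervised if ground truth given, else self-critique)
    if ground_truth is not None:
        feedback = f"Error: prediction={prediction}, answer={ground_truth}."
    else:
        feedback = llm.generate(user=f"Prediction: {prediction}\nEvaluate critically.")
    # Phase 3: rewrite the memory incorporating the feedback
    memory = llm.generate(
        user=(
            f"Current memory:\n{memory}\n"
            f"This step summary: {observation} → {prediction}\n"
            f"Feedback: {feedback}\n"
            "Update the memory reflecting the above. Within 200 words."
        )
    )
    return prediction, memory
```

Usage: `run_step(MockLLM(), "", "price dipped", ground_truth="up")` returns the prediction and the rewritten memory, which then seeds the next step.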
Original Abstract
Large language models are strong sequence predictors, yet standard inference relies on immutable context histories. After making an error at generation step t, the model lacks an updatable memory mechanism that improves predictions for step t+1. We propose LLM-as-RNN, an inference-only framework that turns a frozen LLM into a recurrent predictor by representing its hidden state as natural-language memory. This state, implemented as a structured system-prompt summary, is updated at each timestep via feedback-driven text rewrites, enabling learning without parameter updates. Under a fixed token budget, LLM-as-RNN corrects errors and retains task-relevant patterns, effectively performing online learning through language. We evaluate the method on three sequential benchmarks in healthcare, meteorology, and finance across Llama, Gemma, and GPT model families. LLM-as-RNN significantly outperforms zero-shot, full-history, and MemPrompt baselines, improving predictive accuracy by 6.5% on average, while producing interpretable, human-readable learning traces absent in standard context accumulation.