LLM-as-RNN: A Recurrent Language Model for Memory Updates and Sequence Prediction
TL;DR Highlight
An inference-only framework that turns a frozen LLM into an RNN-like predictor: it rewrites a natural-language memory at each step, with no parameter modification, to improve long-sequence prediction accuracy
Who Should Read
AI engineers using LLMs for time-series prediction (medical/financial/weather) or long conversation context management. Particularly useful for developers experiencing 'lost-in-the-middle' or error cascading issues with context accumulation.
Core Mechanics
- Existing LLMs use append-only context that can't fix past mistakes, but LLM-as-RNN overwrites the system prompt summary based on feedback at each step to correct errors
- Operates in a 3-step loop: (1) generate prediction from previous memory + current input → (2) compare with ground truth to generate natural-language feedback → (3) rewrite memory incorporating feedback
- Outperforms Full History Concatenation (FHC) which dumps entire history — well-curated compressed memory beats having more information
- A small model (Llama-3.2-3B + LLM-as-RNN) beats a 70B model using FHC — the memory update loop partially compensates for model size
- Can also use LLM-as-a-Judge for self-feedback without ground truth (slightly lower performance but much better than zero-shot)
- Memory is in natural language so humans can read it, making it easy to trace and audit why the model made certain predictions
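Read as a recurrence, the loop above maps onto classical RNN notation. A rough formalization (the symbols below are ours, not taken verbatim from the paper):

```latex
% Classical RNN: hidden state updated by a learned function
%   h_t = f_\theta(h_{t-1}, x_t)
% LLM-as-RNN: the "hidden state" m_t is natural-language memory,
% updated by prompting a frozen LLM rather than by gradient steps:
\begin{align*}
\hat{y}_t &= \mathrm{LLM}(m_{t-1}, x_t)                  && \text{(predict)} \\
f_t       &= \mathrm{Feedback}(\hat{y}_t, y_t)           && \text{(reflect)} \\
m_t       &= \mathrm{LLM}(m_{t-1}, x_t, \hat{y}_t, f_t)  && \text{(rewrite memory)}
\end{align*}
```

In the unsupervised variant, the ground truth $y_t$ is replaced by an LLM-as-a-Judge critique of $\hat{y}_t$.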
Evidence
- MIMIC-IV clinical dataset Acc@1: 12.6%p improvement over best baseline MemPrompt (0.5175 → 0.6434, Gemma-3-27B)
- S&P 500 financial prediction MSE: 6.6% improvement over MemPrompt (4.090 → 3.821, GPT-oss-120B)
- Llama-3.1-70B + FHC (Acc@1: 0.4126) beaten by Llama-3.2-3B + LLM-as-RNN (Acc@1: 0.4545) — memory structure reverses 23x parameter gap
- After an error, the model self-corrects at the next step with 54.8% probability; in the remaining 45.2% of cases the error persists
How to Apply
- In long conversations or repetitive task agents, replace full history context with a 'compressed memory' section in the system prompt that gets overwritten with feedback each turn
- For supervised scenarios (format checking, test pass/fail), use ground truth feedback; for unsupervised cases, construct a self-critique loop using LLM-as-a-Judge
- Context budget λ of ~4096 tokens is the performance/cost sweet spot — increasing to 8192 shows diminishing returns, so 4096 is a reasonable default
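The λ budget above is a cap on the memory's token count. A naive sketch of enforcing it, using whitespace splitting as a crude stand-in for the serving model's tokenizer (function name and truncation policy are ours, not the paper's):

```python
def truncate_memory(memory: str, budget: int = 4096) -> str:
    """Keep the natural-language memory within a token budget.

    Whitespace splitting approximates token counting; a real system
    would use the serving model's tokenizer. The newest material is
    assumed to sit at the end of the memory, so we drop from the front.
    """
    tokens = memory.split()
    if len(tokens) <= budget:
        return memory
    return " ".join(tokens[-budget:])
```

In practice the memory-update prompt itself (e.g. "Within 200 words") usually keeps the memory well under budget, so truncation is a safety net rather than the primary control.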
Code Example
```python
# LLM-as-RNN core loop (sketch; `llm.generate` and `sequence` are assumed
# stand-ins for a chat-model client and the input stream, not a real API)
system_memory = ""  # h0: initial memory state
for observation, ground_truth in sequence:
    # Phase 1: Contextualization — predict using memory + current input
    prediction = llm.generate(
        system=(
            "You are a sequential prediction AI.\n"
            "[Learned Memory]\n"
            f"{system_memory}"
        ),
        user=f"Current observation: {observation}\nMake a prediction.",
    )

    # Phase 2: Reflection — generate feedback
    if ground_truth is not None:  # supervised mode
        feedback = (
            f"Error: prediction={prediction}, answer={ground_truth}. "
            "Analyze what was missed."
        )
    else:  # open-ended mode (LLM-as-a-Judge; `quality_criteria` assumed defined)
        feedback = llm.generate(
            user=(
                f"Prediction: {prediction}\n"
                f"Quality criteria: {quality_criteria}\n"
                "Evaluate critically."
            )
        )

    # Phase 3: Memory Update — rewrite memory reflecting feedback
    system_memory = llm.generate(
        user=(
            f"Current memory:\n{system_memory}\n"
            f"This step summary: {observation} → {prediction}\n"
            f"Feedback: {feedback}\n"
            "Update the memory reflecting the above.\n"
            "Rules: correct wrong patterns, reinforce correct patterns. "
            "Within 200 words."
        )
    )
```
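To exercise the loop without a real model endpoint, the `llm` interface can be stubbed. A minimal mock (all names below are hypothetical, for illustration only) that runs one recurrence step through all three phases:

```python
class MockLLM:
    """Stand-in for a chat-model client; returns canned strings."""

    def generate(self, user, system=""):
        if "Make a prediction" in user:
            return "predicted: up"
        if "Update the memory" in user:
            return "Pattern: price tends to rise after dips."
        return "Judge: prediction lacks justification."


def run_step(llm, memory, observation, ground_truth=None):
    """One recurrence step: predict, reflect, rewrite memory."""
    # Phase 1: predict from memory + current input
    prediction = llm.generate(
        system=f"You are a sequential prediction AI.\n[Learned Memory]\n{memory}",
        user=f"Current observation: {observation}\nMake a prediction.",
    )
    # Phase 2: feedback (supervised if ground truth given, else self-critique)
    if ground_truth is not None:
        feedback = f"Error: prediction={prediction}, answer={ground_truth}."
    else:
        feedback = llm.generate(user=f"Prediction: {prediction}\nEvaluate critically.")
    # Phase 3: rewrite the memory incorporating the feedback
    memory = llm.generate(
        user=(
            f"Current memory:\n{memory}\n"
            f"This step summary: {observation} → {prediction}\n"
            f"Feedback: {feedback}\n"
            "Update the memory reflecting the above. Within 200 words."
        )
    )
    return prediction, memory
```

Usage: `run_step(MockLLM(), "", "price dipped", ground_truth="up")` returns the prediction and the rewritten memory, which then seeds the next step.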
Original Abstract
Large language models are strong sequence predictors, yet standard inference relies on immutable context histories. After making an error at generation step t, the model lacks an updatable memory mechanism that improves predictions for step t+1. We propose LLM-as-RNN, an inference-only framework that turns a frozen LLM into a recurrent predictor by representing its hidden state as natural-language memory. This state, implemented as a structured system-prompt summary, is updated at each timestep via feedback-driven text rewrites, enabling learning without parameter updates. Under a fixed token budget, LLM-as-RNN corrects errors and retains task-relevant patterns, effectively performing online learning through language. We evaluate the method on three sequential benchmarks in healthcare, meteorology, and finance across Llama, Gemma, and GPT model families. LLM-as-RNN significantly outperforms zero-shot, full-history, and MemPrompt baselines, improving predictive accuracy by 6.5% on average, while producing interpretable, human-readable learning traces absent in standard context accumulation.