Google's Titans architecture: giving AI long-term memory
TL;DR Highlight
Google's new architecture handles 2M-token contexts without the Transformer's quadratic complexity, using learned 'Titans' memory blocks in place of full attention.
Who Should Read
ML researchers working on long-context models, and engineers building RAG or document processing systems who care about context length scaling.
Core Mechanics
- The architecture replaces standard attention with 'Titans' — memory units that store compressed representations of past context and can be retrieved efficiently, avoiding the quadratic cost of full attention over 2M tokens.
- Two types of memory: short-term (sliding window attention for recent tokens) and long-term (Titans persistent memory blocks trained to compress and retrieve key information).
- The long-term memory is a small neural network that learns to compress context — it's trained end-to-end with the main model, not a fixed external retrieval system.
- On benchmarks requiring long-range recall (e.g., needle-in-haystack at 2M tokens), performance is competitive with or better than full attention models at much lower compute cost.
- The architecture has implications for very long document understanding, multi-session memory, and any task where full-context attention is currently prohibitively expensive.
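To make the efficiency claim concrete, here is a back-of-envelope comparison of attention-score FLOPs for full attention versus a sliding window plus a fixed-cost memory read. The window width and head dimension below are illustrative assumptions, not figures from the paper:

```python
# Attention-score cost scales as n^2 * d for full attention,
# but only n * w * d for a sliding window of width w
# (the Titans long-term memory read adds a fixed per-token cost on top).
def full_attention_flops(n: int, d: int) -> int:
    return n * n * d

def windowed_flops(n: int, d: int, w: int) -> int:
    return n * w * d

n, d, w = 2_000_000, 64, 4096  # illustrative values: 2M tokens, head dim 64, 4K window
ratio = full_attention_flops(n, d) / windowed_flops(n, d, w)
print(f"full / windowed ~ {ratio:.0f}x")  # ratio is simply n / w
```

The ratio reduces to n / w, so at 2M tokens a 4K window is roughly 500x cheaper on the attention scores alone, which is why full attention becomes prohibitive long before learned-memory approaches do.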
Evidence
- Google published benchmark results showing the Titans architecture matches or beats standard Transformers on long-context tasks while being significantly more efficient.
- HN commenters noted this is in the lineage of SSM/Mamba-style architectures but with a different approach to learned memory; some expressed skepticism about whether the benchmark gains hold in practice.
- Several researchers flagged that 'compressing' context into fixed-size memory inherently loses information — the question is whether the compression is smart enough for real tasks.
- The approach was compared favorably to RAG for some use cases, since the memory is learned end-to-end rather than relying on a separate retrieval index.
How to Apply
- If you're hitting context window limits on document processing tasks, watch this architecture closely — it could enable processing entire codebases or document repositories in a single forward pass.
- For multi-turn agents that need long memory without growing context costs, Titans-style architectures may be worth experimenting with as they become available in open-source form.
- RAG architects should benchmark against learned memory approaches for tasks where chunking and retrieval introduce too much information loss.
Code Example
```python
# Core logic of the surprise-based memory update in the Titans architecture (conceptual code)
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    def __init__(self, dim, memory_lr=0.01):
        super().__init__()
        # Long-term memory = small MLP acting as a key-value associative memory
        self.memory_mlp = nn.Sequential(
            nn.Linear(dim, dim * 2),
            nn.SiLU(),
            nn.Linear(dim * 2, dim),
        )
        self.memory_lr = memory_lr  # memory update rate

    def compute_surprise(self, query, target):
        """Surprise = magnitude of the prediction error (gradient norm)."""
        pred = self.memory_mlp(query)
        loss = nn.functional.mse_loss(pred, target)
        grads = torch.autograd.grad(loss, self.memory_mlp.parameters())
        surprise = sum(g.norm() for g in grads)  # more surprising -> stronger memorization
        return surprise, grads

    def update_memory(self, key, value, forget_gate):
        """Write surprising information into memory (test-time update)."""
        surprise, grads = self.compute_surprise(key, value)
        # Forgetting gate: scales how strongly new information overwrites old memories
        effective_lr = self.memory_lr * surprise.item() * forget_gate
        with torch.no_grad():
            for param, grad in zip(self.memory_mlp.parameters(), grads):
                param -= effective_lr * grad

    def recall(self, query):
        """Retrieve information from long-term memory using a query."""
        return self.memory_mlp(query)

# Example usage with the MAC (Memory as Context) variant:
# memory_output = neural_memory.recall(current_query)
# context = torch.cat([memory_output, short_term_kv_cache], dim=1)
# output = attention(query, context)  # integrates long-term + short-term memory
```
Terminology
Titans memory: Google's architecture in which small learned neural modules compress and store long-range context, replacing the need for full attention over the entire sequence.
Quadratic complexity: The computational cost of standard Transformer attention scales with the square of sequence length; doubling the context length quadruples the compute.
Sliding window attention: An attention variant where each token attends only to a fixed-size window of recent tokens, giving linear cost but losing long-range context.
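The sliding-window idea above is easy to see as a mask. This is a minimal sketch of the masking pattern, not code from the Titans paper; the window size is an arbitrary illustrative choice:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: token i may attend to tokens j with i - window < j <= i
    (itself plus its window-1 most recent predecessors)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, row vector
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)
# Each row has at most 3 True entries, so per-token attention cost is O(window),
# and total cost is linear in sequence length instead of quadratic.
```

In a Titans-style setup, everything outside this window is no longer visible to attention, which is exactly the gap the long-term memory module is meant to fill.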