RAG
Latest 60 papers on RAG.
Show HN: A Karpathy-style LLM wiki your agents maintain (Markdown and Git)
WUPHF builds a shared knowledge base using a Git-based Markdown Wiki, enabling multiple AI agents—including Claude and Codex—to autonomously divide and execute tasks.
Different Language Models Learn Similar Number Representations
LLMs, regardless of architecture—from Transformers to LSTMs—consistently learn periodic patterns with periods T=2, 5, and 10 when representing numbers, mathematically explaining a 'convergent evolution' phenomenon beyond model architecture.
Show HN: Atomic – Local-first, AI-augmented personal knowledge base
Atomic builds a self-hosted, open-source personal knowledge graph app that automatically embeds, tags, and links notes, web clips, and RSS feeds—supporting semantic search, LLM-powered wiki synthesis, and MCP integration.
Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
In forecasting agents, sequentially updating a Bayesian belief state expressed in natural language improves predictions by a larger margin than adding web search does.
GAIA – Open-source framework for building AI agents that run on local hardware
AMD has released GAIA, a Python/C++ framework that lets AI agents run on local PCs without the cloud. The approach addresses privacy and latency concerns, but has drawn criticism over the practical limitations of the ROCm ecosystem.
Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks
A method that improves accuracy by having an aggregator agent directly examine and synthesize the results gathered in parallel by multiple AI agents, rather than taking a simple vote.
Dynamic Context Evolution for Scalable Synthetic Data Generation
A framework that completely eliminates duplication and repetition in large-scale synthetic data generation with LLMs using three mechanisms (VTS + Semantic Memory + Adaptive Prompt).
Show HN: We fingerprinted 178 AI models' writing styles and similarity clusters
This study measured the writing-style similarity of 178 AI models across 32 dimensions and found that even models with large price differences showed writing-pattern similarity above 78%.
Show HN: Hippo, biologically inspired memory for AI agents
Hippo is an open-source memory layer that allows you to share memories across sessions between various AI agent tools such as Claude Code, Cursor, and Codex. It implements the brain's mechanisms of memory decay, retrieval strengthening, and consolidation in code.
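A minimal sketch of what such a decay-plus-reinforcement loop could look like (the half-life, boost factor, and consolidation threshold below are illustrative assumptions, not Hippo's actual parameters):

```python
import time

class MemoryItem:
    """One stored memory whose strength decays over time and grows when retrieved."""
    def __init__(self, text, half_life_days=7.0):
        self.text = text
        self.strength = 1.0
        self.last_access = time.time()
        self.half_life = half_life_days * 86400  # seconds

    def current_strength(self, now=None):
        # Exponential decay since the last access (Ebbinghaus-style forgetting).
        now = now or time.time()
        elapsed = now - self.last_access
        return self.strength * 0.5 ** (elapsed / self.half_life)

    def reinforce(self, boost=1.5):
        # Retrieval strengthening: accessing a memory resets and amplifies it.
        self.strength = self.current_strength() * boost
        self.last_access = time.time()

def consolidate(items, threshold=0.1):
    # Consolidation pass: keep only memories whose strength is still above threshold.
    return [m for m in items if m.current_strength() >= threshold]
```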
Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents
3-13% of cited URLs generated by major LLMs such as GPT-5.1, Gemini, and Claude are non-existent fakes, and urlhealth, an open-source tool, can remove over 99% of them.
We replaced RAG with a virtual filesystem for our AI documentation assistant
Explains how Mintlify overcame RAG chunking limitations by building a virtual filesystem (ChromaFs) on top of Chroma DB that mimics UNIX commands, reducing session boot time from 46 seconds to 100ms.
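ChromaFs itself isn't published in the post, but a rough sketch of the idea (UNIX-style `ls`/`cat`/`grep` resolved against a Chroma collection, with the virtual path layout kept in metadata) could look like this; the metadata keys and command set are assumptions for illustration:

```python
import chromadb

client = chromadb.Client()
docs = client.create_collection("docs")

# Each "file" is a document whose virtual path lives in its id and metadata.
docs.add(
    ids=["guides/quickstart.md", "api/auth.md"],
    documents=["# Quickstart\n...", "# Authentication\n..."],
    metadatas=[{"dir": "guides"}, {"dir": "api"}],
)

def ls(directory):
    """List virtual files in a directory, like `ls`."""
    return docs.get(where={"dir": directory})["ids"]

def cat(path):
    """Return the full contents of one virtual file, like `cat`."""
    return docs.get(ids=[path])["documents"][0]

def grep(query, n=5):
    """Semantic `grep`: nearest documents to a natural-language query."""
    hits = docs.query(query_texts=[query], n_results=n)
    return list(zip(hits["ids"][0], hits["documents"][0]))
```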
Lat.md: Agent Lattice: a knowledge graph for your codebase, written in Markdown
A tool that manages design decisions and domain knowledge across a codebase as a graph of interconnected Markdown files, overcoming the limits of a single AGENTS.md file so AI agents can grasp context quickly without traversing the code.
From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem
A breakdown of how LLM KV Cache architecture has evolved from GPT-2 to DeepSeek V3, comparing per-token memory costs across architectures as they dropped from 300KB to 69KB.
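The per-token cost follows directly from the attention layout: two tensors (K and V) per layer, one vector per KV head. A back-of-the-envelope calculator, with illustrative configs rather than the article's exact figures:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-token KV cache size: K and V vectors for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Full multi-head attention (every query head keeps its own K/V), fp16:
mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
# Grouped-query attention shrinks the KV head count:
gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(mha / 1024, "KB vs", gqa / 1024, "KB per token")  # 512.0 KB vs 128.0 KB
```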
CERN uses ultra-compact AI models on FPGAs for real-time LHC data filtering
CERN uses a 'hardware-first' inference approach at the LHC by burning PyTorch/TensorFlow models directly into FPGAs to filter hundreds of terabytes of collision data per second at nanosecond latency — a radical departure from conventional GPU/TPU-based AI.
Chroma Context-1: Training a Self-Editing Search Agent
Chroma's newly released 20B parameter agentic search model claims frontier-LLM-level retrieval performance at 1/10 the cost and 10x the speed — though a significant controversy over failure to cite prior work has emerged in the community.
Show HN: Gemini can now natively embed video, so I built sub-second video search
Google's Gemini Embedding model can now embed video directly into vectors without text transcription, enabling natural language search over dashcam footage — describe 'red truck running a stop sign' and get the clip back.
DBAutoDoc: Automated Discovery and Documentation of Undocumented Database Schemas via Statistical Analysis and Iterative LLM Refinement
Automates documentation of legacy dark databases using backpropagation-inspired iterative LLM refinement — 96.1% composite score, $0.70 per 100 tables (99.5% cost reduction)
LLM Neuroanatomy II: Modern LLM Hacking and Hints of a Universal Language?
A training-free technique (RYS) that duplicates Transformer layers works across all modern LLMs — and reveals that internal representations converge toward a "universal language" independent of human language.
From zero to a RAG system: successes and failures
A hands-on account of building a local LLM-based RAG system from scratch on 1TB of internal technical documentation, honestly sharing the trial and error encountered from data preprocessing to vector indexing.
I built an AI receptionist for a mechanic shop
A dev built an AI receptionist for their brother's auto shop — combining a RAG pipeline with Vapi's voice platform to actually answer phone calls — because missed calls were costing thousands per month.
The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $\lambda$-Calculus
Instead of having LLMs write recursive code directly, use deterministic lambda-calculus-based combinators (SPLIT/MAP/FILTER/REDUCE) to process long documents — achieving +21.9% accuracy and 4.1x speed.
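In spirit, the combinators let a deterministic harness decompose the document while the LLM only ever sees bounded sub-prompts. A minimal sketch under that reading (the `llm()` call and the chunking are placeholders, not the paper's implementation):

```python
from functools import reduce

def llm(prompt: str) -> str:
    """Placeholder for a bounded-context LLM call; plug in your model client here."""
    raise NotImplementedError

def SPLIT(doc: str, size: int = 2000) -> list[str]:
    # Deterministic chunking: the model never sees the whole document at once.
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def MAP(chunks: list[str], task: str) -> list[str]:
    return [llm(f"{task}\n\n{c}") for c in chunks]

def FILTER(chunks: list[str], predicate: str) -> list[str]:
    return [c for c in chunks
            if llm(f"Answer yes/no: {predicate}\n\n{c}").lower().startswith("yes")]

def REDUCE(partials: list[str], task: str) -> str:
    return reduce(lambda a, b: llm(f"{task}\n\nA: {a}\n\nB: {b}"), partials)

# Example: answer a question over a long document without feeding it whole.
def answer(doc: str, question: str) -> str:
    relevant = FILTER(SPLIT(doc), f"Is this passage relevant to: {question}?")
    notes = MAP(relevant, f"Extract facts relevant to: {question}")
    return REDUCE(notes, f"Combine A and B into one answer to: {question}")
```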
Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents
An LLM memory system that compresses conversations into semantic triples, cutting tokens by 95% while maintaining top-tier accuracy.
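A toy version of the triple-store idea (the extraction prompt and the keyword recall are assumptions for illustration; Memori's actual schema and retrieval may differ):

```python
import json

def extract_triples(turn: str, llm) -> list[tuple[str, str, str]]:
    """Ask the model to boil a conversation turn down to (subject, relation, object) facts."""
    prompt = (
        "Extract the factual content of this message as JSON triples "
        '[["subject", "relation", "object"], ...]:\n' + turn
    )
    return [tuple(t) for t in json.loads(llm(prompt))]

class TripleMemory:
    def __init__(self):
        self.triples: set[tuple[str, str, str]] = set()

    def remember(self, turn: str, llm):
        self.triples.update(extract_triples(turn, llm))

    def recall(self, keyword: str) -> list[tuple[str, str, str]]:
        # Simple keyword recall stands in for whatever retrieval Memori actually uses.
        kw = keyword.lower()
        return [t for t in self.triples if any(kw in part.lower() for part in t)]
```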
BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection
Structuring long documents page-by-page and compressing without truncation achieves 26.4x faster compression than LongLLMLingua.
[R] Doc-to-LoRA: Learning to Instantly Internalize Contexts from Sakana AI
Sakana AI D2L — hypernetwork generates LoRA adapter from a document in a single forward pass, sub-second latency, extends context window 5x beyond base model capacity
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Extracting the implicit 3D spatial knowledge learned by video generation models (Wan2.1) to boost MLLM spatial reasoning ability.
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Multilingual embeddings supporting 200 languages without English bias that outperform Qwen3-Embedding at smaller sizes.
Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval
Instead of simple topic search in RAG, using a 'hypothesis → 3 targeted queries' approach retrieves documents that actually help select the right answer.
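The "hypothesis, then targeted queries" step can be as small as two prompts chained in front of a retriever; a hedged sketch with `llm()` and `retrieve()` as placeholders, not the paper's code:

```python
def hypothesis_conditioned_retrieve(question: str, llm, retrieve, k: int = 5):
    """Form a working hypothesis first, then issue queries aimed at confirming or refuting it."""
    hypothesis = llm(f"Propose a tentative answer to: {question}")
    queries = llm(
        "Write 3 search queries, one per line, that would verify or refute this hypothesis:\n"
        f"{hypothesis}"
    ).splitlines()[:3]

    docs = []
    for q in queries:
        docs.extend(retrieve(q, k=k))

    # Deduplicate while preserving order before handing to the answering model.
    seen, unique = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            unique.append(d)
    return hypothesis, unique
```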
[D] Breaking down MiroThinker H1's verification centric reasoning: why fewer interaction rounds produce better agent performance
MiroThinker H1 verification-centric reasoning: forces agents away from greedy paths — 17% better performance with 43% fewer interaction rounds
Memento-Skills: Let Agents Design Agents
A system where agents self-evolve by accumulating executable 'Skill' files as external memory, without touching LLM parameters
Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory
A memory framework that structures time-based events from conversation history to answer questions like 'what did I do last month?' with 95.6% accuracy
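A bare-bones version of time-keyed event memory (field names and the window query are illustrative assumptions, not Chronos's schema):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Event:
    when: datetime      # when the event actually happened, not when it was mentioned
    description: str

class TemporalMemory:
    def __init__(self):
        self.events: list[Event] = []

    def add(self, when: datetime, description: str):
        self.events.append(Event(when, description))

    def between(self, start: datetime, end: datetime) -> list[Event]:
        """Answer 'what did I do last month?'-style questions with a time-window scan."""
        return sorted(
            (e for e in self.events if start <= e.when < end),
            key=lambda e: e.when,
        )

memory = TemporalMemory()
memory.add(datetime(2025, 10, 3), "booked flights to Lisbon")
memory.add(datetime(2025, 11, 21), "finished the quarterly report")
print(memory.between(datetime(2025, 11, 1), datetime(2025, 12, 1)))
```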
Launch HN: Voygr (YC W26) – A better maps API for agents and AI apps
Place-data freshness infrastructure targeting a problem the Google Maps API can't solve ('Is this place actually still open right now?'), addressing the stale-data issues AI agents hit when interacting with the real world.
I used Obsidian as a persistent brain for Claude Code and built a full open-source tool over a weekend. Happy to share the exact setup.
A development workflow sharing how someone used an Obsidian vault as Claude Code's persistent memory to ship an open-source tool in a weekend.
I fed 14 years of daily journals into Claude Code
Someone fed 14 years of journal entries — 5,000 entries total — into Claude Code for pattern analysis and got surprisingly deep insights they didn't expect.
Neuron-Aware Data Selection In Instruction Tuning For Large Language Models
A framework that automatically selects high-quality fine-tuning data by analyzing internal neuron activation patterns in the model.
1M context is now generally available for Opus 4.6 and Sonnet 4.6
Anthropic rolled out 1M token context windows for Opus 4.6 and Sonnet 4.6 — this changes what's practical for long-context tasks.
ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation
A benchmark dataset for systematically evaluating and reducing LLM hallucinations when analyzing ESG reports.
Launch HN: Captain (YC W26) – Automated RAG for Files
YC W26 startup Captain auto-builds your entire RAG pipeline from a file upload — no configuration required.
Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation
Compresses AI coding agent conversation histories 11x into searchable memory — with almost no quality loss on vector search.
Long-form RewardBench: Evaluating Reward Models for Long-form Generation
The first evaluation dataset specifically for long-text generation, addressing the gap in existing Reward Model benchmarks that only cover short texts.
DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning
A framework that auto-generates specialized fine-tuning data for finance, medicine, math and more from just a task definition — no human labeling needed.
The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
HSL color structure discovered in FLUX.1's latent space — enabling direct color control during generation with no additional training.
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
The MADQA benchmark (800 PDFs, 2,250 questions) shows that even top AI agents can't navigate documents 'strategically' the way humans do.
XSkill: Continual Learning from Experience and Skills in Multimodal Agents
A multimodal agent that keeps getting smarter on its own by accumulating two types of parameter-free memory: past experiences (action-level) and skills (task-level).
Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability
A technique that speeds up LLM inference up to 14.4x without any training, based on the observation that attention barely changes within a sentence.
BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs
Which model to use for zero-shot text classification without labeled data — direct comparison of 38 models across 22 datasets.
CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading
A Human-in-the-Loop grading system that auto-grades only when the LLM is confident, and routes uncertain answers to teachers.
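The routing rule itself is a small piece of glue; a sketch assuming the grader returns a calibrated confidence alongside its score (the threshold and return format are assumptions):

```python
def grade(answer: str, rubric: str, llm_grade, confidence_threshold: float = 0.9):
    """Auto-grade only when the model is confident; otherwise defer to a teacher."""
    score, confidence = llm_grade(answer, rubric)  # e.g. (0.8, 0.95)
    if confidence >= confidence_threshold:
        return {"score": score, "graded_by": "llm", "confidence": confidence}
    return {"score": None, "graded_by": "teacher_queue", "confidence": confidence}
```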
Tiny-Critic RAG: Empowering Agentic Fallback with Parameter-Efficient Small Language Models
Implementing GPT-4o-mini-level RAG noise filtering with a 1.7B small model — 98% cost reduction, 94.6% latency reduction.
Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
An uncertainty measurement framework that proactively detects queries where multimodal LLMs are likely to be wrong — without external tools — and auto-routes them to experts or larger models.
Probing for Knowledge Attribution in Large Language Models
A simple linear classifier on LLM internal hidden states can distinguish whether the model used context vs parametric memory with 0.96 F1.
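Probes of this kind are a few lines once the hidden states are collected; a sketch assuming you already have per-example hidden-state vectors and context-vs-parametric labels (the files and layer choice are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# X: hidden states from one chosen layer, one row per example (placeholder files).
# y: 1 if the answer came from the provided context, 0 if from parametric memory.
X = np.load("hidden_states.npy")   # shape (n_examples, hidden_dim)
y = np.load("labels.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe F1:", f1_score(y_test, probe.predict(X_test)))
```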
Ggml.ai joins Hugging Face to ensure the long-term progress of Local AI
The ggml.ai team behind llama.cpp has joined Hugging Face, keeping everything open-source — a big deal for the local LLM ecosystem.
Exploiting contextual information to improve stance detection in informal political discourse with LLMs
Adding user profile summaries built from past posts to the prompt boosted political stance classification accuracy by up to 38.5 percentage points.
Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Model
LLM context compression quality is determined by data distribution, not model architecture — and the decoder's training data dominates over the encoder's.
Large Language Models for Assisting American College Applications
A practical LLM system architecture paper for US college application assistance, built on RAG + Human-in-the-loop design.
NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems
When wrong retrieval results leak into RAG, LLMs become confidently wrong; this paper fixes that by fine-tuning on just 2K examples.
Hallucination Detection and Mitigation in Large Language Models
A 3-stage framework for systematically detecting and reducing LLM hallucinations by root cause in high-stakes domains like finance and law
TimeCapsuleLLM: LLM trained only on data from 1800-1875
A small language model experiment trained exclusively on early 19th century London texts — testing whether a model can internalize historical language rather than just imitate it.
Beyond Dialogue Time: Temporal Semantic Memory for Personalized LLM Agents
A framework that stores and retrieves LLM agent memory based on 'actual event occurrence time' rather than 'conversation date', improving personalization accuracy by up to 12.2%
Over-Searching in Search-Augmented Large Language Models
A systematic study on how LLMs equipped with search tools wastefully repeat searches even for unanswerable questions, driving up costs and error rates.
Decide Then Retrieve: A Training-Free Framework with Uncertainty-Guided Triggering and Dual-Path Retrieval
A framework that reduces RAG noise by first judging whether retrieval is needed based on LLM uncertainty (instead of always retrieving), then searching via two parallel paths — the original query and a pseudo-document.
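The two pieces (uncertainty-gated triggering, then retrieval via both the raw query and a generated pseudo-document) compose naturally; a sketch with placeholder `llm`, `uncertainty`, and `retrieve` functions rather than the paper's implementation:

```python
def decide_then_retrieve(question: str, llm, uncertainty, retrieve,
                         threshold: float = 0.5, k: int = 5) -> str:
    # 1) Trigger: only retrieve when the model is unsure of its closed-book answer.
    draft = llm(question)
    if uncertainty(question, draft) < threshold:
        return draft  # confident enough, skip retrieval entirely

    # 2) Dual-path retrieval: the literal query plus a HyDE-style pseudo-document.
    pseudo_doc = llm(f"Write a short passage that would answer: {question}")
    docs = retrieve(question, k=k) + retrieve(pseudo_doc, k=k)

    context = "\n\n".join(dict.fromkeys(docs))  # dedupe while keeping order
    return llm(f"Answer using the context.\n\nContext:\n{context}\n\nQuestion: {question}")
```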
O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL
A multi-agent system automatically generates high-quality training data and refines it with RL, building a deep research system that surpasses GPT-5 and OpenAI O3 using open-source models.