RAG
Latest 60 papers on RAG.
TAHOE: Text-to-SQL with Automated Hint Optimization from Experience
A system that accumulates the hints an LLM learns from failed SQL generations into a reusable Hint Bank, substantially raising accuracy on Snowflake-dialect SQL without retraining the model.
A €0.01 bank transfer could compromise a banking AI agent
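An indirect prompt injection vulnerability found in the AI assistant of Bunq, Europe's second-largest digital bank, that let an attacker automatically launch a phishing attack on a user with a transfer of just €0.01.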
Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models
When long-term memory is attached to an LLM, it remembers even the user's mistaken beliefs and answers incorrectly; this sycophancy becomes up to 25 times worse.
DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning
A Deep Research multi-agent framework from Baidu that combines DAG-based dynamic planning, a recursive search agent, and rubric scaffolding to achieve SOTA on two benchmarks.
Inside FAISS: Billion-Scale Similarity Search
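A visual explanation of IVF (partitioning) and Product Quantization (compression), the core algorithms that let FAISS search billions of vectors quickly; it helps developers building RAG or vector search systems understand how the internals work.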
Show HN: Ktx – Open-source executable context layer for data agents
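An open-source tool that automatically builds a semantic layer, memory, and business knowledge so that AI agents can query a company's data warehouse accurately. It addresses the problem of agents re-exploring the warehouse every time or inventing incorrect metric logic.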
CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
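A non-parametric algorithm that compares successful and failed reasoning traces to extract short natural-language insights, improving model reasoning performance faster than GRPO with as few as 5 training samples.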
MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems
A debugging framework that automatically finds why LLM memory systems such as RAG and Mem0 produce wrong answers.
6 months of .md memory, conflicting facts are the hard part
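Describes the knowledge-conflict problems found while running AI agent memory as Markdown files for six months, and an escalation pattern in which a human resolves conflicts directly through a Telegram bot.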
Mitigating Provenance-Role Collapse in Long-Term Agents via Typed Memory Representation
A paper that addresses the problem of LLM agents' long-term memory mixing up provenance by using a structure of 'typed memory atoms'.
Show HN: Semble – Code search for agents that uses 98% fewer tokens than grep
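A search library that, instead of grep plus file reads, returns only the relevant code snippets for a natural-language query when an AI agent explores a codebase, cutting token usage by about 98%.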
Δ-Mem: Efficient Online Memory for Large Language Models
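A paper proposing δ-mem, a lightweight memory module that lets an LLM remember past information efficiently without enlarging its context window. It can be attached to an existing LLM without changing or fine-tuning the model to improve long-term memory performance, which has drawn interest from developers of agent systems.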
How Claude Code works in large codebases
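Anthropic's write-up of patterns from running Claude Code in multi-million-line monorepos, legacy systems, and environments with dozens of microservices. It covers why agentic search is used instead of RAG, along with the limitations seen in practice.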
Show HN: Airbyte Agents – context for agents across multiple data sources
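Airbyte has released a Context Store that pre-indexes data from multiple SaaS systems such as Slack, Salesforce, and Linear so that an agent does not have to dig through each API. It reports token reductions of up to 90% compared with the existing MCP approach.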
A polynomial autoencoder beats PCA on transformer embeddings
A technique that attaches a quadratic polynomial decoder to a PCA encoder to substantially improve embedding compression quality in closed form; it can be implemented with numpy alone, without SGD.
To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
LLMs misjudge when they should use external tools such as web search, and this can be corrected using the model's internal hidden states.
From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction
Storing memory in structured records defined by a schema, instead of RAG-style text retrieval, yields far higher accuracy on exact fact lookup, state tracking, and aggregation queries.
Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
LLM Agent automates incident response, slashing alerts by 75% and resolution times by 50%.
Show HN: A Karpathy-style LLM wiki your agents maintain (Markdown and Git)
WUPHF builds a shared knowledge base using a Git-based Markdown Wiki, enabling multiple AI agents—including Claude and Codex—to autonomously divide and execute tasks.
Different Language Models Learn Similar Number Representations
LLMs, regardless of architecture—from Transformers to LSTMs—consistently learn periodic patterns with periods T=2, 5, and 10 when representing numbers, mathematically explaining a 'convergent evolution' phenomenon beyond model architecture.
Show HN: Atomic – Local-first, AI-augmented personal knowledge base
Atomic builds a self-hosted, open-source personal knowledge graph app that automatically embeds, tags, and links notes, web clips, and RSS feeds—supporting semantic search, LLM-powered wiki synthesis, and MCP integration.
Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
In agentic forecasting, maintaining a Bayesian linguistic belief state yields a larger improvement than the gain from adding web search itself.
GAIA – Open-source framework for building AI agents that run on local hardware
AMD has released GAIA, a Python/C++ framework that allows AI agents to run on local PCs without the cloud. This approach addresses privacy and latency issues, but critics point to the practical limitations of the ROCm ecosystem.
Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks
A method that improves accuracy by having an aggregator agent explore and synthesize the results of multiple AI agents run in parallel, rather than taking a simple vote.
Dynamic Context Evolution for Scalable Synthetic Data Generation
A framework that completely eliminates duplication and repetition in large-scale synthetic data generation with LLMs using three mechanisms (VTS + Semantic Memory + Adaptive Prompt).
Show HN: We fingerprinted 178 AI models' writing styles and similarity clusters
Measures writing-style similarity across 178 AI models along 32 dimensions and finds that even models with large price differences share writing patterns that are over 78% similar.
Show HN: Hippo, biologically inspired memory for AI agents
Hippo is an open-source memory layer that allows you to share memories across sessions between various AI agent tools such as Claude Code, Cursor, and Codex. It implements the brain's mechanisms of memory decay, retrieval strengthening, and consolidation in code.
Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents
3-13% of cited URLs generated by major LLMs such as GPT-5.1, Gemini, and Claude are non-existent fakes, and urlhealth, an open-source tool, can remove over 99% of them.
We replaced RAG with a virtual filesystem for our AI documentation assistant
Explains how Mintlify overcame RAG chunking limitations by building a virtual filesystem (ChromaFs) on top of Chroma DB that mimics UNIX commands, reducing session boot time from 46 seconds to 100ms.
Lat.md: Agent Lattice: a knowledge graph for your codebase, written in Markdown
A tool that manages design decisions and domain knowledge across a codebase as a graph of interconnected Markdown files. It overcomes the limitations of a single AGENTS.md file and lets AI agents grasp context quickly without traversing the code.
From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem
A breakdown of how LLM KV Cache architecture has evolved from GPT-2 to DeepSeek V3, comparing per-token memory costs across architectures as they dropped from 300KB to 69KB.
CERN uses ultra-compact AI models on FPGAs for real-time LHC data filtering
CERN uses a 'hardware-first' inference approach at the LHC by burning PyTorch/TensorFlow models directly into FPGAs to filter hundreds of terabytes of collision data per second at nanosecond latency — a radical departure from conventional GPU/TPU-based AI.
Chroma Context-1: Training a Self-Editing Search Agent
Chroma's newly released 20B parameter agentic search model claims frontier-LLM-level retrieval performance at 1/10 the cost and 10x the speed — though a significant controversy over failure to cite prior work has emerged in the community.
Show HN: Gemini can now natively embed video, so I built sub-second video search
Google's Gemini Embedding model can now embed video directly into vectors without text transcription, enabling natural language search over dashcam footage — describe 'red truck running a stop sign' and get the clip back.
DBAutoDoc: Automated Discovery and Documentation of Undocumented Database Schemas via Statistical Analysis and Iterative LLM Refinement
Automates documentation of legacy dark databases using backpropagation-inspired iterative LLM refinement — 96.1% composite score, $0.70 per 100 tables (99.5% cost reduction)
LLM Neuroanatomy II: Modern LLM Hacking and Hints of a Universal Language?
A training-free technique (RYS) that duplicates Transformer layers works across all modern LLMs — and reveals that internal representations converge toward a "universal language" independent of human language.
From zero to a RAG system: successes and failures
A hands-on account of building a local LLM-based RAG system from scratch on 1TB of internal technical documentation, honestly sharing the trial and error encountered from data preprocessing to vector indexing.
I built an AI receptionist for a mechanic shop
A dev built an AI receptionist for their brother's auto shop — combining a RAG pipeline with Vapi's voice platform to actually answer phone calls — because missed calls were costing thousands per month.
The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus
Instead of having LLMs write recursive code directly, use deterministic lambda-calculus-based combinators (SPLIT/MAP/FILTER/REDUCE) to process long documents — achieving +21.9% accuracy and 4.1x speed.
Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents
An LLM memory system that compresses conversations into semantic triples, cutting tokens by 95% while maintaining top-tier accuracy.
BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection
Structuring long documents page-by-page and compressing without truncation achieves 26.4x faster compression than LongLLMLingua.
[R] Doc-to-LoRA: Learning to Instantly Internalize Contexts from Sakana AI
Sakana AI D2L — hypernetwork generates LoRA adapter from a document in a single forward pass, sub-second latency, extends context window 5x beyond base model capacity
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Extracting the implicit 3D spatial knowledge learned by video generation models (Wan2.1) to boost MLLM spatial reasoning ability.
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Multilingual embeddings supporting 200 languages without English bias that outperform Qwen3-Embedding at smaller sizes.
Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval
Instead of simple topic search in RAG, using a 'hypothesis → 3 targeted queries' approach retrieves documents that actually help select the right answer.
[D] Breaking down MiroThinker H1's verification centric reasoning: why fewer interaction rounds produce better agent performance
MiroThinker H1 verification-centric reasoning: forces agents away from greedy paths — 17% better performance with 43% fewer interaction rounds
Memento-Skills: Let Agents Design Agents
A system where agents self-evolve by accumulating executable 'Skill' files as external memory, without touching LLM parameters
Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory
A memory framework that structures time-based events from conversation history to answer questions like 'what did I do last month?' with 95.6% accuracy
Launch HN: Voygr (YC W26) – A better maps API for agents and AI apps
A place data freshness infrastructure targeting the problem Google Maps API can't solve — 'Is this place actually still open right now?' — aimed at the stale data issues AI agents face when interacting with the real world.
I used Obsidian as a persistent brain for Claude Code and built a full open source tool over a weekend. happy to share the exact setup.
A development workflow sharing how someone used an Obsidian vault as Claude Code's persistent memory to ship an open-source tool in a weekend.
I fed 14 years of daily journals into Claude Code
Someone fed 14 years of journals (5,000 entries in total) into Claude Code for pattern analysis and got unexpectedly deep insights.
Neuron-Aware Data Selection In Instruction Tuning For Large Language Models
A framework that automatically selects high-quality fine-tuning data by analyzing internal neuron activation patterns in the model.
1M context is now generally available for Opus 4.6 and Sonnet 4.6
Anthropic rolled out 1M token context windows for Opus 4.6 and Sonnet 4.6 — this changes what's practical for long-context tasks.
ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation
A benchmark dataset for systematically evaluating and reducing LLM hallucinations when analyzing ESG reports.
Launch HN: Captain (YC W26) – Automated RAG for Files
YC W26 startup Captain auto-builds your entire RAG pipeline from a file upload — no configuration required.
Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation
Compresses AI coding agent conversation histories 11x into searchable memory — with almost no quality loss on vector search.
Long-form RewardBench: Evaluating Reward Models for Long-form Generation
The first evaluation dataset specifically for long-text generation, addressing the gap in existing Reward Model benchmarks that only cover short texts.
DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning
A framework that auto-generates specialized fine-tuning data for finance, medicine, math and more from just a task definition — no human labeling needed.
The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
HSL color structure discovered in FLUX.1's latent space — enabling direct color control during generation with no additional training.
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
The MADQA benchmark (800 PDFs, 2,250 questions) shows that even top AI agents can't navigate documents 'strategically' the way humans do.