From zero to a RAG system: successes and failures
TL;DR Highlight
A hands-on account of building a local LLM-based RAG system from scratch on 1TB of internal technical documentation, honestly sharing the trial and error encountered from data preprocessing to vector indexing.
Who Should Read
Backend/ML engineers looking to adopt a RAG system for the first time for internal document search or knowledge base construction, especially developers who feel lost about where to start when dealing with large volumes of unstructured documents.
Core Mechanics
- The requirements seemed simple but turned out to be highly complex: the system needed to support natural language Q&A over 10 years of company projects, deliver fast response times, avoid external APIs (for security), and even cover specialized simulation files like OrcaFlex. The given data was a 1TB Azure folder containing thousands of files in a jumbled mix of formats and structures.
- Technology stack decisions: Ollama for local LLM execution, nomic-embed-text as the embedding model (chosen for its strong quality on technical documents), and the open-source framework LlamaIndex for RAG orchestration (document indexing, embedding generation, vector DB storage and querying). Python was used as the language, and the initial prototype worked well with minimal code.
- The first major hurdle was memory explosion. When processing real documents with LlamaIndex, multi-gigabyte simulation files and videos were being loaded entirely into memory as if they were plain text, crashing the laptop. The fix was adding a filtering pipeline by file extension and filename pattern, excluding dozens of extensions such as .mp4, .exe, .zip, .sim, and .dat, as well as backup files, from processing.
- Even after file filtering, problems persisted. Scanned PDFs with no extractable text, CAD files hundreds of MB in size, and encrypted documents blocked the pipeline. The solution was setting a file size limit and adding exception handling to skip files when text extraction failed.
- The indexing job itself became a long-term project. The laptop proved insufficient, so the work was moved to a Hetzner cloud VM, and indexing completed only after running continuously for 2–3 weeks. Total cost was €184—commenters widely noted this was 'a drop in the bucket compared to three weeks of labor.'
- Chunking strategy (splitting documents into appropriately sized pieces) had a major impact on retrieval quality. Chunks that are too large mix in irrelevant content; chunks that are too small lose context. Specialized format files like OrcaFlex were difficult to chunk meaningfully with standard text parsers and required separate handling.
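The filtering, size-limit, skip-on-failure, and chunking steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the author's actual code: the skip lists, the 50 MB cap, the chunk size and overlap, and every function name here are assumptions chosen for the example, and the character-based chunker stands in for whatever splitter the real pipeline used.

```python
from pathlib import Path
from typing import Callable, Iterator, List, Optional

# Assumed values for illustration; the source only names a few of the
# excluded extensions (mp4, exe, zip, .sim, .dat) and "backup files".
SKIP_EXTENSIONS = {".mp4", ".exe", ".zip", ".sim", ".dat", ".bak"}
SKIP_NAME_PARTS = ("backup",)        # hypothetical filename pattern
MAX_BYTES = 50 * 1024 * 1024         # hypothetical 50 MB size cap


def should_index(path: Path, size_bytes: int) -> bool:
    """Return True only for files worth sending to text extraction."""
    if path.suffix.lower() in SKIP_EXTENSIONS:
        return False
    name = path.name.lower()
    if any(part in name for part in SKIP_NAME_PARTS):
        return False
    return size_bytes <= MAX_BYTES


def iter_indexable(root: Path) -> Iterator[Path]:
    """Walk a directory tree and yield only files that pass the filter."""
    for path in root.rglob("*"):
        if path.is_file() and should_index(path, path.stat().st_size):
            yield path


def safe_extract(path: Path, extract: Callable[[Path], str]) -> Optional[str]:
    """Run an extractor; return None (skip the file) if it raises or yields no text."""
    try:
        text = extract(path)
    except Exception:
        # Encrypted or corrupt files end up here and are skipped.
        return None
    # Scanned PDFs with no text layer typically produce empty output.
    return text if text.strip() else None


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
        start += size - overlap
    return chunks
```

The overlap is one simple way to soften the trade-off between chunks that are too large and chunks that are too small: neighbouring chunks share some text, so a sentence cut at a boundary still appears whole in one of them. Format-specific files such as OrcaFlex would still need their own parser before a chunker like this is useful.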
Evidence
- Experienced RAG practitioners strongly agreed with the author's conclusion that 'data preprocessing is the key.' One commenter shared that they spent a week building an ETL pipeline with separate SQL and graph DBs just to extract pricing information from a single 20-page PDF with 100% accuracy, and criticized the naive approach of simply converting PDFs to markdown with docling and dumping them into a vector DB as 'absurd.'
- A counterargument emerged against the 'RAG is dead' claim: the notion that growing LLM context windows make RAG unnecessary was rebutted with the observation that 'even if all of Lord of the Rings fits in context, an entire law library, all of Wikipedia, or 451GB of data like in this post is still many times larger.'
- Conversely, one commenter viewed RAG as 'AI Lite' or 'AI-adjacent technology,' predicting that once context windows grow large enough, RAG's deterministic approach will limit LLMs' reasoning capabilities.
- There was also regret expressed over not using re-ranking: one commenter noted that 'it's a shame you used a good embedding model but skipped the re-ranker. With a re-ranker you can get away with a smaller, cheaper embedding model and even use smaller embedding vectors, which is actually an advantage.' Re-ranking is a technique where candidate documents retrieved via vector search are re-scored for relevance to the query by a second model, typically a cross-encoder or an LLM.
- More advanced suggestions included structured data preprocessing and ReAG (Reasoning-based RAG): one commenter advised that 'instead of dumping unstructured data into a vector DB, doing basic preprocessing and labeling and embedding with different schemas dramatically improves retrieval quality and flexibility. Adding a memory knowledge graph on top enables context that updates over time rather than static documents.'
- There was also an interesting perspective on the €184 VM cost: 'In a project where you've poured in three weeks of labor, €184 is just pocket change' garnered widespread agreement.
- A question about whether a RAG tool exists that works like SQLite, as a single file with no backend, also surfaced, reflecting many developers' desire for simpler RAG solutions.
How to Apply
- If you are building an internal document RAG system for the first time, build the data filtering pipeline before choosing an embedding model. Simply filtering out files that cannot yield meaningful text (such as mp4, exe, zip, backup files, and simulation output files) by extension and filename pattern alone can prevent a large share of memory explosion and pipeline crash issues.
- For rapid prototyping of a local LLM-based RAG system, consider starting with the Ollama + LlamaIndex + nomic-embed-text stack. This combination is great for building a working prototype with minimal code, but you must add file size limits and exception handling before feeding in real data.
- For large-scale document sets where indexing will take weeks or more, plan from the outset to run the job on a cloud VM like Hetzner rather than a laptop. VM costs are negligible compared to labor costs, and the ability to run uninterrupted for extended periods makes the process far more stable.
- To improve retrieval quality, adding a re-ranking stage to your pipeline may be more effective than swapping out the embedding model. The approach is to retrieve Top-K candidates via vector search and then have a re-ranker re-score them by actual relevance to the query, which allows you to use a smaller, cheaper embedding model in the first place.
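The retrieve-then-re-rank flow can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for the example: the function names are hypothetical, the vectors would come from a real embedding model, and `overlap_score` is only a toy stand-in for a real re-ranker such as a cross-encoder.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec: Sequence[float],
                   doc_vecs: Sequence[Sequence[float]],
                   k: int) -> List[int]:
    """Stage 1: return indices of the k documents closest to the query vector."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]


def rerank(query: str,
           docs: Sequence[str],
           candidates: Sequence[int],
           score: Callable[[str, str], float],
           top_n: int) -> List[int]:
    """Stage 2: re-order the candidates by a (query, document) relevance score."""
    ranked = sorted(candidates,
                    key=lambda i: score(query, docs[i]),
                    reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, doc: str) -> float:
    """Toy scorer (NOT a real re-ranker): fraction of query words found in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0
```

In a real pipeline, `score` would wrap a call to a re-ranking model, and `k` would be set larger than `top_n` so that the cheaper vector search only needs to get the right documents somewhere into the candidate set, leaving the final ordering to the re-ranker.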
Related Papers
Show HN: Bible as RAG Database
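A web service that indexes the entire Bible as a RAG (retrieval-augmented generation) database so that relevant verses can be searched semantically by topic or keyword. It is a practical example of applying RAG to religious texts and a useful reference for developers who want to build similar projects.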
Haystack: Open-Source AI Framework for Production Ready Agents, RAG
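An open-source AI orchestration framework built by deepset that is gaining attention as an alternative to LangChain. Its modular pipeline approach lets you build RAG, agent, and multimodal apps all the way to production.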
We built a persistent agent memory layer on Elasticsearch with 0.89 recall
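A write-up of the architecture of a multi-tenant long-term memory system built on Elasticsearch so that AI agents can remember user information even after a session ends. It describes the concrete implementation that achieved an R@10 of 0.89 on 168 questions with zero cross-tenant data leaks.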
TAHOE: Text-to-SQL with Automated Hint Optimization from Experience
A system that accumulates hints learned from an LLM's failed SQL generations into a reusable Hint Bank, substantially improving accuracy on Snowflake-dialect SQL without retraining the model.
Inside FAISS: Billion-Scale Similarity Search
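An article that visually explains IVF (partitioning) and Product Quantization (compression), the core algorithms that let FAISS search billions of vectors quickly. It helps developers building RAG or vector search systems understand how the internals work.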
Show HN: Airbyte Agents – context for agents across multiple data sources
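Airbyte has released a Context Store that pre-indexes data from multiple SaaS systems such as Slack, Salesforce, and Linear, so that agents do not have to dig through each API one by one. It was found to cut token usage by up to 90% compared with the existing MCP approach.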