From zero to a RAG system: successes and failures
TL;DR Highlight
A hands-on account of building a local LLM-based RAG system from scratch on 1TB of internal technical documentation, honestly sharing the trial and error encountered from data preprocessing to vector indexing.
Who Should Read
Backend/ML engineers looking to adopt a RAG system for the first time for internal document search or knowledge base construction, especially developers who feel lost about where to start when dealing with large volumes of unstructured documents.
Core Mechanics
- The requirements seemed simple but turned out to be highly complex: the system needed to support natural language Q&A over 10 years of company projects, deliver fast response times, avoid external APIs (for security), and even cover specialized simulation files like OrcaFlex. The given data was a 1TB Azure folder containing thousands of files in a jumbled mix of formats and structures.
- Technology stack decisions: Ollama for local LLM execution, nomic-embed-text as the embedding model (chosen for its strong quality on technical documents), and the open-source framework LlamaIndex for RAG orchestration (document indexing, embedding generation, vector DB storage and querying). Python was used as the language, and the initial prototype worked well with minimal code.
- The first major hurdle was memory explosion. When processing real documents with LlamaIndex, multi-gigabyte simulation files and videos were loaded entirely into memory as if they were plain text, crashing the laptop. The fix was a filtering pipeline keyed on file extension and filename pattern, excluding dozens of extensions such as .mp4, .exe, .zip, .sim, and .dat, along with backup files, from processing.
- Even after file filtering, problems persisted. Scanned PDFs with no extractable text, CAD files hundreds of MB in size, and encrypted documents blocked the pipeline. The solution was setting a file size limit and adding exception handling to skip files when text extraction failed.
- The indexing job itself became a long-term project. The laptop proved insufficient, so the work was moved to a Hetzner cloud VM, and indexing completed only after running continuously for 2–3 weeks. Total cost was €184—commenters widely noted this was 'a drop in the bucket compared to three weeks of labor.'
- Chunking strategy (splitting documents into appropriately sized pieces) had a major impact on retrieval quality. Chunks that are too large mix in irrelevant content; chunks that are too small lose context. Specialized format files like OrcaFlex were difficult to chunk meaningfully with standard text parsers and required separate handling.
Evidence
- Experienced RAG practitioners strongly resonated with the author's conclusion that 'data preprocessing is the key.' One commenter shared that they spent a week building an ETL with separate SQL and graph DBs just to extract pricing information from a single 20-page PDF with 100% accuracy, criticizing the naive approach of simply converting PDFs to markdown with docling and dumping them into a vector DB as 'absurd.'
- A counterargument emerged against the 'RAG is dead' claim: the notion that growing LLM context windows make RAG unnecessary was rebutted with the observation that 'even if all of Lord of the Rings fits in context, an entire law library, all of Wikipedia, or 451GB of data like in this post is still many times larger.'
- Conversely, one commenter viewed RAG as 'AI Lite' or 'AI-adjacent technology,' predicting that once context windows grow large enough, RAG's deterministic approach will limit LLMs' reasoning capabilities.
- There was also regret expressed over not using re-ranking: one commenter noted that 'it's a shame you used a good embedding model but skipped the re-ranker—with a re-ranker you can get away with a smaller, cheaper embedding model and even use smaller embedding vectors, which is actually an advantage.' Re-ranking is a technique where candidate documents retrieved via vector search are re-scored by an LLM for relevance.
- More advanced suggestions included structured data preprocessing and ReAG (Reasoning-based RAG): one commenter advised that 'instead of dumping unstructured data into a vector DB, doing basic preprocessing and labeling and embedding with different schemas dramatically improves retrieval quality and flexibility—adding a memory knowledge graph on top enables context that updates over time rather than static documents.'
- On the €184 VM cost, 'In a project where you've poured in three weeks of labor, €184 is just pocket change' garnered widespread agreement.
- A question about whether a RAG tool exists that works like SQLite, as a single file with no backend, also surfaced, reflecting many developers' desire for simpler RAG solutions.
How to Apply
- If you are building an internal document RAG system for the first time, build the data filtering pipeline before choosing an embedding model. Simply filtering out files that cannot yield meaningful text, such as mp4, exe, zip, backup files, and simulation output files, by extension and filename pattern alone can prevent a large share of memory explosion and pipeline crash issues.
- For rapid prototyping of a local LLM-based RAG system, consider starting with the Ollama + LlamaIndex + nomic-embed-text stack. This combination is great for building a working prototype with minimal code, but you must add file size limits and exception handling before feeding in real data.
- For large-scale document sets where indexing will take weeks or more, plan from the outset to run the job on a cloud VM like Hetzner rather than a laptop. VM costs are negligible compared to labor costs, and the ability to run uninterrupted for extended periods makes the process far more stable.
- To improve retrieval quality, adding a re-ranking stage to your pipeline may be more effective than swapping out the embedding model. The approach is to retrieve Top-K candidates via vector search and then have a re-ranker re-score them by actual relevance to the query, allowing you to use a smaller, cheaper embedding model in the first place.
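The retrieve-then-re-rank pattern can be sketched in plain Python. Everything here is illustrative: the toy vectors stand in for real embeddings, and `cross_encoder_score` (a simple term-overlap score) stands in for the slower, more accurate re-ranking model the commenters had in mind, such as a cross-encoder.

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_top_k(query_vec, docs, k):
    """Stage 1: cheap vector search over precomputed embeddings."""
    ranked = sorted(docs, key=lambda d: cosine_sim(query_vec, d["vec"]), reverse=True)
    return ranked[:k]

def rerank(query, candidates, score_fn):
    """Stage 2: re-score only the Top-K candidates with a more accurate model."""
    return sorted(candidates, key=lambda d: score_fn(query, d["text"]), reverse=True)

def cross_encoder_score(query: str, text: str) -> float:
    """Toy stand-in for a real re-ranker: fraction of query terms in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)
```

Because stage 2 sees only K documents instead of the whole corpus, it can afford a heavier model, which is exactly why the first-stage embeddings can be smaller and cheaper.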
Terminology
RAG: Short for Retrieval-Augmented Generation. A method where an LLM retrieves relevant content from pre-indexed documents to reference when generating a response—similar to looking up relevant pages in an open-book exam.
Embedding: The process of converting text into numerical vectors. Sentences with similar meanings are placed close together in vector space, enabling similarity-based search.
Chunking: The task of splitting long documents into pieces of a size suitable for retrieval. There is a trade-off: chunks that are too large introduce noise, while chunks that are too small lose context.
Re-ranking: The process of first retrieving a set of candidate documents via vector search, then having a separate model re-score and reorder them by actual relevance to the query. Effective for improving retrieval accuracy.
nomic-embed-text: An open-source text embedding model developed by Nomic AI. It can run locally and is known for strong performance on technical documents.
LlamaIndex: A Python-based RAG orchestration framework that bundles document loading, chunking, embedding generation, vector DB storage, and querying into a single pipeline.