
Local RAG for Document Processing: Architecture Guide

A technical guide to building local RAG pipelines for document processing -- covering ingestion, chunking, embedding, retrieval, and when direct file reading outperforms traditional RAG.


The retrieval problem in document processing

You have 500 contracts in a folder. A user asks: "Which contracts have a non-compete clause longer than two years?" The language model cannot read 500 files at once. Its context window, however large, has a limit. You need a way to find the relevant documents -- or the relevant sections of documents -- and present them to the model for analysis.

This is the retrieval problem. Retrieval-Augmented Generation (RAG) is the dominant approach to solving it. But RAG is not one thing. It is a pipeline with multiple stages, each with its own trade-offs. And for document processing specifically, the traditional RAG pipeline is not always the best approach.

This article covers the RAG architecture, what changes when you run it locally instead of in the cloud, and why some document processing tasks are better served by an agent that reads files directly rather than querying an index.

What RAG is and how it works

RAG combines retrieval with generation. Instead of asking the model to answer from its training data alone, you first retrieve relevant information from an external source and include it in the model's prompt. The model generates its answer using both its general knowledge and the specific retrieved content.

The standard RAG pipeline has six stages:

Ingest. Documents are loaded from their source: file system, database, API, or document management system. Each document is read and its content is extracted. For simple text files, this is trivial. For structured formats like DOCX, XLSX, or PDF, this requires parsing to extract the actual text content from the file format.

Chunk. Documents are split into smaller pieces. A 50-page contract becomes hundreds of chunks, each containing a paragraph, a section, or a fixed number of tokens. Chunking is necessary because embedding models have input limits, and because retrieving a whole document when only one paragraph is relevant wastes context window space.

Embed. Each chunk is converted into a vector -- a high-dimensional numerical representation that captures the semantic meaning of the text. This is done by an embedding model, which is a neural network trained to produce similar vectors for semantically similar text. "The tenant shall pay rent monthly" and "Monthly rent payments are required" produce similar vectors despite using different words.

Index. The vectors are stored in a searchable index. When a query comes in, the index finds the vectors most similar to the query vector. This is the core data structure of the RAG system. It can be a specialized vector database, a flat file of vectors with brute-force search, or a SQLite database with a vector extension.

Retrieve. When the user asks a question, the question is embedded using the same embedding model. The index is searched for the chunks whose vectors are most similar to the query vector. The top-k most similar chunks are returned.

Generate. The retrieved chunks are included in the model's prompt, along with the user's question. The model generates an answer grounded in the retrieved content. Because the model has the relevant text in its context, it can provide specific, accurate answers rather than general ones.
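The six stages can be sketched end-to-end in a few dozen lines. The following is a toy illustration, not a production design: a bag-of-words counter stands in for a real embedding model, the index is a plain Python list, and the final stage only assembles the prompt that would be sent to an LLM.

```python
import math
from collections import Counter

def chunk(text, size=40):
    # Chunk: fixed-size word windows (real systems count tokens, not words).
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # Embed: toy bag-of-words frequency vector.
    # A real pipeline would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_index(docs):
    # Ingest + chunk + embed + index, all in one pass over the collection.
    index = []
    for name, text in docs.items():
        for c in chunk(text):
            index.append({"doc": name, "text": c, "vec": embed(c)})
    return index

def retrieve(index, query, k=3):
    # Retrieve: embed the query with the same model, rank by similarity.
    qv = embed(query)
    ranked = sorted(index, key=lambda e: cosine(e["vec"], qv), reverse=True)
    return ranked[:k]

def make_prompt(query, chunks):
    # Generate: the retrieved chunks become grounding context for the LLM.
    context = "\n---\n".join(c["text"] for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Every real system replaces each of these stubs with something heavier (a tokenizer-aware chunker, a neural embedding model, an ANN index), but the data flow between the stages stays exactly this shape.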

Each stage introduces complexity and potential failure. Bad chunking means relevant content is split across chunks and never retrieved together. Bad embeddings mean semantically similar content is not recognized as similar. A weak index means slow retrieval or missed results. These are not theoretical concerns -- they are the daily engineering challenges of any production RAG system.

Local RAG vs cloud RAG

Most RAG tutorials and products assume a cloud architecture: documents stored in a cloud bucket, embeddings generated by a cloud API, vectors stored in a managed vector database, generation handled by a cloud LLM. This works, but it comes with trade-offs that are particularly significant for document processing.

Privacy. Cloud RAG means your documents, their embeddings, and their chunks all live on someone else's servers. For legal contracts, financial reports, HR documents, or medical records, this may be unacceptable. Local RAG keeps everything on the user's machine.

Latency. Cloud RAG adds network round-trips at every stage. Embedding a query requires an API call. Retrieving from a cloud vector database requires another. These add 100-500ms per stage. Local RAG runs the embedding model and the index search on the same machine, with latency measured in milliseconds for the retrieval portion.

Cost. Cloud embedding APIs charge per token. Embedding 500 contracts at ingest time, plus embedding every incoming query, adds up. Local embedding models are free to run after the initial download, though they use local compute resources.

Index size. Cloud vector databases can store billions of vectors. Local storage is limited by the user's disk and RAM. For most document processing use cases -- hundreds to low thousands of documents -- local storage is sufficient. For enterprise-scale collections with millions of documents, cloud databases are necessary.

Model quality. Cloud embedding APIs offer the best embedding models -- large parameter counts, trained on massive datasets, producing high-quality vectors. Local embedding models are smaller and generally less accurate. The quality gap has narrowed significantly, but it exists.

Offline operation. Local RAG works without an internet connection (for the retrieval portion -- the generation step still needs an LLM, which may be cloud-hosted). Cloud RAG requires connectivity for every operation.

The trade-off summary: local RAG offers better privacy, lower latency, and no per-query cost, at the expense of embedding quality and index scale. For document processing on personal or team-sized collections, local RAG is often the better fit.

Embedding models for documents

The embedding model is the most critical component of a RAG pipeline. It determines whether semantically similar content is recognized as similar. A bad embedding model means bad retrieval, and no amount of engineering elsewhere in the pipeline can compensate.

For local deployment, you need embedding models that run on consumer hardware -- CPU-only or with modest GPU requirements. The landscape has improved substantially:

Small models (under 500MB) like MiniLM variants produce 384-dimensional vectors. Fast and lightweight, they run on any modern laptop. Quality is acceptable for general text but degrades on specialized vocabulary -- legal terms, medical terminology, financial jargon.

Medium models (500MB-2GB) like E5, BGE, or GTE variants produce 768- or 1024-dimensional vectors. Better quality, especially on longer passages. Many are multilingual. They run well on CPU with latency of 10-50ms per chunk.

Large models (2GB+) approach cloud API quality but require significant compute. On a modern laptop CPU, embedding a single chunk might take 100-500ms -- fine for query-time embedding but slow for batch ingestion.

Document-specific considerations. Documents contain tables, headers, numbered lists, cross-references, and domain vocabulary that embedding models trained on web text may not handle well. A table that says "Payment: $50,000 | Due: 30 days" is semantically rich but structurally different from prose. Test on your specific document types.

Chunking interacts with embedding quality. If chunks are too long, the embedding becomes a vague average of multiple topics; if too short, it captures a fragment without enough context. The optimal chunk size depends on the embedding model, document structure, and query patterns -- 256 to 512 tokens is a common starting point, but domain-specific tuning is usually necessary.
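A common refinement is overlapping windows, so that a sentence straddling a chunk boundary appears intact in both neighbors. A minimal sketch, using words as a stand-in for tokens (a real pipeline would count tokens with the embedding model's own tokenizer):

```python
def chunk_words(text, size=256, overlap=32):
    """Split text into overlapping word windows.

    Words approximate tokens here. The overlap keeps sentences
    that straddle a boundary present in both neighboring chunks.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```

The `size` and `overlap` defaults are starting points, not recommendations; as the text above notes, the right values come out of testing on your own documents and queries.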

Index storage for local RAG

Where you store your vectors matters more than it seems. The index is the data structure that enables fast similarity search, and the choice of storage backend affects performance, complexity, and portability.

Flat files (FAISS, NumPy arrays). Store all vectors in a flat file, search by brute-force cosine similarity. For small collections (under 10,000 vectors), this is surprisingly effective. It does not scale -- brute-force on a million vectors is too slow.
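For collections in that range, brute-force search is a few lines of NumPy. A sketch of the whole "index" (function name and shapes are illustrative):

```python
import numpy as np

def top_k(vectors: np.ndarray, query: np.ndarray, k: int = 5):
    """Brute-force cosine similarity search.

    vectors: (n, d) matrix of chunk embeddings
    query:   (d,) query embedding
    Returns (indices, scores) of the k most similar chunks.
    """
    # Normalize once; cosine similarity then reduces to a dot product.
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = v @ q                        # (n,) similarity scores
    idx = np.argsort(scores)[::-1][:k]    # top-k, highest first
    return idx, scores[idx]
```

At 10,000 vectors of dimension 768 this is a single small matrix-vector product, which is why flat storage holds up surprisingly well at personal-collection scale.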

SQLite with vector extensions. SQLite-vec adds vector similarity search to SQLite. Attractive for local RAG because SQLite requires no server and stores everything in a single file. Queries can combine vector similarity with metadata filters ("find similar chunks but only from contracts signed after 2024").
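sqlite-vec exposes that combined query directly in SQL via a virtual table. As a dependency-free sketch of the same pattern, using only the standard library, with vectors packed into BLOBs and similarity computed in Python (schema and column names are illustrative, and sqlite-vec would do the ranking inside SQLite itself):

```python
import math
import sqlite3
import struct

def pack(vec):
    # Store a float vector as a compact BLOB (4 bytes per dimension).
    return struct.pack(f"{len(vec)}f", *vec)

def unpack(blob):
    return struct.unpack(f"{len(blob) // 4}f", blob)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    doc TEXT, signed_year INTEGER, embedding BLOB)""")

def search(query_vec, min_year, k=5):
    # Metadata filter runs in SQL; similarity ranking runs in Python.
    rows = db.execute(
        "SELECT id, doc, embedding FROM chunks WHERE signed_year > ?",
        (min_year,)).fetchall()
    scored = [(cosine(unpack(e), query_vec), cid, doc) for cid, doc, e in rows]
    return sorted(scored, reverse=True)[:k]
```

The filter-then-rank split is the point: SQL narrows the candidate set cheaply by metadata before any vector math runs, which is exactly the "similar chunks, but only from contracts signed after 2024" query shape.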

Specialized vector databases (Chroma, Qdrant, Milvus). Better performance on large collections with ANN search and richer query APIs. The cost is complexity: they require a running server process. For a desktop application, that is a significant UX burden.

In-memory indexes (HNSW). Hierarchical Navigable Small World graphs offer sub-millisecond search on hundreds of thousands of vectors. The trade-off is memory usage -- the entire index must fit in RAM.

For local document processing, the practical choice is usually SQLite with a vector extension or a flat file index. No external services, works on any platform, handles personal and team-scale collections.

Retrieval quality: precision and recall

Retrieval is not just about finding something. It is about finding the right things and not finding the wrong things.

Precision measures how many of the retrieved results are actually relevant. If you retrieve 10 chunks and 7 are relevant, precision is 70%. Low precision means the model's context is polluted with irrelevant information, which wastes context window space and can lead to incorrect answers.

Recall measures how many of the relevant items were successfully retrieved. If there are 20 relevant chunks in the collection and your retrieval returns 7 of them, recall is 35%. Low recall means the model is missing information it needs to produce a complete answer.

There is a tension between precision and recall. Retrieving more results (higher k) improves recall but typically decreases precision. Retrieving fewer results improves precision but risks missing important information.
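Both metrics are straightforward to compute per query once you have human relevance judgments. A small helper, using the same numbers as the definitions above:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k and recall@k for one query.

    retrieved: ranked list of chunk ids returned by the system
    relevant:  set of chunk ids a human judged relevant
    """
    top = retrieved[:k]
    hits = sum(1 for cid in top if cid in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Sweeping k with this function on a test set makes the precision-recall tension concrete: as k grows, recall climbs toward 1.0 while precision falls.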

For document processing, the consequences of low recall are often worse than low precision. Missing a relevant contract clause is more damaging than including an irrelevant one. The model can usually ignore irrelevant retrieved chunks, but it cannot reason about information it never received.

Several factors affect retrieval quality in document collections. Query-document mismatch is common: "Can the landlord terminate early?" needs to match "The lessor reserves the right to end this agreement prior to the stated term." Multi-hop reasoning requires information from non-adjacent chunks -- triggering conditions in one section, liability amounts in another. Table content is challenging because table rows depend on column headers for meaning, and chunking may separate them. Cross-document questions like "which contract has the longest warranty period?" require finding similar clauses across all documents, not just in one.

These challenges are particularly acute in document processing because documents are more structured, more varied, and more specialized than the web text most retrieval systems are optimized for.

When RAG helps and when it does not

RAG is not universally the right approach. Its value depends on the size of the document collection and the nature of the query.

RAG helps when:

The collection is large. Hundreds or thousands of documents exceed any model's context window. You need retrieval to find the relevant subset. Without RAG, you cannot even present the relevant information to the model.

Queries are specific. "Find all indemnification clauses" is a retrieval task. The user wants to locate specific content across a large collection. RAG's similarity search is well-suited to this.

The relevant information is concentrated. If the answer to the question is in one paragraph of one document, RAG is efficient: find that paragraph, present it to the model, get the answer. The rest of the collection is irrelevant and correctly excluded.

RAG does not help when:

The collection is small. If you have 5 documents that fit in the model's context window, there is no retrieval problem. Send the full documents to the model and let it reason about the complete content. RAG adds complexity without benefit.

The question requires holistic understanding. "Summarize the themes across all these reports" requires reading all the reports, not finding the most similar chunks. RAG retrieves fragments, not overviews.

The question requires structure-aware reasoning. "Is the total in the summary table consistent with the line items?" requires reading the table structure, understanding relationships between cells, and performing calculations. This is not a retrieval task.

The relevant information is distributed. If answering the question requires synthesizing information from many parts of many documents, RAG's top-k retrieval may not return enough chunks from enough documents to cover the answer space.

The hybrid approach

The most effective document processing systems combine RAG with other approaches rather than relying on RAG alone.

RAG for discovery, full-context for analysis. Use RAG to find which documents are relevant to the user's question. Then load those full documents into the model's context for detailed analysis. This gives you the efficiency of retrieval for large collections and the accuracy of full-context reasoning for the actual analysis.

Chunked retrieval plus contextual expansion. Retrieve the most relevant chunks, then expand each chunk to include its surrounding context -- the preceding and following paragraphs, the section headers, the full table rather than a single row. This mitigates the chunking problem without loading entire documents.
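If chunks are stored in document order, contextual expansion is a simple index-window operation. A minimal sketch (expanding by whole neighboring chunks; a real system might instead expand to section boundaries or full tables):

```python
def expand_chunk(chunks, hit_index, window=1):
    """Expand a retrieved chunk to include its neighbors.

    chunks:    all chunks of one document, in document order
    hit_index: position of the retrieved chunk
    window:    how many chunks to include on each side
    """
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return " ".join(chunks[start:end])
```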

Iterative retrieval. Instead of one retrieve-then-generate cycle, let the agent retrieve, analyze, and retrieve again based on what it learned. The first retrieval might find relevant sections. The model identifies that it needs more context about a specific term. A second retrieval focuses on that term. This is closer to how a human researcher works -- you do not read everything once and then answer. You search, read, refine your search, and read more.
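The control flow of that loop can be sketched with the retriever and the model as placeholders (`retrieve` and `ask_model` here are hypothetical callables, standing in for a real index query and LLM call):

```python
def iterative_answer(question, retrieve, ask_model, max_rounds=3):
    """Retrieve-analyze-retrieve loop.

    retrieve(query) -> list of text chunks
    ask_model(question, context) -> dict with either
        {"answer": str} or {"followup_query": str}
    Both are stand-ins for a real search index and LLM call.
    """
    context = []
    query = question
    for _ in range(max_rounds):
        context.extend(retrieve(query))
        result = ask_model(question, context)
        if "answer" in result:
            return result["answer"]
        # The model asked for more context; refine the query and go again.
        query = result["followup_query"]
    # Out of rounds: answer with whatever context was gathered.
    return ask_model(question, context).get("answer", "insufficient context")
```

The `max_rounds` cap matters in practice: without it, a model that keeps asking for more context turns into an unbounded loop of retrievals.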

How docrew approaches this differently

docrew does not use a traditional RAG pipeline for document processing. Instead, it takes a direct approach: the agent reads files from the file system, uses its tools to navigate the document collection, and relies on its reasoning capabilities to determine what is relevant.

Here is how that works in practice. When a user asks about a collection of documents, the agent uses file_list to see what is in the folder. It reads file names and metadata to form an initial understanding of the collection. Then it uses file_read and file_grep to navigate the content -- reading specific files, searching for terms across the collection, and iterating based on what it finds.

This is closer to how a human works with documents. You do not embed all your contracts into a vector store before you can answer a question about them. You look at the folder, open the relevant files, search for the terms you care about, and read the sections that matter.
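The primitives involved are ordinary file-system operations. A rough stdlib sketch of what list-and-grep navigation looks like (the function names are mine, not docrew's internals, and the `*.txt` glob is an illustrative simplification):

```python
import re
from pathlib import Path

def list_files(folder):
    # Rough analogue of a file_list tool: names plus sizes.
    return [(p.name, p.stat().st_size)
            for p in sorted(Path(folder).glob("*.txt"))]

def grep_files(folder, pattern):
    # Rough analogue of a file_grep tool: matching lines with locations.
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for p in sorted(Path(folder).glob("*.txt")):
        for lineno, line in enumerate(p.read_text().splitlines(), start=1):
            if rx.search(line):
                hits.append((p.name, lineno, line.strip()))
    return hits
```

The agent's value is not in these primitives themselves but in deciding when to call which one, and what to read next based on the results.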

The advantages of this approach for document processing:

No ingest step. There is no upfront cost to index the collection. The agent works with files as they are on disk. Add a new document to the folder and the agent can work with it immediately, without waiting for it to be chunked, embedded, and indexed.

Structure preservation. The agent reads the full document with its structure intact -- headings, tables, sections, numbering. It does not work with decontextualized chunks that have lost their position in the document hierarchy.

Flexible navigation. The agent decides how to navigate the collection based on the specific question. For "find all payment terms," it might grep for "payment" across all files and then read the relevant sections. For "compare these two contracts," it reads both in full. The navigation strategy adapts to the task.

No embedding model dependency. There is no need to choose, deploy, and maintain an embedding model. No concerns about embedding quality on domain-specific vocabulary. No vector index to store and update.

The trade-offs are real. This approach is slower on very large collections because the agent reads files sequentially or in parallel rather than querying a pre-built index. For a collection of 10,000 documents, a RAG index would return relevant chunks in milliseconds, while the agent would need time to search and read. The break-even point depends on the query type, the collection size, and how much of the collection is relevant to any given question.

For the collection sizes that most individuals and teams work with -- dozens to low hundreds of documents -- direct file access is faster end-to-end because it eliminates the ingest and indexing overhead. You get answers on the first query, not after a multi-minute indexing step.

Building a local RAG pipeline: practical considerations

If your use case calls for a local RAG pipeline, a few practical principles apply.

Start with the retrieval, not the generation. Build and test the retrieval pipeline first. Verify that the right chunks are returned for representative queries before connecting to a language model. If retrieval is poor, generation will be poor.

Test chunking empirically. Split a representative document at different sizes (128, 256, 512, 1024 tokens), embed the chunks, run representative queries, and measure which size produces the best results. The answer varies by document type and query pattern.

Invest in metadata. Store document name, section heading, page number, and date alongside vectors. This enables filtered retrieval and helps the model contextualize content.

Handle updates. When documents change, their chunks and embeddings need regeneration. Build a mechanism to detect changes (file timestamps, content hashes) and re-index selectively.
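Content hashes are the more robust of the two signals, since timestamps change on a touch even when the bytes did not. A minimal sketch of hash-based change detection (function names are illustrative):

```python
import hashlib
from pathlib import Path

def content_hash(path):
    # Hash file content; unlike timestamps, this ignores touch-only changes.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def stale_files(folder, known_hashes):
    """Return files whose chunks and embeddings need regeneration.

    known_hashes: {filename: hash} recorded when the index was last built.
    New files (not in known_hashes) are reported as stale too.
    """
    stale = []
    for p in sorted(Path(folder).glob("*")):
        if p.is_file() and known_hashes.get(p.name) != content_hash(p):
            stale.append(p.name)
    return stale
```

Re-indexing then touches only the returned files: delete their old chunks from the index, re-chunk, re-embed, and update the stored hashes.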

Measure retrieval quality. Build a test set of questions with known relevant chunks. Measure precision and recall. Subjective evaluation is unreliable because you do not know what the system is missing.

Where RAG is heading

The boundary between RAG and agent-based approaches is blurring. Agentic RAG -- where retrieval is performed by an agent that reasons about what to retrieve, refines queries, and iterates -- combines index efficiency with agent flexibility. Long context windows are also shifting the calculus. A 1M-token context window can hold several hundred pages of text. For many document processing tasks, that is the entire collection.

The future likely involves adaptive systems that choose the right retrieval strategy based on the collection and the query: direct file reading for small collections, RAG for large ones, hybrid approaches for everything in between. What will not change is the fundamental trade-off: pre-indexing saves query-time cost at the expense of upfront cost, while direct reading eliminates upfront cost at the expense of query-time cost. The engineering challenge is building systems flexible enough to support both.
