
Document Chunking Strategies: How to Split Files for AI

A technical guide to document chunking strategies for AI processing -- fixed-size, semantic, structural, and sliding window approaches, with trade-offs for RAG pipelines and large language models.


The chunking problem

Every document processing system that uses a language model faces the same question: how do you break a large document into pieces the model can work with?

The question sounds trivial. It is not.

A 200-page contract does not fit in a single model context window -- or if it does, it wastes tokens and money on content irrelevant to the current question. A retrieval-augmented generation (RAG) system needs to store document pieces in a vector database and retrieve only the relevant ones. A summarization pipeline needs to process sections independently before combining them.

In all these cases, you need to split the document into chunks. The way you split it determines the quality of everything downstream. Bad chunks produce bad retrieval, bad analysis, and bad answers.

There is no universal chunking strategy. The right approach depends on the document type, the task, the model, and the cost constraints. But the trade-offs are well understood, and understanding them is the difference between a system that works reliably and one that produces inconsistent results.

Why chunking matters

Three forces drive the need for chunking: context window limits, retrieval accuracy, and cost.

Context window limits are the most obvious driver. Even with models that support million-token context windows, stuffing an entire document into every request is wasteful. If a user asks "what is the termination clause in this contract," sending the entire 200-page agreement to the model means the model processes 199 pages of irrelevant content. This costs tokens, adds latency, and can actually reduce answer quality because the model has to find the needle in the haystack.

Retrieval accuracy matters in RAG systems. When a document is embedded in a vector database, the quality of retrieval depends entirely on the quality of the chunks. If a chunk contains the answer to the user's question mixed with unrelated content, the embedding will be a noisy average of relevant and irrelevant information. The retrieval system may rank it lower than a clean, focused chunk -- or worse, retrieve a different chunk entirely.

Cost compounds the problem. Language model pricing is per-token. Processing an entire document for every question multiplies the cost by the number of questions. Chunking lets you process only the relevant portions, reducing per-query cost dramatically. For a system handling thousands of queries per day across hundreds of documents, the cost difference between sending full documents and sending targeted chunks can be orders of magnitude.

Fixed-size chunking

The simplest approach: split the document into pieces of N characters or N tokens.

Fixed-size chunking is predictable and easy to implement. You pick a size -- say 1,000 tokens -- and walk through the document, cutting every 1,000 tokens. Every chunk is the same size. Every chunk fits within your model's context window. Implementation is a for loop.
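That for loop can be sketched in a few lines. This is a minimal character-based version; a production system would count tokens with the model's tokenizer rather than characters, but the structure is the same.

```python
def chunk_fixed(text: str, size: int = 1000) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters.

    Character counts stand in for token counts here; swap in a real
    tokenizer for token-accurate chunks.
    """
    return [text[i:i + size] for i in range(0, len(text), size)]
```

Every chunk except possibly the last has exactly `size` characters, which is the predictability the approach trades semantic awareness for.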

The problem is that fixed-size chunking is linguistically blind. It cuts mid-sentence, mid-paragraph, mid-thought. A chunk might start in the middle of a clause definition and end in the middle of an unrelated clause. The chunk boundary has no relationship to the document's meaning.

For embedding-based retrieval, this is particularly damaging. A sentence split across two chunks has its meaning diluted in both. The embedding of each half-sentence captures partial meaning, making neither chunk a strong match for queries about that sentence.

Fixed-size chunking works acceptably when the document has no meaningful structure -- raw transcripts, log files, stream-of-consciousness text. For structured documents like contracts, reports, or technical papers, it discards the very structure that makes the document comprehensible.

When to use it: Prototyping, unstructured text, or as a fallback when no structure can be detected. Not for production systems working with structured documents.

Semantic chunking

Semantic chunking splits text at meaning boundaries -- paragraph breaks, topic shifts, or changes in subject matter.

The simplest form of semantic chunking uses paragraph breaks as chunk boundaries. Paragraphs are natural units of meaning in most documents. Each paragraph typically covers one idea or one point. Splitting at paragraph boundaries preserves complete thoughts.

More sophisticated semantic chunking uses embedding similarity to detect topic shifts. The system computes embeddings for consecutive sentences, and when the similarity between adjacent sentences drops below a threshold, it inserts a chunk boundary. This captures topic transitions even when the document does not have explicit paragraph breaks.
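The threshold test can be sketched as follows. The embeddings themselves would come from an embedding model; this sketch assumes they are already computed and only shows the boundary decision, with a hypothetical threshold of 0.7.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def boundary_indices(sentence_embeddings: list[list[float]],
                     threshold: float = 0.7) -> list[int]:
    """Indices i where a chunk boundary goes between sentence i and i+1."""
    return [i for i in range(len(sentence_embeddings) - 1)
            if cosine(sentence_embeddings[i],
                      sentence_embeddings[i + 1]) < threshold]
```

The right threshold is document-dependent and usually tuned empirically against a sample of the corpus.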

The advantage of semantic chunking is that each chunk contains a coherent idea. The embedding of the chunk accurately represents its content. Retrieval accuracy improves because queries match against focused, meaningful chunks rather than arbitrary slices.

The disadvantage is inconsistent chunk sizes. Some paragraphs are a single sentence; others run for hundreds of tokens. Very short chunks lack context for the model to work with. Very long chunks reintroduce the problems of fixed-size chunking.

A common mitigation is to merge short chunks together and split long ones. Paragraphs under 100 tokens get merged with the next paragraph. Paragraphs over 2,000 tokens get split at sentence boundaries. This maintains semantic coherence while keeping chunks within a usable size range.
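A sketch of that merge-and-split pass, approximating token counts with whitespace word counts (an assumption for brevity; use a real tokenizer in production):

```python
def normalize_chunks(paragraphs: list[str],
                     min_tokens: int = 100,
                     max_tokens: int = 2000) -> list[str]:
    """Merge paragraphs under min_tokens; split chunks over max_tokens
    at sentence boundaries. Word counts approximate token counts."""
    merged, buffer = [], ""
    for para in paragraphs:
        buffer = (buffer + "\n\n" + para) if buffer else para
        if len(buffer.split()) >= min_tokens:
            merged.append(buffer)
            buffer = ""
    if buffer:  # trailing short remainder joins the last chunk
        if merged:
            merged[-1] += "\n\n" + buffer
        else:
            merged.append(buffer)

    result = []
    for chunk in merged:
        if len(chunk.split()) <= max_tokens:
            result.append(chunk)
            continue
        # Crude sentence split; a real system would use an NLP sentence
        # segmenter instead of punctuation heuristics.
        sentences = (chunk.replace("? ", "?|").replace("! ", "!|")
                          .replace(". ", ".|").split("|"))
        piece = ""
        for s in sentences:
            if piece and len((piece + " " + s).split()) > max_tokens:
                result.append(piece)
                piece = s
            else:
                piece = (piece + " " + s).strip()
        if piece:
            result.append(piece)
    return result
```

The thresholds (100 and 2,000 tokens here) match the ranges discussed above but should be tuned to the corpus.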

When to use it: General-purpose document processing, articles, reports, and any document where paragraph structure is meaningful.

Structural chunking

Structural chunking follows the document's own hierarchy: headings, sections, chapters, subsections.

A well-structured document has an outline. Chapters contain sections, sections contain subsections, subsections contain paragraphs. This hierarchy is meaningful -- the author organized the document this way because the content groups naturally into these divisions.

Structural chunking respects that organization. Each section becomes a chunk. Each subsection becomes a chunk. The hierarchy is preserved in metadata, so the system knows that chunk 3.2.1 belongs to section 3.2, which belongs to chapter 3.

For retrieval, structural chunking produces highly focused chunks. A query about "limitation of liability" retrieves the limitation of liability section, not a random paragraph that happens to mention liability. The heading itself provides strong semantic signal that improves retrieval accuracy.

The challenge is detecting structure. DOCX files have explicit heading styles -- Heading 1, Heading 2, and so on. A proper parser can extract the hierarchy directly. PDF files are harder. A PDF has no semantic heading information; headings are just larger or bolder text. Detecting them requires heuristic analysis of font sizes, weights, and spacing. Markdown and HTML have explicit heading tags. Plain text has none.
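For a format with explicit markup, structural chunking is straightforward. A minimal sketch for Markdown, recording the heading path for each chunk's metadata:

```python
import re

def chunk_by_headings(markdown: str) -> list[dict]:
    """Split markdown at heading lines, tracking the heading hierarchy."""
    chunks, path, body = [], [], []

    def flush():
        if any(line.strip() for line in body):
            chunks.append({"path": " > ".join(path),
                           "text": "\n".join(body).strip()})
        body.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            del path[level - 1:]          # pop headings at this level or deeper
            path.append(m.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

For PDFs, the heading detection step (font-size and weight heuristics) would replace the regex, but the hierarchy-tracking logic stays the same.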

Another challenge is hierarchical depth. Splitting at the top level (chapters) produces very large chunks. Splitting at the deepest level (sub-subsections) might produce very small ones. The right level depends on the document and the task. For a contract, splitting at the clause level (typically Heading 2 or 3) produces chunks of a few hundred to a few thousand tokens -- a useful size for most tasks.

When to use it: Contracts, legal documents, technical papers, regulatory filings, manuals -- any document with explicit hierarchical structure.

Sliding window chunking

Sliding window chunking adds overlap between consecutive chunks. Instead of cutting cleanly at position N, the next chunk starts at position N minus some overlap.

The overlap exists to solve the boundary problem. When fixed-size or semantic chunking splits a document, information at the boundary is split between two chunks. A sentence that spans the boundary loses context in both chunks. Overlap ensures that boundary content appears in at least two chunks, so retrieval can find it regardless of which chunk it matches.

Typical overlap ranges from 10% to 25% of the chunk size. A 1,000-token chunk with 20% overlap starts its next chunk 800 tokens later, so the last 200 tokens of chunk N are also the first 200 tokens of chunk N+1.
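A sliding window chunker over a pre-tokenized document, using the 1,000-token / 200-token-overlap figures above:

```python
def chunk_sliding(tokens: list[str],
                  size: int = 1000,
                  overlap: int = 200) -> list[list[str]]:
    """Windows of `size` tokens; each starts `size - overlap` after the last."""
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final window reached the end of the document
    return chunks
```

With these defaults, a 100,000-token document produces 125 chunks, matching the arithmetic in the next paragraph.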

The trade-off is redundancy. Overlap increases the total number of chunks and the storage requirements of the vector database. A 100,000-token document with 1,000-token chunks and no overlap produces 100 chunks. With 20% overlap, it produces 125 chunks. At 50% overlap, 200 chunks.

More chunks means more embedding computations, more storage, and more retrieval candidates to rank. The retrieval system might return overlapping chunks that share significant content, wasting context window space on duplicated text. Deduplication at retrieval time -- merging overlapping chunks before sending them to the model -- mitigates this.
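That retrieval-time merge can be sketched as interval merging. This assumes each retrieved chunk carries its character offsets in the source document (a simplification; real systems store offsets or chunk indices in metadata):

```python
def merge_retrieved(chunks: list[tuple[int, int, str]]) -> list[str]:
    """Merge retrieved chunks whose (start, end) character spans overlap.

    Each chunk is (start, end, text), with text aligned to the span.
    """
    merged: list[tuple[int, int, str]] = []
    for start, end, text in sorted(chunks):
        if merged and start < merged[-1][1]:
            prev_start, prev_end, prev_text = merged[-1]
            tail = text[prev_end - start:]  # non-overlapping suffix only
            merged[-1] = (prev_start, max(prev_end, end), prev_text + tail)
        else:
            merged.append((start, end, text))
    return [text for _, _, text in merged]
```

The merged output is what gets sent to the model, so the duplicated overlap region costs nothing at generation time.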

When to use it: In RAG pipelines where boundary splitting causes retrieval misses. Less necessary when using semantic or structural chunking, where boundaries align with natural meaning boundaries.

Document-type-specific strategies

Different document types have different natural units. The best chunking strategy respects the document's native structure.

Contracts should be chunked by clause. A contract is a collection of numbered clauses, each addressing a specific topic: definitions, obligations, termination, liability, confidentiality. Chunking by clause produces semantically complete units that map directly to the questions users ask. "What does clause 7.3 say about indemnification?" retrieves exactly clause 7.3.

Academic papers split naturally at the section level: abstract, introduction, methodology, results, discussion, conclusion. Each section serves a distinct purpose. Chunking at the section level lets the retrieval system return the methodology when asked "how was the study conducted?" and the results when asked "what were the findings?"

Spreadsheets require a different approach entirely. A spreadsheet is not running text -- it is structured tabular data. Chunking by sheet is the coarsest approach. Chunking by table (if a sheet contains multiple tables separated by blank rows) is finer-grained. For very large tables, chunking by row groups with column headers repeated in each chunk preserves the tabular structure within each chunk.
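The row-group approach with repeated headers is simple to implement once the table is extracted as rows. A sketch, with a hypothetical group size of 50 rows:

```python
def chunk_table(rows: list[list[str]],
                group_size: int = 50) -> list[list[list[str]]]:
    """Split a table into row groups, repeating the header row in each chunk
    so every chunk is interpretable on its own."""
    header, body = rows[0], rows[1:]
    return [[header] + body[i:i + group_size]
            for i in range(0, len(body), group_size)]
```

Because every chunk starts with the header row, a retrieved chunk can be rendered back into a standalone table without consulting its neighbors.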

PDFs present a choice: chunk by page or by section. Page-based chunking is easy (every PDF parser can extract text by page) but semantically weak -- a section that spans pages 4 and 5 gets split between two chunks. Section-based chunking is semantically stronger but requires detecting section boundaries in a format that has no semantic markup.

Email threads chunk naturally by individual message. Each email in a thread is a self-contained communication. Chunking by message preserves the sender, timestamp, and context of each contribution.

Meeting transcripts can be chunked by speaker turn, by topic segment (detected via topic modeling), or by time windows. Speaker-turn chunking preserves conversational context but may produce very short chunks. Topic-based chunking is harder to implement but produces more semantically meaningful units.

Chunk size trade-offs

The optimal chunk size is a compromise between competing forces.

Too small (under 100 tokens): the chunk lacks context. A single sentence removed from its surrounding paragraphs loses meaning. The embedding captures the words but not the context. Retrieval may find the chunk but the model cannot make sense of it without the surrounding information.

Too large (over 5,000 tokens): the chunk is diluted. It contains multiple topics, and the embedding is an average that represents none of them well. Retrieval accuracy drops because the chunk matches many queries weakly rather than one query strongly. The model receives more content than it needs, increasing cost and potentially reducing focus.

The practical sweet spot for most document processing tasks is 200 to 1,500 tokens per chunk. This range is large enough to contain a complete thought or section, and small enough to be semantically focused.

But the right size depends on the task. Summarization benefits from larger chunks (the model needs enough context to produce a meaningful summary). Question answering benefits from smaller, focused chunks (the model needs the specific paragraph that contains the answer). Classification can work with very small chunks (a few sentences are enough to classify a document type).

The model's context window also matters. If the retrieval system returns 10 chunks and each is 1,500 tokens, that is 15,000 tokens of context. Add the system prompt and the query, and the total request might be 20,000 tokens. This is fine for models with 128K or 1M context windows. For models with 8K context windows, either the chunks need to be smaller or fewer chunks should be retrieved.

Metadata preservation

A chunk without metadata is a fragment without context. Knowing that a paragraph says "the liability cap is $5 million" is useful. Knowing that it comes from "Section 12.3: Limitation of Liability" of "Vendor Agreement dated January 15, 2026" is far more useful.

Effective chunking systems attach metadata to every chunk:

Source document: which file did this chunk come from? File name, path, or document ID.

Position: where in the document does this chunk appear? Page number, section number, paragraph index.

Hierarchy: what is the structural context? The heading path -- "Chapter 3 > Section 3.2 > Subsection 3.2.1" -- tells the model where the chunk sits in the document's organization.

Neighboring chunks: what comes before and after? Pointers to adjacent chunks enable the system to retrieve surrounding context when a single chunk is insufficient.

Document-level metadata: author, date, document type, version. This enables filtering at retrieval time -- "find the termination clause, but only in contracts signed after 2025."

When the model receives a chunk for processing, including the metadata as a preamble gives it the context it needs to interpret the content correctly. A chunk that says "30 days" is ambiguous. A chunk prefixed with "From: Vendor Agreement / Section 8: Termination / Subsection 8.2: Notice Period" makes "30 days" unambiguous.
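One way to carry that metadata is a small chunk record with a preamble method. The field names here are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Chunk:
    text: str
    source: str                          # file name or document ID
    heading_path: list[str] = field(default_factory=list)
    page: Optional[int] = None

    def with_preamble(self) -> str:
        """Prefix the text with its provenance so the model has context."""
        parts = [f"From: {self.source}"]
        if self.heading_path:
            parts.append(" / ".join(self.heading_path))
        if self.page is not None:
            parts.append(f"Page {self.page}")
        return " | ".join(parts) + "\n\n" + self.text
```

Sending `chunk.with_preamble()` instead of `chunk.text` costs a handful of tokens per chunk and resolves exactly the "30 days" ambiguity described above.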

The cost of metadata is storage and a small amount of additional token usage. The benefit is dramatically better retrieval and interpretation. This is almost always a worthwhile trade.

Evaluating chunking strategies

Chunking is not a set-it-and-forget-it decision. Different strategies produce different results, and the only way to know which works best is to measure.

Retrieval accuracy is the primary metric for RAG systems. Given a set of test queries with known relevant passages, what percentage of relevant passages does the retrieval system find? Compare this across chunking strategies. If structural chunking retrieves the correct passage 85% of the time and fixed-size chunking retrieves it 65% of the time, the choice is clear.
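The comparison reduces to a recall computation over a labeled test set. A minimal sketch, counting a query as a hit if any known-relevant passage appears in its retrieved set:

```python
def retrieval_recall(retrieved: dict[str, set[str]],
                     relevant: dict[str, set[str]]) -> float:
    """Fraction of queries for which at least one relevant passage
    was retrieved. Keys are query IDs; values are passage ID sets."""
    hits = sum(1 for query, rel in relevant.items()
               if retrieved.get(query, set()) & rel)
    return hits / len(relevant)
```

Run the same labeled queries through each chunking strategy's index and compare the scores; stricter variants (recall@k, MRR) follow the same shape.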

End-to-end answer quality is the ultimate metric. Retrieval accuracy is a proxy -- what actually matters is whether the final answer is correct. A chunking strategy that retrieves the right passage but splits it across two chunks might still produce wrong answers if the model only sees one chunk.

Chunk coherence can be measured by asking the model to assess whether each chunk is self-contained and meaningful without additional context. Chunks that require surrounding context to understand are poorly bounded.

Coverage measures whether the entire document is represented in the chunk set. Some chunking strategies may drop content -- table footers, page headers, appendix references. Coverage analysis ensures nothing important is lost.

Cost analysis compares the total tokens consumed per query across strategies. A strategy that produces twice as many chunks costs more per query in retrieval, embedding, and model processing.

In practice, most teams evaluate chunking by testing against a set of representative queries and documents, comparing answer quality across strategies. The strategy that produces the best answers for the most queries, at acceptable cost, wins.

The alternative: just read the whole document

With context windows reaching 1 million tokens and beyond, a reasonable question is: why chunk at all?

A 200-page contract is roughly 100,000 tokens. A model with a 1 million token context window can process it in a single request. No chunking, no retrieval, no embedding. Just send the whole document and ask your question.

This approach has real advantages. No information is lost to chunking boundaries. The model has full context for every question. There is no retrieval step to fail or return wrong chunks. Implementation is trivially simple.

And it has real disadvantages. Cost: processing 100,000 tokens per query is expensive, especially if the user asks ten questions. Latency: the model takes longer to process 100,000 tokens than 2,000. Attention dilution: even with large context windows, models perform better on focused input than on large, mostly-irrelevant context. The relevant paragraph buried in page 147 gets less attention than if it were presented directly.

The hybrid approach is increasingly common: use the full document for the first pass (summarization, overview, classification), then use targeted retrieval for specific questions. The initial pass builds an index or map of the document's contents. Subsequent questions use that map to find and extract the relevant sections without processing the full document again.

docrew takes this pragmatic approach. The agent reads the full document when it needs to understand the overall structure or produce a comprehensive analysis. For specific questions about a long document, it reads the relevant sections rather than reprocessing the entire file. The decision is made by the agent at runtime based on document size, question type, and cost constraints.

Where chunking is heading

The chunking landscape is shifting in two directions simultaneously.

Smarter chunking uses language models themselves to determine chunk boundaries. Instead of heuristic rules, a lightweight model reads the document and identifies natural division points -- topic transitions, argument boundaries, narrative shifts. This produces semantically optimal chunks but adds a preprocessing step and its associated cost.

Less chunking is driven by expanding context windows. As models handle longer inputs efficiently and cheaply, the pressure to chunk diminishes. A model that can process a 500-page document in a single request at low cost makes chunking unnecessary for many use cases.

The likely equilibrium is domain-specific. High-volume production systems processing thousands of documents per hour will continue to chunk for cost and latency reasons. Interactive systems processing one document at a time for a single user may skip chunking entirely and rely on large context windows.

The underlying principle remains constant: the model needs the right context to produce the right answer. Whether you deliver that context through careful chunking and retrieval, or by feeding the entire document, is an engineering trade-off -- not a philosophical one. The strategy that delivers accurate answers at acceptable cost and latency for your specific use case is the right one.
