Prompt Engineering for Document Analysis: Patterns That Work
Five proven prompt engineering patterns for document analysis -- extraction, comparison, classification, summarization, and validation -- with strategies for handling long documents and avoiding common failure modes.
Why document prompts are different
General-purpose prompts and document analysis prompts operate under fundamentally different constraints. When you ask a model to write a poem, quality is subjective and there is no ground truth. Document analysis is the opposite. The ground truth is the document itself. The model's job is to accurately extract, interpret, and restructure existing information. A prompt that produces eloquent but inaccurate extraction is worse than useless.
This distinction changes how you design prompts. Creative prompts optimize for fluency. Document prompts optimize for accuracy and completeness. Documents are long, context windows are finite, and output must be structured for downstream consumption. These are engineering problems.
The patterns that follow are practical approaches refined through real-world use.
Pattern 1: Extraction prompts
Extraction is the most common document analysis task: pulling structured data from unstructured text. Invoice line items, contract dates, named entities, financial figures, compliance requirements.
The core principle is schema-first prompting. Before asking the model to extract anything, define the exact output structure you expect. Field names, data types, required versus optional fields, and the format for each value.
A weak extraction prompt says: "Extract the key information from this contract." The model decides what is "key," chooses its own field names, and formats the output however it pleases. Every document produces a different structure. The results are impossible to process programmatically.
A strong extraction prompt provides the schema upfront. It specifies every field: party names (as an array of strings), effective date (ISO 8601 format), termination date (ISO 8601 or null if not specified), governing law jurisdiction (string), total contract value (number in USD, or null). The model fills in the schema. The output is consistent across documents.
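As a minimal sketch of schema-first prompting (the field names and schema wording here are illustrative assumptions, not a fixed standard), the schema can be assembled into the prompt programmatically so it is identical for every document:

```python
# Hypothetical contract schema: field names, types, and the null policy
# are illustrative, not prescribed by any particular provider.
CONTRACT_SCHEMA = {
    "party_names": "array of strings",
    "effective_date": "string, ISO 8601 date",
    "termination_date": "string, ISO 8601 date, or null if not specified",
    "governing_law": "string",
    "total_value_usd": "number, or null if not specified",
}

def build_extraction_prompt(document_text: str) -> str:
    """Embed the exact output schema in the prompt so every document
    produces the same structure."""
    schema_lines = "\n".join(f'- "{k}": {v}' for k, v in CONTRACT_SCHEMA.items())
    return (
        "Extract the following fields from the contract below.\n"
        "Return a single JSON object with exactly these fields:\n"
        f"{schema_lines}\n"
        "If a field is not present in the document, return null for it. "
        "Do not guess or infer values that are not stated.\n\n"
        f"Contract:\n{document_text}"
    )

prompt = build_extraction_prompt("AGREEMENT dated 2024-03-01 ...")
```

Because the schema lives in one place, changing a field definition updates every prompt the system sends, which also helps against the format drift discussed later.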
The second principle is handling absence explicitly. Documents frequently omit information. A contract might not specify a termination date. An invoice might lack a purchase order number. The prompt must instruct the model on what to do when a field is missing: return null, return a specific sentinel value, or flag it as not found. Without this instruction, models often hallucinate plausible-sounding values to fill gaps -- a termination date of "December 31, 2025" that appears nowhere in the document.
The third principle is anchoring extraction to source text. For high-stakes extraction, instruct the model to include the exact source text alongside each extracted value. This creates a verifiable chain: the model says the contract value is $2.4 million, and it quotes the sentence "The total consideration shall not exceed Two Million Four Hundred Thousand Dollars ($2,400,000)." A human reviewer can verify the extraction against the source in seconds.
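These anchors can also be checked mechanically. A sketch (the extraction structure with a `source_quote` field is an assumption, not a standard output format) that verifies each quote actually appears verbatim in the source before the value is trusted:

```python
def verify_anchors(extractions: dict, source_text: str) -> dict:
    """For each extracted field, check that its quoted source sentence
    appears verbatim in the document. Fields with a missing or
    unmatched quote are flagged for human review."""
    results = {}
    for field, item in extractions.items():
        quote = item.get("source_quote")
        results[field] = bool(quote) and quote in source_text
    return results

source = (
    "The total consideration shall not exceed Two Million Four Hundred "
    "Thousand Dollars ($2,400,000)."
)
extracted = {
    "total_value_usd": {
        "value": 2400000,
        "source_quote": "The total consideration shall not exceed Two Million "
                        "Four Hundred Thousand Dollars ($2,400,000).",
    },
    # No anchor: this value cannot be verified and may be hallucinated.
    "termination_date": {"value": "2025-12-31", "source_quote": None},
}
checks = verify_anchors(extracted, source)
```

A substring check is deliberately strict; a real pipeline might normalize whitespace before matching, but exact matching makes failures unambiguous.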
Pattern 2: Comparison prompts
Comparing documents -- two versions of a contract, competing vendor proposals, this quarter's financials versus last quarter's -- requires a different prompt architecture than extraction.
The naive approach is to put both documents in the context and ask "what are the differences?" This produces inconsistent results. The model may focus on superficial textual differences (word choice, formatting) while missing substantive changes (a liability cap that doubled, a new indemnification clause).
Effective comparison prompts use a two-phase structure. Phase one extracts a standardized representation from each document independently, using the extraction pattern described above. Phase two compares the standardized representations against each other.
This works because comparison is easier when the inputs have the same structure. Comparing two JSON objects field by field is more reliable than comparing two 30-page documents holistically. The model's attention is focused on specific, structured data points rather than diffused across thousands of tokens of raw text.
For document version comparison specifically, the prompt should distinguish between categories of changes: additions (clauses present in version B but not version A), deletions (clauses present in A but not B), modifications (same clause, different terms), and unchanged sections. Categorizing changes gives the output structure and ensures completeness. Without these categories, the model tends to report the most obvious changes and skip subtle ones.
A practical enhancement is materiality filtering. Not all differences matter equally. A changed font size is noise. A changed limitation of liability is signal. The prompt can instruct the model to classify each change by significance: material (affects rights, obligations, or financial terms), administrative (affects process but not substance), and cosmetic (formatting, typos, rewording without substance change). This saves the human reviewer from sifting through dozens of irrelevant formatting changes to find the three clauses that actually changed.
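Phase two of this pattern is simple enough to sketch directly. Assuming both documents have already been reduced to flat dicts by the extraction phase (the field names below are hypothetical), the structured diff falls out of set operations:

```python
def compare_extractions(version_a: dict, version_b: dict) -> dict:
    """Diff two standardized extractions field by field into the four
    change categories. Materiality classification would be layered on
    top of this output, e.g. by a second model pass."""
    keys_a, keys_b = set(version_a), set(version_b)
    return {
        "added": sorted(keys_b - keys_a),
        "deleted": sorted(keys_a - keys_b),
        "modified": sorted(
            k for k in keys_a & keys_b if version_a[k] != version_b[k]
        ),
        "unchanged": sorted(
            k for k in keys_a & keys_b if version_a[k] == version_b[k]
        ),
    }

a = {"liability_cap_usd": 1_000_000, "governing_law": "Delaware"}
b = {"liability_cap_usd": 2_000_000, "governing_law": "Delaware",
     "indemnification": "mutual"}
diff = compare_extractions(a, b)
```

The doubled liability cap surfaces as a modification and the new indemnification clause as an addition, exactly the substantive changes a holistic "what are the differences?" prompt tends to bury.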
Pattern 3: Classification prompts
Classification prompts sort documents or sections into predefined categories. Document type classification (is this an invoice, a contract, a letter, or a report?), clause classification (is this an indemnification clause, a limitation of liability, or a force majeure provision?), or risk classification (does this document contain high, medium, or low regulatory risk?).
The key to reliable classification is exhaustive category definitions. Do not assume the model shares your understanding of what "high risk" means. Define it explicitly. "High risk: documents containing personal health information, financial account numbers, or trade secrets. Medium risk: documents containing personally identifiable information such as names and addresses but no sensitive financial or health data. Low risk: documents containing no personal or sensitive information."
Each category needs both positive examples (what belongs in this category) and boundary conditions (edge cases that might be confusing). A document mentioning "health insurance premiums" could be classified as health-related or financial depending on context. The prompt should address these ambiguities.
For multi-label classification -- where a document can belong to multiple categories simultaneously -- the prompt must explicitly state that categories are not mutually exclusive. Without this instruction, models tend to pick a single "best" category even when multiple apply. A contract amendment might be both a "legal document" and a "financial document." The prompt should make clear that returning both labels is expected and correct.
Confidence scoring adds another layer of utility. Instruct the model to provide a confidence level for each classification. This is not a rigorous probability estimate -- language models are not calibrated classifiers -- but it surfaces the cases where the model is uncertain. A document classified as "invoice" with high confidence probably is an invoice. A document classified as "invoice" with low confidence might be a payment reminder or a pro forma estimate that resembles an invoice. The low-confidence flag tells the human reviewer where to focus their attention.
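The downstream use of these flags can be sketched as a simple routing step (the confidence labels and record shape are assumptions about what the classification prompt returns):

```python
# Coarse ordering of the model's self-reported labels; these are
# flags, not calibrated probabilities.
CONFIDENCE_ORDER = {"low": 0, "medium": 1, "high": 2}

def route_for_review(classifications, min_confidence="high"):
    """Split classifications into auto-accepted results and a
    human-review queue based on self-reported confidence."""
    floor = CONFIDENCE_ORDER[min_confidence]
    accepted, review = [], []
    for item in classifications:
        bucket = accepted if CONFIDENCE_ORDER[item["confidence"]] >= floor else review
        bucket.append(item)
    return accepted, review

accepted, review = route_for_review([
    {"doc": "inv-001.pdf", "label": "invoice", "confidence": "high"},
    {"doc": "inv-002.pdf", "label": "invoice", "confidence": "low"},
])
```

The threshold is a tuning knob: lowering it to "medium" trades review workload for a higher chance of accepting a mislabeled payment reminder.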
Pattern 4: Summarization prompts
Summarization is deceptively difficult. Anyone can ask a model to "summarize this document." Getting a summary that is accurate, complete, appropriately detailed, and useful for a specific purpose requires more precision.
The first design choice is purpose-driven summarization. A summary for a legal review has different requirements than a summary for an executive briefing or a summary for a database record. The prompt must specify who will read the summary and what decisions it needs to support. "Summarize this contract for a legal team reviewing it for non-standard terms" produces very different output than "summarize this contract for an executive who needs to know whether to sign it."
The second challenge is progressive summarization for long documents. A 200-page report cannot be summarized in a single pass if it exceeds the model's context window. Even when it fits, the model's attention degrades over very long inputs -- sections near the middle get less thorough treatment than sections at the beginning and end.
Progressive summarization works in stages. First, the document is divided into logical sections -- chapters, major headings, or fixed-size chunks with overlap. Each section is summarized independently. Then the section summaries are concatenated and summarized again to produce the final summary. This two-stage approach ensures that every part of the document receives equal attention.
The risk of progressive summarization is information loss between stages. A detail that seems unimportant in the context of one section might be critical in the context of the whole document. To mitigate this, the section-level prompt should err on the side of inclusion: "Include all factual claims, numerical data, named entities, and obligations mentioned in this section." The second-stage prompt then filters and prioritizes.
A practical technique is structured summarization. Instead of asking for a narrative summary, ask for a summary organized into specific sections: key findings, financial figures, obligations and deadlines, risks and concerns, and recommended actions. This structure forces completeness -- the model must address each section rather than gravitating toward whatever seems most interesting.
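The two-stage progressive flow can be sketched as a skeleton with the model call abstracted behind a callable; `summarize` below is a stand-in for whatever model invocation you use, and the truncating lambda exists only so the example runs without an API:

```python
from typing import Callable, List

def progressive_summary(
    sections: List[str],
    summarize: Callable[[str], str],
) -> str:
    """Stage one: summarize each section independently (the prompt
    behind `summarize` should err toward inclusion). Stage two:
    summarize the concatenated section summaries into the final output."""
    section_summaries = [summarize(s) for s in sections]
    return summarize("\n\n".join(section_summaries))

# Stand-in summarizer for illustration: truncates instead of calling a model.
fake_summarize = lambda text: text[:40]
final = progressive_summary(
    ["Section one body ...", "Section two body ..."],
    fake_summarize,
)
```

Keeping the model call behind a callable also makes the pipeline testable: the chunking and merging logic can be exercised with a deterministic stub, separately from prompt quality.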
Pattern 5: Validation prompts
Validation prompts cross-reference data points across multiple sources. Does the invoice total match the purchase order amount? Does the contract reference the correct party names from the corporate registration? Do the financial figures in the quarterly report match the underlying ledger data?
Validation is where prompt engineering intersects most directly with accuracy requirements. A missed discrepancy is not just an incomplete analysis -- it is a failure of the system's core purpose.
The pattern for validation prompts has three parts. First, extract the data points to be validated from each source, using the extraction pattern. Second, define the validation rules explicitly in the prompt. "The invoice total must equal the sum of line item amounts. The invoice date must fall within the contract period. The vendor name on the invoice must match the vendor name on the purchase order." Third, require the model to evaluate each rule independently and report the result -- pass, fail, or unable to verify (when the source data is insufficient).
The "unable to verify" category is critical. Models that are forced into a binary pass/fail will make a judgment even when the evidence is ambiguous. Allowing a third option -- "I cannot determine this from the available information" -- reduces false positives and false negatives alike.
For complex validation involving numerical calculations, instruct the model to show its work. "Sum the line items: $4,200 + $1,800 + $3,500 = $9,500. Invoice total states $9,800. Discrepancy: $300." This chain-of-reasoning approach catches arithmetic errors that the model might make in its head. It also makes the validation auditable -- a human can check the math.
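For rules that are purely arithmetic, the same three-way result can be computed deterministically rather than asked of the model at all; a sketch (field names and tolerance are illustrative) of one rule evaluated independently, with the work shown:

```python
def validate_invoice_total(line_items, stated_total, tolerance=0.01):
    """Evaluate one validation rule and return pass / fail /
    unable-to-verify, with the arithmetic shown so the result is
    auditable by a human reviewer."""
    if not line_items or stated_total is None:
        return {"result": "unable to verify",
                "detail": "missing line items or stated total"}
    computed = round(sum(line_items), 2)
    detail = f"sum of line items = {computed}; invoice states {stated_total}"
    if abs(computed - stated_total) <= tolerance:
        return {"result": "pass", "detail": detail}
    discrepancy = round(stated_total - computed, 2)
    return {"result": "fail", "detail": f"{detail}; discrepancy = {discrepancy}"}

check = validate_invoice_total([4200, 1800, 3500], 9800)
```

A reasonable division of labor is to let the model extract the line items and totals (with source anchors) and let plain code do the arithmetic, reserving model judgment for rules that genuinely require interpretation.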
Few-shot vs zero-shot for document tasks
Zero-shot works well when the output format is simple and the task is unambiguous. Extracting a date, classifying a document into three types, checking for a clause. The model understands the task from the instruction alone.
Few-shot is essential when the output format is complex or the task involves judgment calls. One or two examples of correct output communicate a complex JSON schema better than paragraphs of description. If "material change" means something specific in your domain, examples define the boundary more effectively than definitions.
The trade-off is context space. Every token used for examples is a token not available for document content. In practice, one or two well-chosen examples provide most of the benefit. The marginal return drops sharply after that.
A hybrid approach works well: include examples in the system prompt (cached and reused across requests) and keep the user prompt focused on the document. This separates the stable task definition from the variable input and benefits prompt caching.
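The split might look like the following sketch; the request shape loosely follows chat-style APIs, and the schema, example, and field names are illustrative:

```python
# Stable across all requests of this task type: role, schema, one
# few-shot example, and the null policy. This is the cacheable part.
SYSTEM_PROMPT = (
    "You are a document analysis specialist.\n"
    "Return a JSON object with fields: invoice_number (string), "
    "total (number or null).\n"
    'Example output: {"invoice_number": "INV-042", "total": 1250.00}\n'
    "If a field is not present, return null."
)

def build_messages(document_text: str) -> dict:
    """Only the document varies per request; the system prompt is
    byte-identical every time, which is what caching requires."""
    return {
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": document_text}],
    }

request = build_messages("Invoice INV-077 ... Total due: $980.00")
```

Note that caching typically requires the cached prefix to be byte-identical, so the system prompt should be a constant, not rebuilt with anything request-specific.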
Handling long documents: chunked prompts with context carryover
When a document exceeds the model's context window or is long enough that attention quality degrades, you need a chunking strategy.
Naive chunking creates problems. A sentence split across chunks loses meaning. A pronoun in chunk 5 referencing an entity from chunk 2 becomes unresolvable.
Overlapping chunks solve boundary splits. Each chunk includes the last N sentences of the previous chunk, ensuring continuity. Context carryover solves references: after processing chunk N, extract a brief context summary (key entities, running totals, open questions) and pass it as a preamble to chunk N+1.
Chunk size is a trade-off. Smaller chunks mean more passes and more information loss. Larger chunks risk attention degradation. A practical starting point is 60-70% of the context window per chunk, leaving room for instructions, context carryover, and output.
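The overlapping-chunk mechanics are straightforward; a sketch operating on a pre-split sentence list (sentence splitting itself is out of scope here, and the sizes are illustrative):

```python
def chunk_with_overlap(sentences, chunk_size=50, overlap=5):
    """Split a sentence list into chunks where each chunk repeats the
    last `overlap` sentences of the previous one, so no sentence
    boundary or local context is lost between chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(sentences):
        chunks.append(sentences[start:start + chunk_size])
        if start + chunk_size >= len(sentences):
            break
        start += chunk_size - overlap
    return chunks

parts = chunk_with_overlap([f"s{i}" for i in range(12)], chunk_size=5, overlap=2)
```

Context carryover is then a second mechanism layered on top: after each chunk is processed, the extracted entity list and running totals are prepended to the next chunk's prompt.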
For tasks requiring whole-document understanding, chunking is inherently lossy. An agent-style architecture mitigates this through progressive processing: chunk-level analysis first, then a synthesis pass over the intermediate outputs.
Common failure modes
Document analysis prompts fail in predictable ways. Knowing these patterns helps you design prompts that avoid them.
Hallucinated data is the most dangerous failure. The model produces values that look correct but do not appear in the source document. This happens most often with numerical data, dates, and names -- exactly the fields where accuracy matters most. Mitigation: require source text citations for extracted values, and instruct the model to return null rather than guess when data is not found.
Missed sections occur when the model's attention is unevenly distributed across long inputs. The model carefully analyzes the first five pages and the last two pages but glosses over the middle forty. Mitigation: chunked processing with equal attention to each chunk, and a completeness check in the final merge step.
Format drift is when the model's output format gradually changes across a batch of documents. The first ten extractions follow the schema perfectly. By document twenty, field names have shifted, nesting has changed, or optional fields have appeared. Mitigation: include the output schema in every prompt (not just the first one in a batch), and validate the output format programmatically before accepting it.
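A minimal sketch of that programmatic validation step (the expected fields are hypothetical; a library such as jsonschema or pydantic would do this more thoroughly):

```python
# Expected keys and allowed value types for one analysis type.
EXPECTED_FIELDS = {
    "party_names": list,
    "effective_date": (str, type(None)),
    "total_value_usd": (int, float, type(None)),
}

def conforms_to_schema(output: dict) -> bool:
    """Reject any model output whose keys or value types have drifted
    from the expected schema, before it enters downstream processing."""
    if set(output) != set(EXPECTED_FIELDS):
        return False
    return all(isinstance(output[k], t) for k, t in EXPECTED_FIELDS.items())

ok = conforms_to_schema({"party_names": ["Acme"],
                         "effective_date": "2024-01-01",
                         "total_value_usd": None})
drifted = conforms_to_schema({"parties": ["Acme"],   # renamed key
                              "effective_date": "2024-01-01",
                              "total_value_usd": None})
```

Rejected outputs can be retried with the schema restated, which in practice corrects most drift on the second attempt.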
Over-extraction is the model returning information you did not ask for, padding the output with tangentially related data. This is more annoying than dangerous, but it clutters the results and can confuse downstream processing. Mitigation: explicit negative instructions -- "Do not include X, Y, or Z. Only return the fields specified in the schema."
Conflation of similar concepts occurs when the model merges distinct but similar items. Two separate indemnification clauses in the same contract get combined into one. Three invoices from the same vendor get their line items mixed together. Mitigation: number or label the items in the prompt ("Extract Indemnification Clause 1 and Indemnification Clause 2 separately") and instruct the model to preserve the document's own distinctions.
System prompts vs user prompts: what goes where
The division between system prompt and user prompt is an engineering decision that affects both performance and maintainability.
System prompts should contain stable instructions: the output schema, the role definition ("You are a document analysis specialist"), the general rules (how to handle missing data, what format to use for dates, when to return null vs skip a field), and any few-shot examples. These instructions are the same across every document in a given analysis type.
User prompts should contain variable content: the document text itself, any document-specific instructions ("This is a German-language contract; extract party names in the original language"), and the specific query if the analysis is interactive.
This separation has a practical benefit beyond cleanliness: prompt caching. Systems that cache the system prompt across requests -- as most production LLM providers now support -- pay the full processing cost only once for the stable instructions. Subsequent requests with different documents but the same system prompt hit the cache, reducing latency and cost.
If you put the document text in the system prompt, every new document invalidates the cache. If you put the schema in the user prompt, you are sending it fresh every time instead of benefiting from caching.
The cache boundary is also the stability boundary. Things that change per-document go in the user prompt. Things that change per-task-type go in the system prompt. Things that change per-deployment go in configuration outside the prompt entirely.
Testing prompts: evaluating quality systematically
Prompt engineering without systematic testing is guesswork. You change a word, run it on one document, see improvement, and ship. Then it fails on documents with different formatting or structure.
Systematic testing requires three things: a test set, metrics, and a comparison methodology.
The test set should be diverse: easy cases (well-formatted), hard cases (unusual formatting, missing fields), and adversarial cases (contradictory defined terms, duplicate line items). Twenty to thirty documents is a practical minimum.
Metrics depend on the task. For extraction: field-level accuracy, completeness, and false positive rate. For comparison: change detection recall and precision. For classification: precision, recall, and F1 per category.
The comparison methodology: run Prompt A against the full test set, record metrics, modify the prompt, run Prompt B, compare. If Prompt B improves accuracy from 87% to 93% but drops completeness from 95% to 82%, you have a trade-off -- not a clear win.
Version your prompts and automate evaluation. Build a harness that runs prompts against the test set, compares against ground truth, and reports metrics. This turns prompt iteration from art into engineering.
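The metric side of such a harness can be sketched in a few lines; the accuracy and completeness definitions below are one reasonable choice among several, and the record shapes are assumptions:

```python
def field_accuracy(predictions, ground_truth):
    """Field-level accuracy and completeness over a labeled test set.
    `predictions` and `ground_truth` are parallel lists of dicts keyed
    by field name; None means 'not found'. Accuracy: fraction of fields
    matching ground truth. Completeness: fraction of fields answered."""
    correct = total = answered = 0
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            total += 1
            got = pred.get(field)
            if got is not None:
                answered += 1
            if got == expected:
                correct += 1
    return {"accuracy": correct / total, "completeness": answered / total}

metrics = field_accuracy(
    predictions=[{"total": 9500, "date": None}],
    ground_truth=[{"total": 9500, "date": "2024-02-01"}],
)
```

Reporting both numbers side by side is what makes trade-offs like the 87%-to-93% accuracy gain with an 82% completeness drop visible before shipping.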
At docrew, prompt evaluation is part of the development cycle. Changes to analysis prompts go through the same testing rigor as code changes. Measure before you ship.
Where this is heading
Prompt engineering for document analysis is maturing from ad hoc experimentation into a discipline with repeatable patterns. The five patterns described here -- extraction, comparison, classification, summarization, and validation -- cover the majority of document analysis tasks. Combining them handles even complex multi-step workflows.
The trend is toward more structured, more testable, more reliable prompt designs. The models are getting better at following complex instructions, which means the ceiling on what well-engineered prompts can achieve keeps rising. But the floor -- what happens when prompts are poorly designed -- stays the same. Garbage in, garbage out, regardless of model capability.
The organizations that get the most value from AI document analysis are not the ones with the most powerful models. They are the ones with the most disciplined prompt engineering practices: defined schemas, systematic testing, explicit handling of edge cases, and continuous measurement of accuracy against real documents.
The model is the engine. The prompt is the steering wheel. Both matter. But most people are driving with their knees.