Benchmarking AI Models for Document Tasks: What to Measure
General AI benchmarks fail to predict document processing performance. Learn how to measure extraction accuracy, comparison quality, speed, cost, and build evaluation sets that reflect real document work.
Why general benchmarks miss the point
Every major AI model release comes with a scorecard. MMLU scores. HumanEval pass rates. MATH benchmarks. These numbers are useful for comparing general reasoning ability. They are nearly useless for predicting how well a model will process your documents.
A model that scores 90% on MMLU might extract invoice line items with 72% accuracy. The skills that general benchmarks measure -- broad knowledge recall, mathematical reasoning, code generation -- overlap only partially with what document analysis demands: precise extraction from noisy inputs, faithful schema adherence, long-context tolerance, and the ability to distinguish source information from hallucinated content.
The only reliable predictor of document task performance is document task performance. You have to measure it yourself. This article describes what to measure, how to measure it, and which mistakes to avoid.
Extraction accuracy: the foundational metric
Extraction -- pulling structured data from unstructured documents -- is the most common document analysis task and the easiest to measure rigorously.
Field-level accuracy is the primary metric. For each field in your extraction schema, compare the model's output against the ground truth value. Did the model extract the correct contract date, the correct total amount, the correct party names? Each field is a separate measurement point.
Field-level accuracy matters more than document-level accuracy because it reveals where the model succeeds and where it fails. A model that extracts dates with 98% accuracy but party names with 75% accuracy has a specific weakness. A model that scores 88% document-level accuracy hides the distribution of errors.
Exact match vs fuzzy match is a design decision you need to make early. Should "Acme Corporation" and "Acme Corp." count as a match? Should "$1,200,000" and "1200000" count as a match? For names, fuzzy matching (normalized string comparison, alias resolution) is usually appropriate. For numerical values, exact semantic matching (after format normalization) is correct. For dates, normalize to a canonical format before comparing.
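These normalization choices can be sketched in a few small helpers. The date formats, the list of corporate suffixes, and the alias table below are illustrative assumptions, not a complete implementation -- adapt them to your own schema:

```python
import re
from datetime import datetime

def normalize_amount(value: str):
    """Strip currency symbols and separators so '$1,200,000' matches '1200000'."""
    cleaned = re.sub(r"[^\d.\-]", "", value)
    try:
        return float(cleaned)
    except ValueError:
        return None

def normalize_date(value: str):
    """Canonicalize common date formats to ISO 8601 before comparing."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def names_match(a: str, b: str, aliases: dict = None) -> bool:
    """Fuzzy name match: lowercase, drop punctuation and corporate suffixes,
    then resolve known aliases. 'Acme Corporation' matches 'Acme Corp.'."""
    def norm(s: str) -> str:
        s = re.sub(r"[^\w\s]", "", s.lower())
        s = re.sub(r"\b(corporation|corp|inc|llc|ltd)\b", "", s)
        s = re.sub(r"\s+", " ", s).strip()
        return (aliases or {}).get(s, s)
    return norm(a) == norm(b)
```

The key design choice: numbers and dates get exact comparison after normalization, while names get looser string-level matching.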
F1 score is more informative than raw accuracy for extraction tasks where the number of items varies per document. An invoice might have 3 line items or 30. Accuracy alone does not distinguish between a model that finds all 30 items and one that finds 25 out of 30 but also hallucinates 5 items that do not exist. F1 combines precision (of the items extracted, how many are correct?) and recall (of the items that exist, how many were found?) into a single metric that penalizes both missed items and fabricated items.
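A minimal F1 scorer for list-valued extraction might look like this. It assumes items have already been normalized and are hashable; real line items would need a normalization pass first:

```python
def extraction_f1(predicted: list, ground_truth: list) -> dict:
    """Precision, recall, and F1 for list-valued extraction (e.g. invoice
    line items). Penalizes both missed items and fabricated items."""
    pred, truth = set(predicted), set(ground_truth)
    true_positives = len(pred & truth)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For the example above -- a model that finds 25 of 30 real items and hallucinates 5 more -- both precision and recall come out to 25/30, and the F1 of roughly 0.83 reflects both error types at once, where raw accuracy would not.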
Null handling accuracy deserves its own measurement. When a field is absent from the document, does the model correctly return null or "not found"? Or does it hallucinate a plausible value? Null handling is where many models fail silently. The model confidently returns a termination date of "December 31, 2026" for a contract that specifies no termination date. This is a particularly dangerous error because it looks like a valid extraction.
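Null handling can be scored separately by filtering the evaluation results down to fields that are genuinely absent from the source. The result shape here is an assumption for illustration:

```python
def null_handling_report(results: list) -> dict:
    """Score null handling across field-level results.

    Each result is a dict {"predicted": value_or_None, "truth": value_or_None}.
    A hallucination is a non-null prediction where the ground truth is null."""
    absent = [r for r in results if r["truth"] is None]
    if not absent:
        return {"null_accuracy": None, "hallucination_rate": None}
    correct_nulls = sum(1 for r in absent if r["predicted"] is None)
    return {
        "null_accuracy": correct_nulls / len(absent),
        "hallucination_rate": 1 - correct_nulls / len(absent),
    }
```

Tracking hallucination rate as its own number keeps these silent failures from being averaged away inside overall field accuracy.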
At docrew, extraction accuracy is the metric we track most closely during model evaluation. It directly determines whether the output is usable or requires human correction.
Table parsing accuracy
Tables are a distinct challenge that deserves separate measurement. Documents are full of tables -- financial statements, invoice line items, comparison matrices, specification sheets -- and models handle them with widely varying quality.
Cell-level accuracy measures whether the model correctly extracts the value of each cell. A 10-row, 5-column table has 50 cells. If the model gets 45 right, that is 90% cell-level accuracy. But which 5 cells did it miss? If they are all in the "Amount" column, the table is useless for financial analysis despite the high overall score.
Structure preservation measures whether the model maintains the row-column relationships. A common failure mode is "row shifting" -- the model associates a vendor name with the wrong invoice amount because it lost track of which row it was processing. Structure preservation checks that values in the same row of the output actually correspond to values in the same row of the source.
Header recognition tests whether the model correctly identifies column and row headers versus data cells. A model that treats "Total" (a row header in a summary row) as a vendor name has misunderstood the table structure.
Tables in PDFs are especially challenging because the table structure is not encoded in the file format -- it is inferred visually from the layout of text elements. Different models handle this inference with dramatically different accuracy. Measuring table parsing explicitly, rather than lumping it in with general extraction, reveals these differences.
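The cell-level and structure-preservation checks above can be combined in one small scorer. This sketch assumes the predicted table and the ground truth have the same dimensions; production code would also need to handle missing or extra rows:

```python
def table_metrics(predicted: list, truth: list) -> dict:
    """Cell-level accuracy plus per-row integrity for same-shape tables.

    A row counts as intact only if every cell in it matches, which
    surfaces row-shifting even when overall cell accuracy looks high."""
    total = correct = intact = 0
    for p_row, t_row in zip(predicted, truth):
        row_ok = True
        for p, t in zip(p_row, t_row):
            total += 1
            if p == t:
                correct += 1
            else:
                row_ok = False
        intact += row_ok
    return {
        "cell_accuracy": correct / total if total else 0.0,
        "row_integrity": intact / len(truth) if truth else 0.0,
    }
```

A table with 75% cell accuracy but only 50% row integrity is a different (and worse) failure than one with the errors scattered evenly, because entire rows are unusable.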
Comparison accuracy
For document comparison tasks -- identifying changes between contract versions, differences between vendor proposals, discrepancies between related documents -- the metrics map onto the categories of change a reviewer needs to catch.
Change detection recall: of all the actual changes between two documents, what percentage did the model find? This is the most important metric because missed changes are the primary risk. A legal team that trusts the comparison output and signs a contract with an undetected clause change has a real problem.
Change detection precision: of all the changes the model reported, what percentage are actual changes? False positives are less dangerous than false negatives -- a human reviewer can quickly dismiss a spurious change -- but too many false positives erode trust and waste review time.
Change classification accuracy: when the model reports a change, does it correctly classify the nature of the change? Did it correctly identify that a clause was modified (not added or deleted)? Did it correctly label a change as material versus cosmetic? Classification errors do not cause missed changes, but they affect the usefulness of the output for prioritized review.
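Given ground-truth annotations, all three comparison metrics fall out of a simple set comparison. Representing each change as a location-to-label mapping (the keying scheme here is an assumption; clause IDs or paragraph offsets both work):

```python
def comparison_metrics(predicted: dict, truth: dict) -> dict:
    """Change-detection recall, precision, and classification accuracy.

    Both inputs map a change location (e.g. a clause id) to a label
    such as 'modified', 'added', or 'deleted'."""
    found = set(predicted) & set(truth)
    recall = len(found) / len(truth) if truth else 1.0
    precision = len(found) / len(predicted) if predicted else 1.0
    correctly_classified = sum(1 for loc in found if predicted[loc] == truth[loc])
    return {
        "recall": recall,
        "precision": precision,
        # Classification is scored only over changes that were detected.
        "classification_accuracy": (correctly_classified / len(found)
                                    if found else None),
    }
```

Scoring classification only over detected changes keeps the two failure modes separate: a missed change hurts recall, while a found-but-mislabeled change hurts classification accuracy.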
Semantic vs textual change detection is the subtlest measurement. "The Seller shall deliver within thirty (30) days" changed to "Delivery shall occur within 30 business days." A textual diff catches the word changes. A semantic comparison catches the substantive change: calendar days became business days, potentially adding two weeks to the delivery timeline. Test whether the model identifies the semantic implication, not just the textual difference.
Summarization quality
Summarization quality is harder to measure than extraction or comparison because there is no single correct answer. Two good summaries of the same document can be quite different. But there are measurable dimensions.
Factual consistency is non-negotiable. Every claim in the summary must be supported by the source document. A summary that states "the contract value is $3.2 million" when the document says "$2.3 million" has failed regardless of how well-organized or well-written it is. Factual consistency can be measured by extracting the factual claims from the summary and checking each one against the source.
Coverage measures whether the summary addresses all the important topics in the document. If a 50-page report has six major sections and the summary covers four of them, it has a coverage gap. Coverage testing requires a ground truth annotation of the document's key topics, which is labor-intensive to create but reusable across model evaluations.
Conciseness is the ratio of summary length to source length, adjusted for information density. A summary that is 40% as long as the original but conveys 90% of the key points is well-calibrated. A summary that is 40% as long but repeats the same three points in different words is not concise -- it is redundant. Conciseness is measured by information density, not word count alone.
Relevance to purpose measures whether the summary emphasizes the right things for its intended audience. A legal review summary should highlight non-standard clauses and risk factors. An executive summary should highlight financial terms and strategic implications. The same document, summarized for different purposes, should produce different outputs. Testing this requires purpose-specific ground truth annotations.
In practice, factual consistency is the gate. A summary that is not factually consistent fails regardless of its other qualities. Coverage and conciseness are the optimization targets once consistency is established.
Speed benchmarks
Speed matters in document processing because documents come in batches. A model that takes twice as long per document doubles the wall-clock time of a 200-document review.
Tokens per second is the raw throughput metric. It is easy to measure and useful for comparing models at the infrastructure level, but it does not directly predict task completion time because different prompts and documents produce different token counts.
End-to-end latency is more useful: how long from submitting the document to receiving the complete output? This includes prompt processing time, generation time, and any overhead from the API or infrastructure. End-to-end latency is what the user experiences.
Time to first token matters for interactive use cases where the user is watching streaming output. A model that starts responding in 200 milliseconds feels responsive even if the full response takes 30 seconds. A model that sits silently for 5 seconds before producing any output feels slow even if the total time is the same.
Throughput under concurrency measures how the model performs when processing multiple documents simultaneously. Some models maintain near-constant per-request latency up to a concurrency threshold, then degrade sharply. Others degrade gradually. For batch processing -- the dominant pattern in document analysis -- throughput under concurrency determines the practical processing speed.
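End-to-end latency and throughput under concurrency can both be measured with a small harness. The `process` callable below stands in for your pipeline call and is hypothetical; the harness itself is a minimal sketch, not a load-testing tool:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median

def benchmark(process, documents: list, concurrency: int) -> dict:
    """Measure per-document end-to-end latency and batch throughput.

    Latency is wall-clock from submission to complete output -- what
    the user experiences. Throughput is documents per second across
    the whole batch at the given concurrency level."""
    latencies = []
    def timed(doc):
        start = time.perf_counter()
        process(doc)
        latencies.append(time.perf_counter() - start)
    batch_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed, documents))
    elapsed = time.perf_counter() - batch_start
    return {
        "median_latency_s": median(latencies),
        "docs_per_second": len(documents) / elapsed,
    }
```

Running the same harness at concurrency 1, 4, and 16 shows where a model's per-request latency starts to degrade, which is the number that determines practical batch-processing speed.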
Speed benchmarks should be run under realistic conditions. Testing a model's latency with a 100-token prompt does not predict its latency with a 50,000-token prompt containing a full contract. Test with documents that match your production workload in length and complexity.
Cost benchmarks
Cost is the metric that turns theoretical model comparisons into business decisions.
Cost per document is the headline number: how much does it cost to process one document through your pipeline? This depends on document length (more tokens in, more cost), output complexity (more tokens out, more cost), and the model's pricing structure (input vs output token prices, caching discounts, batch pricing).
Cost per extracted field provides a more granular view. If you are extracting 15 fields from each document, the cost per field reveals whether the extraction is economically viable for your use case. At $0.02 per document, extracting 15 fields costs roughly $0.001 per field. That is viable for high-volume processing. At $0.50 per document, the same extraction costs $0.03 per field -- still viable for high-value documents, less so for routine processing.
Cost at scale is the projection that matters for production planning. At the two per-document prices above, processing 100 evaluation documents costs $2 or $50; processing 10,000 documents per month costs $200 or $5,000. Multiply cost per document by projected volume to determine sustainability.
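The arithmetic behind these projections is worth making explicit. The token counts and per-million-token prices below are illustrative assumptions, not any provider's real pricing:

```python
def cost_projection(input_tokens: int, output_tokens: int,
                    price_in_per_m: float, price_out_per_m: float,
                    fields_per_doc: int, docs_per_month: int) -> dict:
    """Project per-document, per-field, and monthly cost from average
    token counts and per-million-token prices."""
    per_doc = (input_tokens * price_in_per_m
               + output_tokens * price_out_per_m) / 1_000_000
    return {
        "cost_per_document": per_doc,
        "cost_per_field": per_doc / fields_per_doc,
        "cost_per_month": per_doc * docs_per_month,
    }
```

For example, a 15,000-token document at an assumed $1 per million input tokens plus 1,000 output tokens at $5 per million works out to $0.02 per document, about $0.001 per field for a 15-field schema, and $200 per month at 10,000 documents.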
Cost-accuracy trade-off is the most important analysis. If Model A extracts at 95% accuracy for $0.03 per document and Model B at 91% for $0.005, the decision depends on the cost of human correction for the 4% gap versus the 6x cost difference. For most pipelines, model cost is small relative to human review cost. A model that costs three times more but reduces correction time by 30% is a net savings.
Building your own evaluation set
Off-the-shelf benchmarks do not test your documents, your schema, your edge cases. You need a custom evaluation set.
Start with real documents. Synthetic documents -- generated or modified to create a test set -- introduce biases. They tend to be more regular, more consistent, and more predictable than real documents. Your evaluation set should contain actual documents from your pipeline, with all their messiness: inconsistent formatting, unusual layouts, missing fields, scanned pages with OCR artifacts.
Annotate ground truth carefully. For each document, create the correct extraction output by hand. This is tedious. It is also the foundation of your entire evaluation infrastructure. Ground truth annotations that are themselves inaccurate produce misleading metrics. Have two people annotate independently and resolve disagreements. For extraction tasks, this means filling in every field of the schema for every document. For comparison tasks, this means cataloging every change between document pairs.
Stratify by difficulty. Your evaluation set should include easy documents (clean formatting, standard layout, all fields present), medium documents (some unusual formatting, some missing fields), and hard documents (poor scan quality, non-standard layouts, ambiguous content). Weight the strata to match your production distribution. If 80% of your real documents are straightforward and 20% are challenging, your evaluation set should reflect that ratio.
Include adversarial examples. Documents that are specifically designed to trip up the model: a contract where the effective date appears in the header but a different date appears in the signature block, an invoice where the subtotal and total do not match, a report where the executive summary contradicts the detailed findings. These adversarial cases test the model's robustness, not just its performance on cooperative inputs.
Size the set appropriately. Twenty documents is the minimum for any meaningful evaluation. Fifty gives you reasonable statistical stability. One hundred or more is ideal if you can afford the annotation effort. The evaluation set is a long-term investment -- you will reuse it every time you evaluate a new model, update your prompts, or change your pipeline.
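The effect of evaluation-set size on statistical stability can be seen directly from a confidence interval. This normal-approximation sketch treats per-document correctness as a binomial outcome, which is a simplification but good enough for sizing decisions:

```python
import math

def accuracy_ci(accuracy: float, n_docs: int, z: float = 1.96):
    """Approximate 95% confidence interval for an accuracy measured
    on n_docs documents (normal approximation to the binomial)."""
    se = math.sqrt(accuracy * (1 - accuracy) / n_docs)
    return (max(0.0, accuracy - z * se), min(1.0, accuracy + z * se))
```

A measured 90% accuracy on 20 documents carries an interval of roughly (0.77, 1.0) -- too wide to distinguish two decent models -- while the same measurement on 100 documents tightens to roughly (0.84, 0.96).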
Version and maintain it. As your document types evolve and your schema changes, update the evaluation set. Add new documents that represent new patterns you encounter in production. Remove documents that are no longer representative. An evaluation set that was created once and never updated gradually diverges from the real workload and produces increasingly misleading results.
Model comparison methodology
With an evaluation set and metrics defined, the comparison methodology determines whether your conclusions are reliable.
Controlled conditions are essential. Run all models against the exact same documents, with the exact same prompts (adapted only for model-specific formatting requirements), at the same time. Comparing Model A's results from Tuesday with Model B's results from Thursday introduces variables: API performance fluctuations, prompt differences, even document ordering effects.
Statistical significance matters. If Model A scores 88% extraction accuracy and Model B scores 86% on a 30-document test set, that 2-point difference might not be meaningful. With 30 documents, the confidence interval is wide. You need a large enough sample or a large enough performance gap to draw reliable conclusions. A simple paired comparison test (comparing per-document scores between models) tells you whether the difference is likely real or likely noise.
Run multiple trials. Language models are stochastic. The same prompt with the same document can produce slightly different outputs on different runs. Run each model-document pair three to five times and report the average. If a model's extraction accuracy varies from 85% to 93% across runs on the same document, the model's reliability is itself a finding worth reporting.
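One way to run the paired comparison is a bootstrap over per-document score differences. This is a sketch of one reasonable approach, not the only valid test; a paired t-test or permutation test would serve equally well:

```python
import random

def paired_bootstrap(scores_a: list, scores_b: list,
                     iterations: int = 10_000, seed: int = 0) -> float:
    """Resample the per-document paired score differences and return the
    fraction of resamples in which Model A comes out ahead overall.
    Values near 0.5 mean the observed gap is plausibly noise; values
    near 0.0 or 1.0 mean the gap is likely real."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins = 0
    for _ in range(iterations):
        sample = [rng.choice(diffs) for _ in diffs]
        if sum(sample) > 0:
            wins += 1
    return wins / iterations
```

Because the test is paired per document, it controls for document difficulty: a hard document drags both models down together rather than adding noise to the comparison.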
Control for prompt optimization. If you spent three weeks optimizing prompts for Model A and ten minutes adapting them for Model B, the comparison is biased. Either invest comparable prompt engineering effort for each model, or test with generic prompts and note the results reflect unoptimized performance.
Document your methodology. When you revisit the comparison six months later -- because a new model version was released or your requirements changed -- you need to reproduce the exact conditions. Record the model versions, the prompt texts, the evaluation set version, the date of testing, and the scoring methodology. Future-you will thank present-you.
Common benchmarking mistakes
We have watched many teams evaluate models for document processing, and the same mistakes recur.
Testing on easy documents only. The team grabs ten clean, well-formatted invoices, runs them through the model, gets 98% accuracy, and declares victory. Then production documents arrive: scanned PDFs with OCR errors, handwritten annotations, multi-language content, tables that span page breaks. Accuracy drops to 70%. The evaluation predicted nothing because it did not test the hard cases.
Ignoring edge cases. A contract with no termination date. An invoice with negative line items (credits). A report in landscape orientation. A spreadsheet where merged cells break the table structure. Edge cases are where models differentiate themselves. A model that handles the common case well is a commodity. A model that handles edge cases gracefully is valuable.
Not measuring cost. Two models with identical accuracy at different price points are not equivalent. A model that is 10% more accurate but 5x more expensive might not be the right choice for high-volume, low-value document processing. Cost is part of the evaluation, not an afterthought.
Evaluating extraction without measuring hallucination. A model that extracts 95% of fields correctly sounds good until you discover that 8% of its outputs contain hallucinated values -- plausible-looking data that does not exist in the source document. Measuring only what the model gets right, without measuring what it gets wrong, paints an incomplete picture.
Single-document evaluation. Testing each model on one or two documents and drawing conclusions. This is anecdotal evidence, not benchmarking. A single document might hit a model's strengths or its weaknesses by chance. Systematic evaluation requires a diverse, representative set.
Conflating speed and quality. A fast model that produces mediocre results and a slow model that produces excellent results serve different use cases. Report speed and quality independently. Let the decision-maker weigh the trade-off based on their specific requirements.
Benchmarking once and never again. Models improve. Prompts evolve. Documents change. Build evaluation into your regular process, not as a one-time event.
The real benchmark
After all the metrics and methodology, one question supersedes everything: does this model, with these prompts, on these documents, save time on actual work?
A model that scores 93% on extraction accuracy but requires manual reformatting might save less time than a model at 88% that produces clean, actionable output. The metric that matters is total human time from "I have these documents" to "the work is done."
The model with the highest extraction accuracy is not always the fastest to work with. The model that handles edge cases gracefully, even if average-case performance is slightly lower, might produce the best outcomes by reducing variance in human review time.
At docrew, the decision of which model to use for which task comes down to total time saved on real work. We measure the standard metrics to understand strengths and weaknesses. But the benchmark that pays the bills is end-to-end time savings on actual documents.
The published leaderboards are a starting point. Your documents, your tasks, and your evaluation set are where the real answers live.