
Vision Language Models for Document Understanding

How vision language models surpass OCR for document understanding -- processing layout, tables, handwriting, and charts with spatial reasoning that text extraction cannot replicate.


Beyond reading characters

For decades, getting information out of a document image meant one thing: optical character recognition. Scan the page, detect the characters, output text. Everything downstream -- extraction, classification, analysis -- operated on that text output. The image was discarded after OCR ran.

Vision language models (VLMs) take a fundamentally different approach. They process the document image directly, alongside text. The model sees the page the way a human does -- layout, typography, tables, figures, stamps, handwriting, whitespace, all of it -- and reasons about the visual and textual content simultaneously.

This is not a minor improvement over OCR. It is a category shift. OCR extracts characters. VLMs understand documents.

What VLMs are

A vision language model is a neural network that accepts both images and text as input and produces text as output. It has two encoding pathways -- one for visual input, one for text input -- that merge into a shared representation space where the model reasons across modalities.

The visual encoder (typically a vision transformer) processes the image into a sequence of visual tokens that represent regions, features, and patterns in the image. The text decoder (a standard language model) processes these visual tokens alongside any text input and generates a text response.

The critical capability is that the model does not process the image and text independently. It reasons about their relationship. When a VLM sees a document image and receives the instruction "extract the total amount," it identifies the visual region of the page that contains the total, reads the value from that region, and understands that the number it reads is the answer to the question -- all in one inference pass.
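The fusion described above can be sketched in a few lines. This is a toy numpy illustration of the general pattern -- not any specific model's architecture: patch embeddings from the vision encoder are projected into the language model's embedding space and concatenated with text tokens, so a single attention stack can relate them.

```python
import numpy as np

# Toy illustration of visual-token fusion. All dimensions here are
# assumptions chosen for readability, not taken from a real model.
rng = np.random.default_rng(0)

n_patches, vision_dim = 196, 768   # e.g. a 14x14 grid of patch embeddings
n_text, model_dim = 12, 1024       # text tokens already live in model space

patch_embeddings = rng.normal(size=(n_patches, vision_dim))
text_embeddings = rng.normal(size=(n_text, model_dim))

# A learned linear projection maps vision features into model_dim.
projection = rng.normal(size=(vision_dim, model_dim)) / np.sqrt(vision_dim)
visual_tokens = patch_embeddings @ projection

# The fused sequence the decoder attends over: visual tokens, then text.
fused = np.concatenate([visual_tokens, text_embeddings], axis=0)
print(fused.shape)  # (208, 1024)
```

In a real VLM the projection is trained jointly with the rest of the model, and the decoder's attention over this fused sequence is what lets "extract the total amount" attend directly to the patch embeddings covering the total.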

This is different from a pipeline where OCR extracts all text, then a separate model searches the text for the total. The pipeline approach loses the spatial information that tells you where on the page the total appears. The VLM retains it.

The evolution: OCR to layout analysis to VLMs

Document understanding has evolved through three distinct eras.

Era 1: OCR (1990s-2010s). Character recognition on scanned pages. The output is a text stream with no structural information. Every post-processing system -- template matching, regex extraction, zone detection -- existed to compensate for what OCR did not provide: structure and meaning.

Era 2: Layout analysis (2015-2023). Document AI systems combined OCR with layout detection models. These systems identified text regions, tables, figures, and headers in the document image, then used OCR within each region to extract text. The layout model provided structural information that pure OCR lacked. Google's Document AI, Amazon Textract, and Microsoft's Form Recognizer represented this era. They were a genuine improvement -- tables were extracted as tables, not as streams of text with lost column alignment -- but they were still multi-stage pipelines with compounding errors.

Era 3: Vision language models (2024-present). End-to-end multimodal models that process the document image directly. No separate OCR stage. No separate layout analysis. The model sees the page and understands it. Gemini, GPT-4V, and Claude introduced this capability, and it has rapidly matured into the dominant approach for document understanding.

The progression follows a clear arc: each era moves the understanding deeper into the model and eliminates pipeline stages. OCR extracted characters, layout analysis added structure, and VLMs add comprehension.

What VLMs understand that OCR does not

The gap between OCR output and VLM understanding is wide. Here are the specific capabilities that VLMs bring.

Layout and spatial relationships. A VLM knows that the text in the upper-right corner of an invoice is the invoice number because of where it appears, not because it is labeled "Invoice Number." It understands that a signature at the bottom of a page relates to the signer identified at the top. It recognizes that a sidebar contains supplementary information, not the main body text. Spatial reasoning is native to the model, not bolted on.

Tables. Tables are the single most painful failure mode of text-based extraction. OCR reads table cells left-to-right, producing text streams where column alignment is lost. A table with five columns and twenty rows becomes a paragraph of numbers with no structure. VLMs see the table as a visual grid. They understand row-column relationships, merged cells, spanning headers, and nested tables. They extract tabular data in its actual structure.

Handwriting. OCR engines trained on printed text struggle with handwriting. Even specialized handwriting OCR (ICR -- intelligent character recognition) has high error rates on cursive, messy, or overlapping handwriting. VLMs handle handwriting substantially better because they process the visual pattern holistically rather than segmenting individual characters. A handwritten note in the margin of a printed contract is readable to a VLM in a way it is not to most OCR engines.

Stamps and seals. Official stamps, notary seals, company chops -- these contain text embedded in circular or decorative graphics that OCR cannot parse. VLMs read them because they process the entire visual element, not just text regions.

Charts and graphs. A bar chart, a line graph, a pie chart -- these convey quantitative information visually. OCR captures at most the axis labels; the data itself is invisible to it. A VLM reads the chart and extracts the data: "Revenue grew from $2.3M in Q1 to $3.1M in Q4." It understands the visual encoding of data, not just the text around it.

Forms with checkboxes and radio buttons. A filled form with checked boxes, circled options, or crossed-out entries carries information in visual marks, not text. VLMs interpret these marks in context: "The box next to 'Married' is checked."

Document quality degradation. Faded text, coffee stains, skewed scans, partial page images, low-resolution photographs of documents. VLMs are more robust to visual noise than OCR because they process the image holistically rather than character-by-character. A partially obscured word can often be inferred from surrounding context -- something OCR cannot do.

How VLMs process a document page

The internal processing of a document image in a VLM follows a specific pipeline, though the stages are fused within the model rather than executed as separate steps.

Image encoding. The document page image is resized (or tiled, for high-resolution processing) and fed through a vision transformer. The transformer divides the image into patches (typically 14x14 or 16x16 pixels) and produces an embedding for each patch. These embeddings capture visual features: edges, textures, text regions, graphical elements.

Patch-to-token mapping. The patch embeddings are projected into the same vector space as the language model's text tokens. Each image patch becomes a "visual token" that the language model attends to alongside text tokens.

Spatial encoding. Position information is encoded so the model knows where each patch appears on the page. This enables spatial reasoning -- understanding that text in the upper-right corner is the invoice number and text at the bottom is the total.

Text decoding. The language model processes visual tokens and text tokens together through its attention mechanism, generating the extracted information or answer.

Resolution and tiling. Higher-resolution images produce more patches and therefore more visual tokens. Some models process the image at a single resolution; others tile the image into sub-images and process each tile separately, then reason across tiles. Tiling enables processing of high-resolution scans without downscaling, but increases the token count and cost.

For a typical document page scanned at 300 DPI, the image might be 2,550 x 3,300 pixels. Depending on the model, this produces anywhere from a few hundred to several thousand visual tokens. The token count directly affects processing time and cost.
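The token arithmetic above is easy to estimate. A minimal sketch, assuming a 16-pixel patch size and square tiles -- actual patch sizes, tiling schemes, and downscaling rules vary by model:

```python
import math

def visual_token_count(width_px, height_px, patch_px=16, tile_px=None):
    """Estimate visual tokens for a page image. Patch size and tiling
    behavior are illustrative assumptions, not any model's exact scheme."""
    if tile_px:
        # Tiling: split the page into tile_px squares, each processed
        # at full patch density, so token count grows with tile count.
        tiles = math.ceil(width_px / tile_px) * math.ceil(height_px / tile_px)
        per_tile = (tile_px // patch_px) ** 2
        return tiles * per_tile
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)

# A US Letter page at 300 DPI: 2,550 x 3,300 pixels.
print(visual_token_count(2550, 3300))                 # 33120 (no downscaling)
print(visual_token_count(2550, 3300, tile_px=1024))   # 49152 (12 tiles)
```

This is why production models downscale or cap tile counts: a raw 300 DPI page at full patch density costs tens of thousands of visual tokens per page.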

Current capabilities across models

The major VLMs each have distinct strengths for document tasks.

Gemini (Google) processes document images natively through Vertex AI. It handles multi-page documents by accepting multiple images in a single request, maintaining cross-page context. Its strength is high throughput and cost efficiency, particularly at the Flash tier, making it practical for batch document processing. It handles tables, charts, and mixed-content pages well.

GPT-4V and successors (OpenAI) deliver strong performance on complex document understanding tasks. They excel at interpreting nuanced layouts, reading degraded text, and handling documents that require significant reasoning -- like extracting implications from contract language alongside reading the text itself.

Claude (Anthropic) performs well on documents that require careful, detailed extraction. Its strength is in long, complex documents where maintaining consistency across many pages matters -- reducing the tendency to hallucinate details that are close to but not exactly what the document says.

All three model families have reached a level where they handle standard business documents (invoices, contracts, forms, reports) reliably. The differences show up on edge cases: degraded scans, unusual layouts, complex nested tables, documents in less common languages.

Limitations

VLMs are not perfect document readers. They have specific, well-documented failure modes.

Numerical precision. VLMs sometimes misread numbers, particularly in dense tables or fine print. A "3" might be read as an "8." A "$1,234,567.89" might lose a digit or transpose two. For financial documents where every digit matters, VLM output should be validated against source material. This is not a theoretical risk -- it happens often enough to be a design consideration.
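One practical mitigation is arithmetic cross-checking: many documents carry internal redundancy (line items that sum to a total, subtotals plus tax). A minimal validation sketch -- the field names are assumptions, not a fixed schema:

```python
from decimal import Decimal

def validate_invoice_total(line_items, stated_total, tolerance=Decimal("0.01")):
    """Cross-check VLM-extracted invoice data: line items should sum to
    the stated total. A mismatch flags a likely misread digit for review."""
    computed = sum(Decimal(item["amount"]) for item in line_items)
    diff = abs(computed - Decimal(stated_total))
    return diff <= tolerance, computed

ok, computed = validate_invoice_total(
    [{"amount": "1200.00"}, {"amount": "34.50"}, {"amount": "0.39"}],
    "1234.89",
)
print(ok, computed)  # True 1234.89
```

A failed check routes the document to human review rather than silently accepting a transposed digit.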

Resolution constraints. When images are downscaled to fit within the model's input size, fine details are lost. Small footnote text, tiny table cells, and dense spreadsheet printouts may become unreadable. Processing at higher resolution (through tiling) mitigates this but increases cost and latency.

Processing speed. A VLM processes a document page in 2 to 10 seconds, depending on the model tier and image complexity. OCR processes a page in under a second. For batch processing of thousands of pages, the speed difference matters. A 10,000-page archive takes OCR a couple of hours of sequential processing; at 2 to 10 seconds per page, the same archive takes a VLM roughly 6 to 28 hours unless requests are parallelized.
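The wall-clock estimate is simple arithmetic, and it shows why concurrency is the usual answer. A rough sketch (real throughput also depends on API rate limits and batching):

```python
def batch_hours(pages, seconds_per_page, concurrency=1):
    """Rough wall-clock estimate for a processing run; treat as a bound,
    since rate limits and retries add overhead in practice."""
    return pages * seconds_per_page / concurrency / 3600

print(round(batch_hours(10_000, 0.5), 1))                   # OCR: 1.4 hours
print(round(batch_hours(10_000, 5.0), 1))                   # VLM: 13.9 hours
print(round(batch_hours(10_000, 5.0, concurrency=10), 1))   # 10 parallel: 1.4
```

With modest parallelism the VLM gap largely disappears for mid-sized batches; it remains material at million-page scale.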

Hallucination. VLMs can generate plausible but incorrect text when the image is ambiguous. A partially visible word might be completed incorrectly. A value that is unclear in the scan might be stated confidently with the wrong number. The model does not flag uncertainty in the way a human would say "I cannot read this." It produces text, and the text might be wrong.

Multi-page coherence. Processing a 50-page document as 50 individual page images loses cross-page context. References like "as stated on page 12" or tables that span two pages require the model to reason across images. Some models handle this by accepting multiple images; others require each page to be processed independently, with a separate synthesis step.

Cost at scale. Visual tokens are expensive relative to text tokens. Processing a document image costs significantly more than processing the equivalent extracted text. For a document where text extraction is sufficient, using a VLM is paying for visual processing you do not need.

Scanned documents: where VLMs shine

Scanned documents are the strongest use case for VLMs over traditional text extraction.

A scanned contract from 2005 might have skewed pages, faded ink, handwritten annotations in the margins, a notary stamp on the signature page, and a table of payment schedules. A traditional OCR pipeline handles the printed text acceptably but fails on the annotations, misreads the stamp, and destroys the table structure. Post-processing to reconstruct the table from OCR output is fragile and error-prone.

A VLM processes the same scanned page and extracts the printed text, reads the handwritten annotations, interprets the notary stamp, and understands the table structure -- all in one pass. The marginal annotation that says "agreed per call 3/15" is captured alongside the printed clause it annotates. The notary details are extracted from the seal. The payment schedule is returned as structured tabular data.

Where VLMs still struggle with scanned documents:

Extremely low resolution. Scans at 72 or 100 DPI compress text into pixel clusters that even VLMs cannot resolve. The minimum for reliable VLM processing is roughly 150 DPI, with 300 DPI being the standard for good results.
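A cheap preflight check can enforce these thresholds before spending tokens on an unreadable scan. A minimal sketch, assuming US Letter pages -- the thresholds mirror the rough 150/300 DPI guidance above:

```python
def effective_dpi(width_px, height_px, page_w_in=8.5, page_h_in=11.0):
    """Estimate scan DPI from pixel dimensions, assuming US Letter size."""
    return min(width_px / page_w_in, height_px / page_h_in)

def resolution_check(width_px, height_px):
    dpi = effective_dpi(width_px, height_px)
    if dpi < 150:
        return "reject"   # below ~150 DPI: unreliable for VLM processing
    if dpi < 300:
        return "caution"  # usable, but fine print may be lost
    return "accept"

print(resolution_check(2550, 3300))  # 300 DPI scan -> accept
print(resolution_check(850, 1100))   # 100 DPI scan -> reject
```

Rejected pages can be routed to re-scanning rather than producing confidently wrong extractions.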

Heavy degradation. Documents that have been photocopied multiple times, water-damaged, or partially destroyed may have regions where text is genuinely unrecoverable by any method.

Large-format documents. Architectural drawings, engineering diagrams, and oversized tables that span poster-sized pages are difficult because the image must be either heavily downscaled (losing detail) or tiled into many sub-images (losing global context).

Mixed-content pages

The most compelling VLM capability is handling pages that contain multiple content types simultaneously.

Consider a financial report page that contains: a paragraph of narrative text, a table of quarterly figures, a bar chart showing the trend, a footnote with disclaimers, and a small logo in the header. A text extraction pipeline processes the paragraph and footnote well, struggles with the table, cannot read the chart, and ignores the logo. The chart data -- which might be the most important information on the page -- is completely lost.

A VLM processes the entire page as a single visual input. It reads the narrative, extracts the table in structured format, interprets the chart data, captures the footnotes, and identifies the company from the logo. Each content type is understood in context: the chart visualizes the data in the table, the footnote qualifies the table values, and the narrative summarizes the chart trend.

This unified processing eliminates the need for content-type detection as a preprocessing step. There is no stage where the system decides "this region is a table, send it to the table extractor; this region is a chart, send it to the chart reader." The VLM handles all content types through a single model call.

Cost analysis: VLMs vs traditional pipelines

The cost comparison between VLMs and traditional OCR pipelines is nuanced.

Per-page processing cost. Traditional OCR costs $0.001 to $0.01 per page, depending on the engine and volume. VLM processing costs $0.003 to $0.03 per page for Flash-tier models, and $0.01 to $0.10 per page for Pro-tier models. VLMs are 3x to 10x more expensive per page.

Pipeline complexity cost. Traditional OCR requires post-processing: template maintenance, rule engines, extraction logic, error correction. The engineering cost of building and maintaining this pipeline is substantial. VLMs eliminate the pipeline -- the model output is the extracted data. The engineering cost reduction can offset the higher per-page cost, especially for organizations processing documents from many different sources.

Accuracy cost. Traditional OCR with post-processing produces errors that require human review. The cost of human review -- finding and correcting extraction errors across thousands of documents -- is often the largest cost in a document processing operation. VLMs reduce error rates, which reduces review cost. For organizations where review labor is expensive (legal, financial, compliance), VLM accuracy improvements pay for themselves.

Total cost of ownership. For high-volume, uniform document processing (millions of identical forms), traditional OCR pipelines with well-tuned templates remain cheaper. For mixed-format, variable-layout document processing (hundreds of different vendor invoices, diverse contract formats, varied report structures), VLMs are often cheaper when total cost of ownership -- including engineering, maintenance, and review labor -- is considered.
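The trade-off above can be made concrete with a back-of-envelope model. Every input below is an illustrative assumption to be replaced with your own measurements; the point is the structure, not the numbers:

```python
def annual_cost(pages_per_year, per_page_cost, error_rate,
                review_cost_per_error, engineering_cost_per_year):
    """Total-cost-of-ownership sketch: processing + review labor +
    pipeline engineering. All parameters are illustrative assumptions."""
    processing = pages_per_year * per_page_cost
    review = pages_per_year * error_rate * review_cost_per_error
    return processing + review + engineering_cost_per_year

# Illustrative: 1M mixed-format pages/year.
ocr = annual_cost(1_000_000, 0.005, 0.05, 2.00, 150_000)  # heavy pipeline upkeep
vlm = annual_cost(1_000_000, 0.02, 0.01, 2.00, 30_000)    # minimal pipeline
print(f"OCR pipeline: ${ocr:,.0f}  VLM: ${vlm:,.0f}")
```

Under these assumptions the VLM wins despite a 4x per-page cost, because review labor and pipeline engineering dominate. Flip the assumptions (uniform forms, near-zero error rates from tuned templates) and the OCR pipeline wins instead.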

When to use VLMs vs text extraction

The decision framework is straightforward.

Use text extraction when:

  • The document is a native digital file (DOCX, HTML, structured PDF) with extractable text
  • The document has no visual elements that carry information (tables are simple, no charts or images)
  • Volume is very high (millions of pages) and per-page cost must be minimal
  • Speed is critical and seconds-per-page latency is unacceptable
  • The document format is uniform and a template-based approach works reliably

Use VLMs when:

  • The document is scanned or photographed (no extractable text layer)
  • The page contains mixed content: text, tables, charts, diagrams, handwriting
  • Document formats vary widely and template-based approaches break on new formats
  • Visual context matters: layout, positioning, stamps, signatures, annotations
  • Accuracy on complex pages justifies the higher per-page cost
  • You need a single-pass solution without building and maintaining extraction pipelines

Use both when:

  • You have digital documents with occasional scanned pages or image-heavy sections
  • Text extraction handles the bulk of pages cheaply, and VLMs handle the complex pages where text extraction falls short
  • You want text extraction as the fast path and VLMs as the fallback for pages that fail quality checks

docrew uses this hybrid approach in practice. The agent extracts text directly from DOCX and XLSX files using native parsers -- no need for visual processing when the document structure is already available in the file format. For PDFs and scanned documents where text extraction is insufficient, the agent processes pages through Gemini's vision capabilities, getting the spatial understanding and table extraction that text-only approaches miss. The decision is made per-document, based on the file type and the content complexity the agent detects during initial processing.
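A per-document router of this kind reduces to a small decision function. This sketch mirrors the hybrid approach described above, but the signals and file-type rules are assumptions, not docrew internals:

```python
from pathlib import Path

# Formats whose structure is available directly in the file (assumption).
NATIVE_PARSEABLE = {".docx", ".xlsx", ".html", ".txt"}

def choose_pipeline(path, has_text_layer=False, has_visual_content=False):
    """Route a document to text extraction or VLM processing.
    Hypothetical signals: has_text_layer (PDF text layer present),
    has_visual_content (charts, scans, stamps detected on inspection)."""
    ext = Path(path).suffix.lower()
    if ext in NATIVE_PARSEABLE and not has_visual_content:
        return "text-extraction"
    if ext == ".pdf" and has_text_layer and not has_visual_content:
        return "text-extraction"
    return "vlm"  # scans, mixed content, or no extractable text

print(choose_pipeline("report.docx"))                     # text-extraction
print(choose_pipeline("scan.pdf"))                        # vlm
print(choose_pipeline("filing.pdf", has_text_layer=True)) # text-extraction
```

The fast path stays cheap, and only pages that actually need visual understanding pay the VLM premium.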

Where VLMs are heading

The trajectory is toward higher resolution, lower cost, and faster processing.

Resolution improvements will make fine print, dense tables, and detailed diagrams readable without tiling workarounds. Models that process full-resolution page scans natively will eliminate the resolution-accuracy trade-off.

Cost reduction through smaller, specialized document-understanding models will make VLM processing competitive with OCR on a per-page basis. Flash-tier vision models are already approaching this price point for standard business documents.

Speed improvements will close the latency gap with OCR. Batch processing APIs that optimize throughput for high-volume document processing will make VLMs practical for the million-page archives that currently rely on OCR.

Structured output will become more reliable. Models that guarantee valid JSON, valid table structures, and consistent field naming will reduce the post-processing needed on VLM output.

The end state is a single model call that takes a document page -- scanned, photographed, or rendered from a digital file -- and returns structured, accurate, complete data. No OCR preprocessing, no layout analysis, no template matching, no post-processing rules. One image in, structured data out. We are not there yet, but the gap is closing with each model generation.
