Processing Mixed-Format Documents: PDFs, DOCX, Images in One Workflow

Real document collections mix PDFs, Word files, spreadsheets, and images. Learn how an AI agent processes all formats in a single workflow without format-specific tools.


Documents don't arrive in one format

You just received a due diligence package: 30 PDFs, 15 Word documents, 5 Excel spreadsheets, and 20 scanned images of signed agreements. You need to extract key terms from all of them.

In a traditional workflow, this means four different tools. A PDF extractor for the PDFs. A Word parser for the DOCX files. A spreadsheet reader for the Excel files. An OCR tool for the scanned images. Each tool has its own interface, its own output format, and its own limitations. Combining the results requires manual work.

This is the mixed-format problem. Real document collections are heterogeneous. A single project folder might contain:

  • Native PDFs (text-based)
  • Scanned PDFs (image-based)
  • DOCX files from Word
  • XLSX spreadsheets
  • PNG/JPEG images of documents
  • Occasionally PPTX, RTF, or HTML files

Each format stores content differently. PDFs position text on a canvas. DOCX wraps text in XML. XLSX stores values in cells. Images store pixels. Extracting the same information from each format requires fundamentally different approaches -- if you're using format-specific tools.
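To make "DOCX wraps text in XML" concrete, here is a minimal Python sketch using only the standard library. It is an illustration of the container format, not docrew's parser: the helper names `docx_paragraphs` and `make_sample_docx` are hypothetical, and the sample archive holds only the main document part rather than a complete DOCX file.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the main document part of a DOCX file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(data: bytes) -> list[str]:
    """Return the text of each non-empty paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the w:t pieces.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def make_sample_docx() -> bytes:
    """Build a tiny DOCX-like archive in memory (illustration only)."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Effective Date: </w:t></w:r>"
        "<w:r><w:t>March 15, 2026</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Governing law: Delaware</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", body)
    return buf.getvalue()


print(docx_paragraphs(make_sample_docx()))
```

The same sentence in a PDF would be a set of positioned glyphs, and in a scan it would be pixels, which is why format-specific tools diverge so quickly.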

The format-agnostic approach

With an AI agent, you don't need a separate tool for each format. The agent reads each document at the content level, regardless of the container format.

When docrew processes a mixed-format folder:

For native PDFs: The agent extracts the text layer and processes it. Tables, paragraphs, headers, and footers are all accessible through the text content.

For scanned PDFs and images: The agent uses visual understanding to read the document. It processes the image and extracts text, layout, and structure from what it sees.

For DOCX files: The agent reads the document's semantic structure -- paragraphs, headings, tables, lists -- and processes the content. Word's styles, tracked changes, and comments provide additional context.

For XLSX files: The agent reads cell values, understands the spreadsheet structure (headers, data rows, formulas), and extracts the requested information.

The output is unified. Whether a contract came as a PDF or a DOCX, the extracted data goes into the same structured output. Whether a financial table was in a spreadsheet or embedded in a PDF, the numbers end up in the same format.
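One way to picture a unified output is a single record type that every document maps into, whatever its source format. The sketch below is an assumption about what such a record could look like, not docrew's actual schema: `AgreementRecord`, its fields, and the sample values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AgreementRecord:
    """Hypothetical unified record; the same shape for every source format."""
    source_file: str
    source_format: str             # kept for traceability only
    parties: list[str]
    effective_date: str            # ISO 8601, e.g. "2026-03-15"
    contract_value: Optional[str]  # decimal string, or None if not stated
    governing_law: Optional[str]


# Illustrative values only; one record came from a PDF, the other from a DOCX.
records = [
    AgreementRecord("msa.pdf", "pdf", ["Acme Corp", "Beta LLC"],
                    "2026-03-15", "1234.56", "Delaware"),
    AgreementRecord("nda.docx", "docx", ["Acme Corp", "Gamma Inc"],
                    "2026-01-10", None, "New York"),
]

rows = [asdict(record) for record in records]
print(json.dumps(rows, indent=2))
```

Because both rows share the same keys, downstream code never has to branch on where a value came from.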

A unified workflow in practice

Scenario: Due diligence document review.

A private equity firm receives a virtual data room with 200 documents across multiple formats. They need to extract key terms from every agreement: parties, effective dates, values, termination provisions, change of control clauses.

Traditional approach:

  1. Sort documents by format (4 piles)
  2. Process PDFs with a PDF tool
  3. Process DOCX files with a Word tool
  4. Process scanned documents with OCR + extraction
  5. Manually handle spreadsheets
  6. Merge all outputs into a single review spreadsheet
  7. Reconcile format differences in the merged output

docrew approach:

  1. Point the agent at the data room folder
  2. Specify: "For each agreement, extract parties, effective date, termination date, contract value, change of control provisions, and governing law"
  3. The agent processes all 200 documents regardless of format
  4. One unified spreadsheet output

No format sorting. No tool switching. No output merging. The agent handles format detection and content extraction as a single operation.

Scenario: Monthly vendor document processing.

An accounts payable department receives documents from vendors in whatever format the vendor prefers. Some send PDF invoices. Others send Word documents. Some email Excel price lists. A few still fax (resulting in scanned images).

With docrew, the workflow doesn't change based on format:

  1. Collect all incoming vendor documents into a folder
  2. Run the extraction: "Extract vendor name, document type, date, amounts, and key terms from each document"
  3. The agent identifies the document type (invoice, price list, statement, contract) and extracts the relevant fields
  4. Output is a single structured dataset regardless of source format

Scenario: Research paper collection.

A researcher downloads papers from multiple sources. Some are journal PDFs. Some are Word manuscripts. Some are presentation slides exported as images. Some are supplementary data in Excel format.

The extraction request: "For each document, extract the title, authors, methodology, and key findings."

docrew processes papers regardless of format, producing a unified literature review database.

Format-specific considerations

While the AI agent handles all formats, understanding format-specific nuances helps you get better results:

PDFs with text layers produce the most accurate extraction. The text is already encoded and the agent reads it directly.

Scanned PDFs and images require visual processing, so accuracy depends on scan quality. Scans at 300 DPI or higher generally extract reliably. Low-resolution scans and photographs may need manual review of specific values such as amounts and dates.
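If you want to flag low-resolution scans before extraction, the effective resolution can be estimated from pixel dimensions and physical page size. This is a rough sketch that assumes US Letter pages (8.5 x 11 inches) unless told otherwise. The function names and the review threshold are illustrative and are not part of docrew.

```python
def effective_dpi(width_px: int, height_px: int,
                  page_w_in: float = 8.5, page_h_in: float = 11.0) -> float:
    """Estimate scan resolution as pixels per inch along the weaker axis."""
    return min(width_px / page_w_in, height_px / page_h_in)


def needs_review(width_px: int, height_px: int, threshold: float = 300.0) -> bool:
    """Flag scans whose estimated resolution falls below the threshold."""
    return effective_dpi(width_px, height_px) < threshold


# A Letter page scanned at 2550 x 3300 pixels works out to 300 DPI.
print(effective_dpi(2550, 3300), needs_review(2550, 3300))
# Half those dimensions is 150 DPI and would be flagged.
print(effective_dpi(1275, 1650), needs_review(1275, 1650))
```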

DOCX files often contain the richest content. Track changes, comments, and metadata provide additional context. The agent can access document properties (author, creation date, last modified) that aren't visible in the rendered document.

XLSX files are already structured. The agent reads the cell structure directly. The main challenge is understanding which cells are headers, which are data, and how sheets relate to each other.

Images (PNG, JPEG) of documents work similarly to scanned PDFs. The agent reads the visual content. Quality matters -- a clear photograph works well; a blurry screenshot may not.

docrew's Rust-based parsers handle DOCX and XLSX files natively, extracting semantic content from the Office Open XML format. Scanned PDFs and images go through the language model's visual processing, while PDFs with a text layer are read through text extraction. The agent routes each file to the appropriate extraction path automatically.
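The routing idea can be sketched as a small dispatch function. This is a simplified Python illustration under stated assumptions, not docrew's Rust implementation: the `Route` names and `choose_route` are hypothetical, and the `has_text_layer` flag stands in for the text-layer check described under format detection.

```python
from enum import Enum


class Route(Enum):
    NATIVE_PARSER = "native_parser"  # DOCX / XLSX (Office Open XML)
    TEXT_LAYER = "text_layer"        # PDFs that carry an embedded text layer
    VISUAL = "visual"                # scanned PDFs and images


def choose_route(kind: str, has_text_layer: bool = False) -> Route:
    """Pick an extraction path for a detected content type."""
    if kind in {"docx", "xlsx"}:
        return Route.NATIVE_PARSER
    if kind == "pdf":
        return Route.TEXT_LAYER if has_text_layer else Route.VISUAL
    if kind in {"png", "jpeg"}:
        return Route.VISUAL
    raise ValueError(f"no extraction path for content type {kind!r}")


print(choose_route("docx"))
print(choose_route("pdf", has_text_layer=True))
print(choose_route("pdf"))
```

Keeping the routing decision in one place means a new format only needs one new branch, while everything downstream stays the same.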

Handling format detection

The agent doesn't rely solely on file extensions for format detection. A file named "contract.pdf" might actually be a scanned image stored in a PDF container. A file named "data.xlsx" might contain narrative text rather than tabular data.

docrew detects the actual content type:

  • Text-based PDF vs image-based PDF: The agent checks for a text layer. If present, it uses text extraction. If absent (scanned document), it uses visual processing.
  • Structured XLSX vs narrative XLSX: The agent examines the content. If it's tabular data, it processes cells. If it's a text-heavy layout, it reads the narrative content.
  • Valid vs corrupted files: If a file can't be read (password-protected, corrupted, incompatible format), the agent reports the failure and continues with the remaining files.
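Content-based detection can be approximated by inspecting a file's leading bytes instead of its name. The sketch below uses well-known file signatures and the Python standard library. It is an illustration rather than docrew's detection logic, the names `sniff_format` and `triage` are hypothetical, and it does not attempt the PDF text-layer check, which needs a PDF library.

```python
import io
import zipfile


def sniff_format(data: bytes) -> str:
    """Guess the real content type from file signatures, ignoring the extension."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"PK\x03\x04"):
        # DOCX and XLSX are both ZIP archives; the parts inside tell them apart.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            return "unknown"
        if "word/document.xml" in names:
            return "docx"
        if "xl/workbook.xml" in names:
            return "xlsx"
    return "unknown"


def triage(files: dict[str, bytes]) -> tuple[dict[str, str], list[str]]:
    """Detect every file; record unreadable ones and keep going."""
    detected: dict[str, str] = {}
    failures: list[str] = []
    for name, data in files.items():
        kind = sniff_format(data)
        if kind == "unknown":
            failures.append(name)  # report the failure, continue with the rest
            continue
        detected[name] = kind
    return detected, failures
```

Note that `triage` never raises on a bad file: one corrupted document should not stop a 200-document batch.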

Because detection is based on content rather than file names, you don't need to pre-sort or pre-process your document collection. Drop everything into a folder and let the agent figure it out.

Maintaining consistency across formats

The biggest risk in mixed-format processing is inconsistent output. A date extracted from a PDF might look different from the same date extracted from a DOCX file. An amount from a spreadsheet might have different decimal formatting than one from a scanned invoice.

docrew normalizes output regardless of source format:

Dates are standardized to a consistent format. Whether the source was "March 15, 2026" in a Word document or "03/15/26" in a PDF, the output uses your specified format.

Amounts are normalized with consistent decimal places and currency symbols. "$1,234.56" from a PDF and "1234.56" from a spreadsheet cell both produce the same output format.

Names and entities are extracted consistently. "ACME Corporation" from a PDF header and "Acme Corp" from a Word footer are recognized as the same entity.

Field mapping is consistent. The "total" field means the same thing whether it came from a PDF invoice, a Word quotation, or an Excel price list.

This normalization is essential for the output to be useful as a unified dataset. Without it, you'd spend as much time cleaning the merged output as you saved on extraction.
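The normalization rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not docrew's implementation: the function names are hypothetical, slash-separated dates are assumed to be US-style month-first, amounts are assumed to be in dollars, and the legal-suffix list is deliberately short.

```python
import re
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

# Assumption: slash dates are month-first (US style). Adjust for other locales.
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%y", "%m/%d/%Y", "%Y-%m-%d")
LEGAL_SUFFIXES = {"corporation", "corp", "incorporated", "inc",
                  "llc", "ltd", "limited", "company", "co"}


def normalize_date(raw: str) -> str:
    """Convert a recognized date string to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def normalize_amount(raw: str) -> str:
    """Strip the dollar sign and thousands separators; keep two decimal places."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    return str(Decimal(cleaned).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def entity_key(name: str) -> str:
    """Build a comparison key: lowercase tokens with trailing legal suffixes dropped."""
    tokens = re.findall(r"[a-z0-9]+", name.casefold())
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


print(normalize_date("March 15, 2026"), normalize_date("03/15/26"))
print(normalize_amount("$1,234.56"), normalize_amount("1234.56"))
print(entity_key("ACME Corporation") == entity_key("Acme Corp"))
```

Amounts are handled with `Decimal` rather than floats so that values such as 1234.56 survive round trips without binary rounding artifacts.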

Building repeatable mixed-format workflows

For recurring document processing tasks, define your workflow once and reuse it:

  1. Define the extraction schema -- what fields you need from which types of documents.
  2. Set the output format -- CSV, JSON, or Excel with your preferred structure.
  3. Establish naming conventions -- how output files are named and organized.
  4. Document exceptions -- what to do when files can't be processed.

With docrew, this workflow definition is a natural language instruction that the agent applies to every batch. When new documents arrive -- in any format -- the same instruction processes them consistently.
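Even when the instruction itself is natural language, it can help to keep the four decisions above written down in one place. The sketch below is one hypothetical way to record them and to write a batch to CSV with the standard library. `Workflow`, its fields, and `write_csv` are illustrative names, not a docrew configuration format.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """Hypothetical record of a reusable extraction workflow."""
    instruction: str                                  # natural language request
    fields: tuple[str, ...]                           # 1. extraction schema
    output_format: str = "csv"                        # 2. output format
    output_name: str = "{batch}_extraction.csv"       # 3. naming convention
    on_failure: str = "log_and_continue"              # 4. exception policy

    def output_path(self, batch: str) -> str:
        return self.output_name.format(batch=batch)


def write_csv(workflow: Workflow, records: list[dict]) -> str:
    """Render records as CSV using the workflow's field order."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["source_file", *workflow.fields],
        extrasaction="ignore",   # drop keys outside the schema
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)    # missing fields are written as empty cells
    return buf.getvalue()


vendor_docs = Workflow(
    instruction="Extract vendor name, date, and total from each document",
    fields=("vendor_name", "date", "total"),
)
```

Applying the same `Workflow` to each month's batch keeps column order and file naming stable, so downstream spreadsheets and imports do not need to change when a vendor switches formats.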

For teams that process recurring mixed-format document sets (monthly vendor packages, quarterly regulatory filings, annual audit packages), this repeatability is the key benefit. The format of incoming documents can change without affecting your workflow. A vendor that switches from PDF invoices to DOCX invoices doesn't require any update to your processing pipeline.

The single-tool advantage

Processing all formats with a single tool has several operational benefits:

Reduced tool sprawl. Instead of four or five specialized tools (PDF extractor, DOCX parser, OCR engine, spreadsheet reader, image processor), you use one.

Simplified training. Team members learn one tool, one workflow, one output format. No need to train on different tools for different document types.

Consistent output. All data goes into the same structured format regardless of source. No format-specific quirks to handle downstream.

Lower maintenance. One tool to update, one tool to support, one tool to integrate with your systems.

Faster processing. No time spent sorting documents by format, switching between tools, or merging outputs. The entire collection goes through a single pass.

docrew delivers this single-tool advantage with the added benefit of local processing. Your mixed-format document collection stays on your device. The extraction happens on your machine. The output is written to your file system. No cloud service sees your documents, regardless of their format.
