January 31, 2026 · 9 min read

How to Process a Folder of Mixed Documents (PDF, DOCX, XLSX)

Project folders contain PDFs, Word docs, spreadsheets, and more. Learn how to point an AI agent at a mixed-format folder and get unified, structured output regardless of source format.

The mixed-format reality

No project lives in a single file format. A construction bid package contains PDF drawings, a Word specification document, and Excel cost estimates. A due diligence data room holds PDF financial statements, DOCX board resolutions, and XLSX cap tables. A regulatory submission folder has PDF forms, Word narrative sections, and spreadsheet data appendices.

Every project folder is a mix. And when you need to extract information or analyze data across that folder, the mix is the problem. PDFs need text extraction (or OCR for scanned documents). Word files need OOXML parsing to preserve structure. Spreadsheets need cell-by-cell reading with awareness of formulas, merged cells, and multiple sheets.

Traditionally, processing a mixed folder means opening each file in its native application, manually finding what you need, and transcribing it into a unified format. For a folder of 40 files across four formats, this is a full day of tedious work before you begin actual analysis.

The alternative: let an AI agent handle the format differences. You describe what you need, and the agent delivers a unified result.

What makes mixed-format processing difficult

The challenge is not just that different formats exist -- it is that the same type of information looks different in each format.

A date in a PDF might be rendered text ("January 15, 2026"), a form field, or part of a scanned image. The same date in a DOCX is editable text, possibly with a date picker field. In an XLSX, it might be a date-typed cell, a text string, or a bare number (Excel stores dates internally as serial day counts, so the cell may hold 46037 where you expect a date).
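To make the date problem concrete, here is a minimal sketch of normalizing the three representations to one ISO string. It is an illustration only, not a description of how docrew works internally. The function names are hypothetical, the text format is assumed to be "Month D, YYYY", and the workbook is assumed to use Excel's default 1900 date system.

```python
from datetime import date, datetime, timedelta

# Assumption: the workbook uses Excel's default 1900 date system.
# Excel treats 1900 as a leap year, so anchoring at 1899-12-30 yields
# correct dates for serials of 61 (March 1, 1900) and later.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial: float) -> date:
    """Convert an Excel serial day count to a date, dropping any time fraction."""
    return EXCEL_EPOCH + timedelta(days=int(serial))


def normalize_date(value) -> str:
    """Return an ISO 8601 date string for a date object, serial number, or text."""
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return value.date().isoformat()
    if isinstance(value, date):
        return value.isoformat()
    if isinstance(value, (int, float)):
        return excel_serial_to_date(value).isoformat()
    # Assumption: text dates look like "January 15, 2026" (English month names).
    return datetime.strptime(value.strip(), "%B %d, %Y").date().isoformat()
```

With this sketch, the text "January 15, 2026" and the serial 46037 both normalize to the same string, which is the kind of reconciliation a unified output depends on.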

A financial table in a PDF is a grid of positioned text with no actual table structure -- the visual layout implies rows and columns, but the underlying data is just coordinates and characters. The same table in Word has explicit rows and cells with borders and formatting. In Excel, it is native tabular data with types, formulas, and potentially cross-sheet references.
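The "coordinates and characters" point can be illustrated with a small sketch that rebuilds rows from positioned words. It assumes a PDF extractor has already produced (x, y, text) tuples with y increasing down the page; the function name and the tolerance value are hypothetical, and real tables usually need more careful column detection than this.

```python
def rows_from_positioned_words(words, y_tolerance=2.0):
    """Group (x, y, text) tuples into rows of text, ordered left to right.

    Words whose y coordinate is within y_tolerance of the first word in the
    current row are treated as belonging to that row.
    """
    rows = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order cells by their x position.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


words = [
    (50, 100.0, "Item"), (200, 100.5, "Cost"),
    (50, 120.0, "Grading"), (200, 119.6, "45000"),
]
table = rows_from_positioned_words(words)
# table == [["Item", "Cost"], ["Grading", "45000"]]
```

Nothing in the input says "this is a table"; the rows exist only because nearby y values were grouped, which is why PDF tables are harder to read reliably than their Word or Excel equivalents.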

A tool that treats all formats identically misses format-specific information. A tool that requires you to specify handling for each format creates more work than it saves. The useful middle ground is an agent that detects each file's format and applies the appropriate reading strategy while delivering consistent output.
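Format detection itself is the easy part. As a rough sketch (the function name and return labels are hypothetical, and this is not docrew's implementation), a file can be classified from its leading bytes rather than its extension. DOCX and XLSX are both ZIP containers, so they are told apart by the parts they contain.

```python
import io
import zipfile


def detect_format(data: bytes) -> str:
    """Classify a file from its content instead of trusting the extension."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"PK\x03\x04"):  # ZIP container: may be an Office file
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                names = set(zf.namelist())
        except zipfile.BadZipFile:
            return "unknown"
        if "word/document.xml" in names:
            return "docx"
        if "xl/workbook.xml" in names:
            return "xlsx"
        return "zip"
    return "unknown"
```

A dispatcher built on this kind of check can route each file to a format-specific reader while the caller only ever sees one result shape.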

Setting up mixed-format processing in docrew

Here is how to process a folder of mixed documents and produce unified output.

Step 1: Organize your folder. Place all project documents in a single folder. They can be in subfolders -- the agent traverses the directory tree. You do not need to separate files by format.

For this walkthrough, assume a folder called "Riverside Development" containing 35 files: 12 PDFs (site plans, environmental reports, permit applications), 10 DOCX files (contracts, memos, specifications), 8 XLSX files (budgets, schedules, cost estimates), and 5 image files (site photos, survey maps).
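If you want to check what is in a folder before handing it over, a few lines of Python give the same kind of per-format count used in this walkthrough. This is a convenience sketch with a hypothetical function name; it counts by file extension across all subfolders and is not part of docrew.

```python
from collections import Counter
from pathlib import Path


def inventory_by_extension(root):
    """Count files under root (including subfolders) by lowercase extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return dict(counts)
```

Calling inventory_by_extension on the project folder returns a dictionary mapping each extension to its file count, which makes it easy to spot formats you did not expect.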

Step 2: Describe what you need. The key to effective mixed-format processing is telling the agent what information you want, not how to extract it. Let the agent handle the format-specific reading.

A project summary prompt: "Process all documents in the Riverside Development folder. For each, give me: filename, document type (contract, report, budget, schedule, correspondence, or other), a 2-3 sentence summary, and any key dates, dollar amounts, or deadlines. Write results to a spreadsheet."

A targeted extraction prompt: "Extract all references to the project timeline from every document: start dates, milestone dates, completion dates, and schedule dependencies. Include source document and page or section for each date."

A compliance prompt: "Identify every permit, license, or regulatory approval referenced. For each: permit type, issuing authority, status (approved, pending, expired), and the source document."

Step 3: The agent processes each file. The agent scans the folder, identifies every file, and processes each one using the appropriate method.

For native PDFs, it extracts the embedded text directly. For scanned PDFs -- where the content is an image rather than embedded text -- it applies optical character recognition (OCR). In both cases it preserves layout structure to identify tables, headers, and sections.

For DOCX files, it reads the internal XML structure, preserving headings, numbered lists, tables, and paragraph hierarchy. Findings reference section headings rather than just page numbers.
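For readers curious what "internal XML structure" means, a DOCX file is a ZIP archive whose main text lives in word/document.xml. The sketch below uses only the Python standard library to list each paragraph with its style ID, which is how headings can be told apart from body text. The function name is hypothetical, and the sketch ignores tables, footnotes, and tracked changes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(data: bytes):
    """Return a list of (style_id, text) pairs, one per paragraph.

    style_id is None when the paragraph has no explicit style.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{W}p"):
        style = p.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(f"{W}val") if style is not None else None
        # A paragraph's text is split across runs; join every text node.
        text = "".join(t.text or "" for t in p.iter(f"{W}t"))
        paragraphs.append((style_id, text))
    return paragraphs
```

Paragraphs whose style ID is something like "Heading1" mark section boundaries, which is what makes it possible to cite a finding by section heading rather than by page.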

For XLSX files, it reads each worksheet, interpreting cell values, data types, and layout. It recognizes headers, data rows, totals, and common patterns like budget line items and schedule bars. It handles merged cells and multi-sheet workbooks.
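Merged cells are a good example of why spreadsheets need special care: only the top-left cell of a merged range holds the value, and the rest read as empty. The following sketch, with hypothetical function names, copies that value across the range given A1-style references such as "A1:C1". It assumes the sheet has already been loaded into a list of rows.

```python
import re


def col_to_index(col: str) -> int:
    """Convert a column label to a zero-based index ("A" -> 0, "AA" -> 26)."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n - 1


def parse_range(ref: str):
    """Parse "A1:C2" into zero-based (row1, col1, row2, col2)."""
    match = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", ref)
    if match is None:
        raise ValueError(f"not a cell range: {ref!r}")
    c1, r1, c2, r2 = match.groups()
    return int(r1) - 1, col_to_index(c1), int(r2) - 1, col_to_index(c2)


def fill_merged(grid, merged_refs):
    """Return a copy of grid with each merged range filled from its top-left cell."""
    filled = [row[:] for row in grid]
    for ref in merged_refs:
        r1, c1, r2, c2 = parse_range(ref)
        value = grid[r1][c1]
        for r in range(r1, r2 + 1):
            for c in range(c1, c2 + 1):
                filled[r][c] = value
    return filled


grid = [["Phase 1", None, None], ["Task", "Start", "End"]]
filled = fill_merged(grid, ["A1:C1"])
# filled[0] == ["Phase 1", "Phase 1", "Phase 1"]
```

Without this step, a banner header such as "Phase 1" would appear to apply to the first column only.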

For images, it examines content to determine what information is present -- a photograph of a construction site provides different information than a scanned survey map. The agent extracts readable text and describes visual content.

Processing 35 documents typically takes several minutes, depending on document length and complexity.

Step 4: Unified output. Regardless of source format, the agent produces a single, consistent output. If you asked for a spreadsheet summary, you get one spreadsheet with a row for every document and identical columns. The format differences are absorbed by the agent; the output is uniform.
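"Identical columns" amounts to a fixed schema applied to every record, whatever its source. As a sketch under assumed column names (they are illustrative, not docrew's actual output schema), Python's csv.DictWriter can enforce this and fill gaps with an explicit placeholder.

```python
import csv
import io

# Hypothetical column set for a document summary sheet.
COLUMNS = ["filename", "format", "doc_type", "summary", "key_dates"]


def write_unified_csv(records, out):
    """Write records with the same columns, marking missing values explicitly."""
    writer = csv.DictWriter(
        out,
        fieldnames=COLUMNS,
        restval="Not specified",  # used when a record lacks a column
        extrasaction="ignore",    # drop fields that are not in the schema
    )
    writer.writeheader()
    for record in records:
        # Treat None and empty strings the same as an absent key.
        writer.writerow({k: v for k, v in record.items() if v not in (None, "")})


records = [
    {"filename": "budget.xlsx", "format": "xlsx", "doc_type": "budget",
     "summary": "Cost estimate by phase.", "key_dates": "2026-03-01"},
    {"filename": "site_photo.jpg", "format": "jpeg", "doc_type": "other",
     "summary": "Photo of the north lot.", "key_dates": None, "camera": "n/a"},
]
buffer = io.StringIO()
write_unified_csv(records, buffer)
```

Every row ends up with the same five columns, and a file with no dates shows "Not specified" rather than a blank cell that could be mistaken for an extraction failure.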

A practical example: project status review

A project manager needs a status update for the Riverside Development project. The information is scattered across 35 files.

She tells docrew: "Go through all documents in the Riverside Development folder and pull together: the current project timeline with key milestones, total budget and any cost overruns, status of all permits and approvals, and any open issues or risks from recent correspondence. Organize into sections for a status report."

The agent processes the folder. From the XLSX schedule, it extracts milestones and identifies slippage. From the budget spreadsheets, it pulls original budget, change orders, and current spending. From PDF permit documents, it builds a status table. From DOCX memos, it identifies open issues and recent decisions.

The output is a structured report with four sections -- timeline, budget, permits, issues -- with data points sourced from specific documents. The project manager reviews, adds context only she knows, and sends it. A task that would have taken half a day took 15 minutes.

Handling common challenges

Large folders. A folder with 100+ files takes longer but works the same way. Narrow the scope if needed: "Process only documents modified in the last 30 days" or "Process only the PDF and DOCX files, skip the spreadsheets."

Deeply nested subfolders. The agent traverses the full tree by default. Limit scope if needed: "Process only files in the Contracts and Permits subfolders."

Duplicate and versioned files. Project folders often contain "Budget_v1," "Budget_v2," "Budget_FINAL," "Budget_FINAL_revised." The agent processes all of them unless you specify otherwise: "For documents with multiple versions, only process the most recent one based on filename or modification date."
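If you prefer to prune versions yourself before processing, the "most recent by modification date" rule can be sketched as follows. The suffix list (v1, final, revised, draft) and the function names are assumptions for illustration; real naming conventions vary, so treat the grouping as a heuristic.

```python
import re
from pathlib import PurePath

# Assumption: versions are marked by trailing tokens such as _v2, _FINAL, _revised.
_VERSION_SUFFIX = re.compile(r"([_\- ](v\d+|final|revised|draft))+$", re.IGNORECASE)


def base_key(filename: str):
    """Return (stem without version suffixes, extension), both lowercased."""
    p = PurePath(filename)
    return (_VERSION_SUFFIX.sub("", p.stem).lower(), p.suffix.lower())


def latest_versions(files):
    """Given (filename, modified_time) pairs, keep the newest file per group."""
    latest = {}
    for name, mtime in files:
        key = base_key(name)
        if key not in latest or mtime > latest[key][1]:
            latest[key] = (name, mtime)
    return sorted(name for name, _ in latest.values())


files = [
    ("Budget_v1.xlsx", 100), ("Budget_v2.xlsx", 200),
    ("Budget_FINAL.xlsx", 300), ("Budget_FINAL_revised.xlsx", 400),
    ("Site_Plan.pdf", 50),
]
kept = latest_versions(files)
# kept == ["Budget_FINAL_revised.xlsx", "Site_Plan.pdf"]
```

Using modification time rather than the filename avoids guessing whether "FINAL" or "v3" is newer.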

Corrupt or unreadable files. Occasionally a PDF is corrupted or a spreadsheet is password-protected. The agent identifies these, reports them as unreadable with the reason, and continues processing the rest. You get a clean result set plus a list of files needing manual attention.
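The "report and continue" behavior is a standard batch-processing pattern. A minimal sketch, assuming a hypothetical read_file callable that raises an exception when a file cannot be read:

```python
def process_all(paths, read_file):
    """Read every path, collecting failures instead of stopping at the first one.

    Returns (results, unreadable), where unreadable holds (path, reason) pairs.
    """
    results, unreadable = [], []
    for path in paths:
        try:
            results.append((path, read_file(path)))
        except Exception as exc:  # one bad file should not abort the batch
            unreadable.append((path, f"{type(exc).__name__}: {exc}"))
    return results, unreadable
```

Keeping the reason alongside each failed path is what turns the second list into an actionable to-do list, for example "password required" versus "file truncated."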

Scanned documents. Scanned PDFs require character recognition, which adds processing time and can introduce errors when scan quality is poor or pages carry handwritten annotations. The agent flags low-confidence extractions for verification.

Organizing the output

How you structure results determines their usefulness.

Document inventory. A spreadsheet with one row per file: filename, format, type, date, summary, and key metadata. A complete map of the folder's contents -- useful as the first step before drilling into specifics.

Extracted data table. Organized by data type rather than by document: all dates in one table, all dollar amounts in another. Each row includes source document and location. Useful for aggregating specific information across the entire folder.

Narrative summary. A written report synthesizing information from multiple documents, citing sources throughout. Useful for executive summaries, status reports, and briefing documents.

Exception report. Inconsistencies between documents, missing information, expired dates, or values outside expected ranges. Useful for audit, compliance, and quality review workflows.

You can ask for any combination. The agent writes output to your local file system in the format you specify -- CSV, XLSX, DOCX, or plain text.

Tips for effective mixed-format processing

Describe the information, not the format. Say "extract all dollar amounts" rather than "parse the PDF tables and read the Excel cells." The agent knows how to handle each format. Your job is to define what matters.

Start with an inventory. For unfamiliar folders, begin with "List every file with its type, date, and a one-line summary" to get the lay of the land before requesting specific extractions.

Be explicit about scope. "Process all documents" is clear. "Process the important ones" is not -- the agent does not know your definition of important. If you mean contracts and budgets only, say that.

Specify handling for missing data. "If a document doesn't mention a completion date, mark it as 'Not specified' in the output." This prevents gaps in your results from being ambiguous.

Use follow-up prompts. After initial processing, ask targeted questions: "Which documents mention the environmental assessment?" or "What is the total across all budget spreadsheets?" The agent has already read every file, so follow-ups are fast.

Process regularly. Project folders grow over time. Run a processing pass after each batch of new documents arrives. The agent can process only new files and append to existing output, keeping your extracted data current.
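One common way to decide which files are "new" is to keep a manifest of content hashes from the previous run. The sketch below shows the idea with a hypothetical function name; it is not a description of how docrew tracks files. The caller is expected to save the returned manifest only after processing succeeds.

```python
import hashlib
from pathlib import Path


def pending_files(root, manifest):
    """Compare files under root with a {relative_path: sha256} manifest.

    Returns (pending, updated_manifest), where pending lists files that are
    new or whose content changed since the manifest was written.
    """
    root = Path(root)
    updated = dict(manifest)
    pending = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if manifest.get(rel) != digest:
            pending.append(rel)
        updated[rel] = digest
    return pending, updated
```

Hashing content rather than checking timestamps means a file that was merely copied or re-synced is not reprocessed, while an edited file with the same name is.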

Mixed-format folders are the norm for any real project. Rather than spending hours opening files one by one and manually assembling information across formats, you describe what you need and let the agent handle the format differences. The output is unified, source documents stay untouched on your machine, and the time savings compound with every folder you process.
