How to Extract Data from 100 PDFs Without Uploading Them

Bulk PDF extraction doesn't require cloud uploads. Learn how to extract data from hundreds of PDFs locally using an AI agent that reads files on your device.


The problem with bulk PDF extraction

You have 100 PDFs. Maybe they're invoices, contracts, research papers, or compliance documents. You need specific data pulled from each one -- amounts, dates, names, clauses, or tables. Doing it manually takes hours. Doing it with cloud AI means uploading every file to a third-party server.

Neither option is acceptable when the documents contain sensitive information. Client contracts, financial records, medical forms, employee data -- these files shouldn't leave your device just because you need a spreadsheet from them.

The standard workflow forces a false choice: spend hours copying data by hand, or hand your files to a cloud service and hope their privacy policy holds. There's a third option that most people don't consider: extract the data locally, on your own machine, using an AI agent that never uploads the files.

How local PDF extraction works

Local extraction means the AI reads your files directly from your hard drive. No upload, no cloud storage, no API endpoint receiving your documents.

Here's what happens when docrew processes a folder of PDFs:

Step 1: File discovery. You point the agent at a folder. It scans the directory and identifies all PDF files. No file leaves your computer.

Step 2: Text extraction. The agent reads each PDF and extracts the raw text content. For standard PDFs (not scanned images), this is a direct text extraction -- fast and accurate. For scanned documents, the agent applies visual understanding to read the content.

Step 3: Structured analysis. The extracted text is sent to the language model for analysis. The model identifies the data points you specified -- invoice numbers, amounts, dates, vendor names, whatever you need. Only the text reaches the model, never the original file.

Step 4: Output assembly. The agent compiles the extracted data into a structured format -- CSV, JSON, or a formatted table -- and writes it to your local file system.

The critical distinction: your PDF files never leave your device. The language model sees text extracted from the documents, processes it, and returns structured data. The files themselves stay exactly where they are.
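The four steps above amount to a small local loop. Here's a minimal sketch of that shape -- not docrew's implementation. The function names (`discover_pdfs`, `extract_text`, `analyze`, `run_pipeline`) are illustrative, and the extraction and analysis steps are stubbed out; a real pipeline would read the PDF's text layer (or apply visual understanding for scans) and send only that text to the model.

```python
import json
from pathlib import Path

def discover_pdfs(folder: str) -> list[Path]:
    # Step 1: scan the directory tree on the local disk; no upload
    return sorted(Path(folder).rglob("*.pdf"))

def extract_text(pdf_path: Path) -> str:
    # Step 2 (stub): a real pipeline extracts the text layer here,
    # falling back to visual understanding for scanned pages
    return f"<text of {pdf_path.name}>"

def analyze(text: str, fields: list[str]) -> dict:
    # Step 3 (stub): only this extracted text would reach the model,
    # which returns the requested fields as structured data
    return {field: None for field in fields}

def run_pipeline(folder: str, fields: list[str], out_file: str) -> int:
    records = [analyze(extract_text(p), fields) | {"source": p.name}
               for p in discover_pdfs(folder)]
    # Step 4: write structured output back to the local file system
    Path(out_file).write_text(json.dumps(records, indent=2))
    return len(records)
```

Everything in the loop touches only local paths; the single outbound call would sit inside `analyze`, and it carries text, not files.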

Setting up a batch extraction

The practical workflow for extracting data from 100 PDFs with docrew looks like this:

Define your extraction target. Tell the agent what you need. Be specific: "Extract the invoice number, date, vendor name, total amount, and line items from each PDF in the invoices folder." The more precise your instruction, the more consistent the output.

Point to the folder. Specify the directory containing your PDFs. The agent handles recursive scanning if you have subfolders.

Let the agent work. docrew processes each file sequentially or in parallel depending on complexity. For straightforward extractions (pulling consistent fields from similar documents), the agent processes files quickly. For complex documents requiring interpretation, it takes more time per file but maintains accuracy.

Review the output. The agent produces a consolidated file with all extracted data. You review it, make corrections if needed, and you're done.

For 100 invoices with standard fields, the entire process takes minutes rather than the hours required for manual extraction. And at no point did a single file leave your machine.
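The sequential-versus-parallel choice mentioned above is a standard batching pattern. As a rough sketch (assuming a per-file extraction callable named `extract_one`, which is hypothetical), threads help when each file's processing waits on I/O or a model call:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

def process_batch(paths: Iterable, extract_one: Callable,
                  max_workers: int = 4) -> list:
    """Run a per-file extraction callable over a batch, locally.

    Threads overlap the waiting (file reads, model calls) across
    files; results come back in input order.
    """
    paths = list(paths)
    if max_workers <= 1 or len(paths) <= 1:
        return [extract_one(p) for p in paths]  # sequential fallback
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_one, paths))  # preserves order
```

For complex documents, dropping `max_workers` to 1 trades speed for predictable, one-at-a-time processing.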

Handling inconsistent document formats

Real-world document collections are messy. Your 100 PDFs probably aren't uniform. Some are generated PDFs with clean text layers. Others are scanned images. Some have two-column layouts. Others have tables that span multiple pages.

This is where AI extraction differs fundamentally from template-based or regex-based approaches. Traditional extraction tools need a template for each document format. If invoice A has the total in the top-right corner and invoice B has it at the bottom, you need two templates. With 100 documents from 30 different vendors, you might need 30 templates.

An AI agent doesn't need templates. It reads each document the way a human would -- understanding the layout, recognizing field labels, interpreting context. "Total Amount" and "Amount Due" and "Grand Total" all mean the same thing to the model. A date formatted as "03/15/2026" and "March 15, 2026" and "15-Mar-26" all get normalized correctly.

docrew handles format inconsistency by analyzing each document individually. The agent reads the full content, identifies the requested data points based on meaning rather than position, and extracts them consistently regardless of the source format.

When extraction gets complex

Simple field extraction -- names, dates, amounts -- works reliably across most document types. But some extraction tasks require more intelligence:

Multi-page tables. A table that starts on page 3 and continues on page 5 (with a header repeated on page 5). The agent needs to recognize the continuation and merge the data correctly.

Nested data. An invoice with line items, where each line item has sub-items. The output needs to preserve the hierarchy.

Cross-reference data. A contract that references terms defined in an appendix. The agent needs to pull from both sections to give complete answers.

Conditional fields. A form where certain fields only appear if a checkbox is marked. The agent needs to handle missing fields gracefully.

docrew's agent handles these cases by reading the entire document before extracting data. It builds an understanding of the document's structure, then extracts the requested information with that context. This is different from page-by-page extraction tools that lose context at page boundaries.
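The multi-page table case can be made concrete. A minimal sketch of the merge step, under two stated assumptions -- each page fragment arrives as a list of row tuples, and a continuation page repeats the header row verbatim:

```python
def merge_table_pages(pages: list[list[tuple]]) -> list[tuple]:
    """Merge one logical table split across pages.

    Keeps a single header row and drops the verbatim header
    repeated at the top of each continuation page.
    """
    header = pages[0][0]
    merged = [header]
    for page in pages:
        merged.extend(row for row in page if row != header)
    return merged
```

Real documents are messier than this (headers reflow, column widths shift), which is exactly where recognizing the continuation by meaning, rather than by exact match, earns its keep.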

Output formats and downstream use

Extracted data is only useful if it's in a format your downstream tools can consume.

docrew can output extracted data as:

CSV files for spreadsheet import. Each row represents a document, each column a field. This is the most common format for bulk invoice or receipt processing.

JSON for programmatic use. If you're feeding the extracted data into another system, JSON preserves structure and nesting better than flat CSV.

Formatted tables in markdown or plain text for quick review. Useful when you need to scan results before committing to a structured format.

Individual reports per document. Instead of a consolidated file, the agent can write a summary file alongside each source PDF. Useful for contract review where each document needs its own analysis.

The agent writes these files directly to your file system. No download step, no export button, no cloud intermediary. The output appears in a folder on your machine.
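The CSV-versus-JSON trade-off above is easy to show in code. A sketch of a local output writer (the filenames and record shape are assumptions for illustration): JSON keeps nested line items intact, while the CSV view flattens to one row per document and drops nested values:

```python
import csv
import json
from pathlib import Path

def write_outputs(records: list[dict], out_dir: str) -> None:
    """Write one batch of extracted records as both JSON and CSV."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # JSON preserves nested structure (line items, sub-items)
    (out / "extracted.json").write_text(json.dumps(records, indent=2))
    # CSV flattens to one row per document; nested values are omitted
    flat = [{k: v for k, v in r.items() if not isinstance(v, (list, dict))}
            for r in records]
    fields = sorted({k for r in flat for k in r})
    with open(out / "extracted.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(flat)
```

If downstream tools need the line items in tabular form, a second CSV with one row per line item (keyed back to the source document) is the usual companion.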

Accuracy and verification

No extraction system is 100% accurate. The question is how you verify and correct.

With cloud extraction services, verification means downloading the results and comparing them against the source documents -- which you also need to download if they're stored in the cloud. It's a round-trip that adds friction.

With local extraction, both the source files and the output are on your machine. You can open a PDF and the corresponding row in the spreadsheet side by side. Corrections are immediate.

docrew improves accuracy through several mechanisms:

Confidence flagging. When the agent isn't sure about an extracted value, it flags it. You can filter the output for flagged entries and review only those.

Contextual extraction. The agent uses surrounding text to validate extracted values. If an invoice total doesn't match the sum of line items, the agent notes the discrepancy.

Consistent prompting. The agent uses the same extraction logic across all documents in a batch, reducing inconsistency between files.
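A contextual check like the line-item one can be sketched as a plain post-processing pass over each extracted record. This is an illustration, not docrew's mechanism; the field names (`total`, `line_items`, `invoice_number`, `date`, `vendor`) are assumed for the example:

```python
def flag_discrepancies(record: dict, tolerance: float = 0.01) -> list[str]:
    """Cross-check an extracted invoice record; return review flags."""
    flags = []
    items = record.get("line_items") or []
    total = record.get("total")
    if items and total is not None:
        computed = round(sum(i.get("amount", 0.0) for i in items), 2)
        if abs(computed - total) > tolerance:
            flags.append(f"total {total} != line-item sum {computed}")
    # Missing required fields also get surfaced for human review
    for field in ("invoice_number", "date", "vendor"):
        if record.get(field) is None:
            flags.append(f"missing {field}")
    return flags
```

Filtering the batch output to records with non-empty flags gives you the short review list, instead of re-checking all 100 documents.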

For most standard extraction tasks -- invoices, receipts, form data -- accuracy exceeds 95% on the first pass. Complex documents (handwritten notes, degraded scans, unusual layouts) may require more review.

Privacy implications of local extraction

The privacy benefit of local extraction isn't just theoretical. It has concrete operational implications:

No DPA required. When no files leave your device, you don't need a Data Processing Agreement with the extraction service for the file content.

No data retention risk. Cloud services retain uploaded files according to their policies. Local extraction means your files exist only on your device, under your retention policies.

No cross-border data transfer. If you're in the EU processing documents containing personal data, local extraction means no data leaves your jurisdiction. The language model processes text transiently -- the documents themselves never cross a border.

Audit simplicity. When regulators ask how you process documents, "locally on employee devices, files never uploaded" is a much simpler answer than explaining a cloud processing pipeline with sub-processors.

For regulated industries -- legal, healthcare, finance -- local extraction isn't just convenient. It's a compliance advantage that eliminates entire categories of risk.

Getting started

If you're currently processing PDFs by uploading them to cloud tools or copying data manually, the switch to local extraction is straightforward:

  1. Install docrew on your desktop.
  2. Point the agent at your folder of PDFs.
  3. Describe what data you need extracted.
  4. Let the agent process the files locally.
  5. Review the output on your machine.

No cloud account setup, no API key management, no file upload permissions. Your documents stay on your computer, and the extracted data appears right next to them.

The time savings scale with volume: extracting data from 10 PDFs saves an hour; from 100, a day; from 1,000, a week. And at every scale, the files never leave your device.