
Real-Time Document Ingestion: From Receipt to Database

Documents arrive continuously -- receipts, invoices, forms. Learn how to build a real-time document ingestion pipeline that extracts and structures data as files land.


Documents don't wait for batch day

A receipt is photographed at a restaurant. An invoice arrives by email. A signed contract is scanned at the front desk. A compliance form is submitted through a portal.

These documents arrive throughout the day, every day. Processing them in a weekly or monthly batch means they sit untouched for days or weeks -- delaying expense reports, slowing accounts payable, creating compliance gaps, and forcing people to remember context that fades with time.

Real-time document ingestion processes documents as they arrive. A receipt is extracted within minutes of being photographed. An invoice is parsed as soon as it hits the inbox. A contract's key terms are in your database before the ink is dry.

The difference between batch and real-time isn't just speed. It's the ability to act on document data immediately rather than waiting for the next processing cycle.

What a document ingestion pipeline looks like

A real-time document ingestion pipeline has five stages:

1. Capture. The document enters the system. This could be a file dropped into a watched folder, an email attachment, a mobile photograph, or a scan from a multifunction printer.

2. Classification. The system identifies what kind of document it is. Invoice, receipt, contract, form, correspondence. This determines what fields to extract.

3. Extraction. The AI reads the document and pulls out the relevant structured data based on the document type. An invoice yields vendor, amount, and line items. A receipt yields merchant, date, and total. A contract yields parties, dates, and key terms.

4. Validation. The extracted data is checked for consistency and completeness. Amounts are verified against line item sums. Required fields are confirmed present. Data types are validated (dates are dates, numbers are numbers).

5. Output. The structured data is written to its destination -- a database, a spreadsheet, an accounting system, or a file system -- ready for downstream use.

In a well-built pipeline, these stages happen automatically. The human touchpoint is reviewing exceptions, not processing the happy path.
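The five stages chain together naturally. Here is a minimal sketch in Python; the function and field names are illustrative, not docrew's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """A document moving through the pipeline (illustrative model)."""
    path: str
    doc_type: str = ""                           # set by classification
    fields: dict = field(default_factory=dict)   # set by extraction
    valid: bool = False                          # set by validation

def classify(doc: Document) -> str:
    # Stage 2: identify the document type from its content.
    # A trivial stand-in; a real classifier reads the document itself.
    return "receipt" if "receipt" in doc.path else "invoice"

def extract(doc: Document) -> dict:
    # Stage 3: pull structured fields appropriate to the type.
    if doc.doc_type == "receipt":
        return {"merchant": "Marco's Bistro", "total": 47.83}
    return {}

def validate(doc: Document) -> bool:
    # Stage 4: required fields present, types sane.
    return bool(doc.fields) and isinstance(doc.fields.get("total"), (int, float))

def ingest(path: str) -> Document:
    doc = Document(path=path)      # 1. capture
    doc.doc_type = classify(doc)   # 2. classification
    doc.fields = extract(doc)      # 3. extraction
    doc.valid = validate(doc)      # 4. validation
    return doc                     # 5. output happens downstream
```

Each stage takes the output of the previous one, which is what makes the pipeline easy to monitor: a document that stalls can be traced to exactly one stage.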

Building real-time ingestion with docrew

docrew enables real-time document ingestion through its local agent architecture. Here's how each stage works:

Capture: watched folders. Set up a folder on your computer where incoming documents land. Email attachments get saved here. Mobile photos sync here. Scans are directed here.
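Under the hood, a watched folder can be as simple as a polling loop over a directory. A stdlib-only sketch (docrew's real watcher is not shown here; `watch_folder` and its parameters are hypothetical):

```python
import time
from pathlib import Path

def watch_folder(inbox: Path, handler, poll_seconds: float = 2.0, cycles: int = 1):
    """Poll a folder and hand each new file to `handler` exactly once.

    `cycles` bounds the loop for demonstration; a production watcher
    would loop indefinitely (or use OS-level file events instead).
    """
    seen = set()
    for _ in range(cycles):
        for path in sorted(inbox.iterdir()):
            if path.is_file() and path not in seen:
                seen.add(path)
                handler(path)          # kick off the pipeline for this file
        time.sleep(poll_seconds)
```

Event-based watching (inotify on Linux, FSEvents on macOS) reacts faster than polling, but a short polling interval is simpler and adequate at typical document volumes.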

Classification: agent intelligence. When a new file appears, the agent reads it and identifies the document type. It doesn't need pre-configured categories -- it understands the document's purpose from its content. An invoice is recognized as an invoice whether it says "Invoice" at the top or not.

Extraction: schema-based. For each document type, you define the extraction schema once. Invoices get invoice fields. Receipts get receipt fields. Contracts get contract fields. The agent applies the appropriate schema based on classification.

Validation: built-in checks. The agent validates extracted data as part of extraction. Math checks, format validation, required field verification, and duplicate detection all happen automatically.
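The validation checks for a receipt can be sketched as a function that returns a list of specific issues rather than a pass/fail flag, so the exception log stays actionable. Field names here are illustrative, not docrew's schema:

```python
def validate_receipt(fields: dict, seen: set) -> list:
    """Return a list of validation issues; an empty list means the receipt passes."""
    issues = []
    required = ("merchant", "date", "subtotal", "tax", "tip", "total")
    missing = [f for f in required if f not in fields]
    if missing:
        issues.append(f"missing fields: {missing}")
    else:
        # Math check: subtotal + tax + tip must equal the total (to the cent).
        expected = round(fields["subtotal"] + fields["tax"] + fields["tip"], 2)
        if expected != round(fields["total"], 2):
            issues.append(f"total {fields['total']} != computed {expected}")
    # Duplicate detection on a fingerprint of merchant + date + total.
    fingerprint = (fields.get("merchant"), fields.get("date"), fields.get("total"))
    if fingerprint in seen:
        issues.append("possible duplicate")
    else:
        seen.add(fingerprint)
    return issues
```

Returning the specific failure, not just "invalid", is what lets a reviewer resolve an exception in seconds instead of re-reading the document from scratch.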

Output: local files or exports. Extracted data is written to your local file system in your preferred format. A running spreadsheet of all processed invoices. A JSON file per contract. A receipt log with daily entries.

The entire pipeline runs on your device. Documents are processed locally. No cloud service receives your receipts, invoices, or contracts.

The receipt-to-database workflow

Let's trace a single document through the pipeline:

12:30 PM. A sales rep photographs a lunch receipt on their phone. The image syncs to the expense folder on their laptop.

12:31 PM. docrew detects the new file. The agent reads the image: it's a restaurant receipt from Marco's Bistro. Date: March 8, 2026. Total: $47.83. Tax: $3.52. Tip: $9.00.

12:31 PM. The agent validates the data: total minus tax minus tip equals the subtotal. The math checks out. The date is today. The merchant name is readable.

12:31 PM. The agent writes a new row to the expense tracking spreadsheet: date, merchant, category (meals/entertainment), subtotal, tax, tip, total, file reference. The receipt image stays in the folder for audit purposes.

12:32 PM. The expense report is updated. The receipt is processed. Total elapsed time from photograph to structured data: under two minutes.

Compare this to the traditional workflow: photograph the receipt, throw it in a pile, spend an hour at the end of the month entering all receipts into an expense report, reconcile mismatches, resubmit corrections. The real-time pipeline eliminates the pile, the hour, and the corrections.
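The append step at 12:31 PM, writing one row per receipt to a running sheet, can be sketched with the stdlib `csv` module. The column layout is illustrative:

```python
import csv
from pathlib import Path

def append_expense_row(sheet: Path, row: dict) -> None:
    """Append one processed receipt to a running CSV expense sheet,
    writing the header only when the file is first created."""
    columns = ["date", "merchant", "category", "subtotal",
               "tax", "tip", "total", "file"]
    is_new = not sheet.exists()
    with sheet.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Appending rather than rewriting keeps the operation cheap and makes the sheet safe to open read-only at any moment for a current view of spending.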

Processing incoming invoices

For accounts payable, real-time ingestion means invoices enter the payment pipeline immediately:

Invoice arrives by email. The attachment is saved to the AP folder.

Agent processes the invoice. Vendor name, invoice number, date, due date, PO number, line items, amounts, payment terms, and bank details are extracted.

Validation runs. Invoice total matches line item sum. Due date is in the future. Invoice number doesn't match any previously processed invoice (no duplicates).

Data is added to the AP ledger. The extracted data is appended to the accounts payable tracking spreadsheet or exported in a format compatible with your accounting system.

Exception handling. If validation fails (math doesn't add up, duplicate invoice number, missing PO reference), the invoice is flagged for review. The flag includes the specific issue, not just a generic error.

The AP team reviews flagged invoices and approves clean ones for payment. The data entry step -- historically the bottleneck -- is eliminated.
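The invoice checks described above, with their specific rather than generic flags, can be sketched as follows. Field names are illustrative, not docrew's schema:

```python
from datetime import date

def check_invoice(inv: dict, processed_numbers: set, today: date) -> list:
    """Return specific flags for an invoice; empty means clean."""
    flags = []
    # Invoice total must match the sum of line item amounts.
    line_sum = round(sum(item["amount"] for item in inv.get("line_items", [])), 2)
    if line_sum != round(inv.get("total", 0.0), 2):
        flags.append(f"total {inv.get('total')} does not match line item sum {line_sum}")
    # Due date should be in the future.
    if inv.get("due_date") and inv["due_date"] < today:
        flags.append(f"due date {inv['due_date']} is in the past")
    # No duplicate invoice numbers.
    if inv.get("invoice_number") in processed_numbers:
        flags.append(f"duplicate invoice number {inv['invoice_number']}")
    # A PO reference is required.
    if not inv.get("po_number"):
        flags.append("missing PO reference")
    return flags
```

An invoice with an empty flag list proceeds to the AP ledger; anything else goes to the review queue with its flags attached.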

Multi-document type ingestion

A real pipeline doesn't process just one type of document. An office receives invoices, receipts, contracts, forms, and correspondence. The ingestion pipeline needs to handle all of them.

docrew handles multi-type ingestion through classification:

  1. New file detected. The agent reads the document.
  2. Classification. The agent identifies it as an invoice, receipt, contract, form, letter, or other type.
  3. Schema selection. Based on the classification, the agent applies the appropriate extraction schema.
  4. Extraction and output. Data is extracted and written to the correct destination.

An invoice goes to the AP spreadsheet. A receipt goes to the expense tracker. A contract goes to the contract database. A form goes to the compliance log. Each document type has its own schema and its own output destination, but they all flow through the same pipeline.

This eliminates the manual sorting step. You don't need to organize incoming documents by type before processing. The agent handles classification and routing automatically.
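Conceptually, the routing is a lookup table from document type to schema and destination. A hypothetical sketch (the field lists and file paths are invented for illustration):

```python
# Hypothetical routing table: each document type maps to an extraction
# schema (list of fields) and an output destination.
ROUTES = {
    "invoice":  {"fields": ["vendor", "invoice_number", "total"], "dest": "ap_ledger.csv"},
    "receipt":  {"fields": ["merchant", "date", "total"],         "dest": "expenses.csv"},
    "contract": {"fields": ["parties", "start_date", "term"],     "dest": "contracts/"},
    "form":     {"fields": ["form_id", "submitted_by"],           "dest": "compliance_log.csv"},
}

def route(doc_type: str) -> dict:
    # Unknown types fall through to manual review rather than a guess.
    return ROUTES.get(doc_type, {"fields": [], "dest": "review/"})
```

Adding a new document type to the pipeline is then one new entry in the table, not a new pipeline.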

Handling volume spikes

Real-time ingestion needs to handle uneven volume. Most days, a few documents arrive. At month-end, quarter-end, or during audit season, dozens or hundreds might land in a single day.

docrew handles volume spikes by processing documents sequentially as they arrive. The agent works through the queue at a consistent pace. During low-volume periods, documents are processed within minutes. During spikes, processing takes longer per document because the queue is deeper, but no documents are lost or skipped.

For most small and mid-size operations, the agent's processing speed keeps up with document arrival rates. A document takes 10-30 seconds to process. Even during a spike of 100 documents, the entire queue clears within an hour.

For organizations needing higher throughput, the agent can prioritize by document type (invoices before receipts, contracts before correspondence) to ensure time-sensitive documents are processed first.
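Type-based prioritization amounts to a priority queue keyed on document type, FIFO within a type. A sketch using the stdlib `heapq` module; the priority ordering is illustrative:

```python
import heapq
import itertools

# Lower number = processed first (ordering is illustrative).
PRIORITY = {"invoice": 0, "contract": 1, "receipt": 2, "correspondence": 3}

class DocumentQueue:
    """Pop time-sensitive document types first; FIFO within a type."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # preserves arrival order on ties

    def push(self, doc_type: str, path: str) -> None:
        priority = PRIORITY.get(doc_type, 99)  # unknown types go last
        heapq.heappush(self._heap, (priority, next(self._counter), path))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]
```

During a month-end spike, this means the hundred receipts in the queue never delay the invoice that arrived after them.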

Error recovery and audit trails

A production ingestion pipeline needs robust error handling:

File read failures. Corrupted files, unsupported formats, password-protected documents. The agent logs the failure, moves the file to an error folder, and continues processing the queue.

Extraction failures. Documents that can't be classified or where required fields aren't found. The agent creates a partial extraction with available data and flags the document for review.

Validation failures. Data that doesn't pass consistency checks. The extracted data is saved but marked as unvalidated. The specific validation failure is recorded.

Audit trail. Every processed document gets an entry in the processing log: timestamp, file name, document type, extracted fields, validation status, and output destination. This log serves as an audit trail showing when each document was processed and what data was extracted.

docrew writes the audit trail to a local file. No processing data leaves your device. The trail is available for internal audit, compliance review, or debugging.
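An audit trail like this is naturally an append-only log of one JSON object per line. A sketch, with an illustrative entry layout:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_processing(trail: Path, file_name: str, doc_type: str,
                   fields: dict, status: str, destination: str) -> None:
    """Append one audit entry as a JSON line to a local log file."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "file": file_name,
        "type": doc_type,
        "fields": fields,
        "validation": status,
        "output": destination,
    }
    with trail.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```

The JSON-lines format keeps the log appendable without rewriting, and any tool that reads JSON can filter it by date, type, or validation status during an audit.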

From ingestion to action

The value of real-time ingestion isn't just having structured data faster. It's the actions that faster data enables:

Faster payments. Invoices processed on arrival can be approved and paid within terms, capturing early payment discounts. Invoices processed monthly often miss discount windows.

Real-time expense tracking. Managers see current spending as receipts are processed, not two weeks later when expense reports are submitted.

Immediate compliance. Compliance documents are logged in real time. Gaps are identified immediately, not during the next audit.

Current contract data. New contracts are indexed as they're signed. The contract database is always current, not weeks behind.

Faster decision-making. When document data is available in real time, decisions based on that data can be made faster. A procurement decision based on current vendor pricing (extracted from recent quotes) is better than one based on last month's data.

Getting started with real-time ingestion

Setting up a real-time document ingestion pipeline with docrew:

  1. Create your input folder. This is where incoming documents will land. Set up email rules, scan destinations, or file sync to route documents here.

  2. Define your document types and schemas. What types of documents do you receive? What fields do you need from each type? Write a clear extraction instruction for each type.

  3. Set up output destinations. Where should the extracted data go? A spreadsheet per document type? A single database? Individual report files?

  4. Start the agent. Point docrew at the input folder. The agent processes new files as they arrive.

  5. Monitor and refine. Review the exception log for the first week. Refine your extraction schemas based on what the agent flags. After the initial tuning, the pipeline runs with minimal intervention.

The result is a document processing pipeline that runs on your device, processes documents in real time, and produces structured data without any file ever leaving your computer. From receipt to database in minutes, not weeks.
