Extracting Tables from PDFs: Why It's Still Hard and How AI Solves It
PDF tables are notoriously difficult to extract accurately. Learn why traditional tools fail and how AI-based extraction handles merged cells, spanning rows, and inconsistent layouts.
Why PDF tables are a nightmare
PDFs were designed for display, not data exchange. When a table appears in a PDF, it looks structured to human eyes. Rows, columns, headers, values -- everything is visually organized.
But underneath, a PDF doesn't know it contains a table. There's no table element, no row element, no cell element. A PDF stores text as positioned characters on a canvas. The "table" is just text placed at specific coordinates, with lines drawn between them. Sometimes without the lines.
This means extracting a table from a PDF requires reconstructing structure from visual layout. You have to figure out which text belongs to which column, where rows start and end, which cells span multiple columns, and how headers relate to data.
For simple tables -- uniform grids with clear borders -- this works reasonably well with traditional tools. For everything else, it falls apart.
Where traditional extraction fails
The standard approach to PDF table extraction uses geometric analysis: find the horizontal and vertical lines, identify the grid they form, and assign text to cells based on position.
This breaks in predictable ways:
Borderless tables. Many PDFs use spacing rather than lines to delineate columns. Without borders, geometric analysis has nothing to detect. The tool sees a block of text, not a table.
Merged cells. A header that spans three columns doesn't fit the grid model. Traditional tools either duplicate the header across all three columns, assign it to the first one, or crash.
Multi-line cells. When a cell contains text that wraps to two or three lines, simple extraction tools treat each line as a separate row. A five-row table with wrapped text might extract as twelve rows.
Spanning rows. A subtotal row that spans the description and amount columns. A notes field that spans the entire width below the table. These break the assumption of uniform grid structure.
Inconsistent column alignment. Financial tables where amounts right-align and descriptions left-align within the same column. Numbers in one row might be offset from numbers in the row above, so position-based tools assign them to the wrong column.
Multi-page tables. A table that starts on page 2 and continues onto page 3, with repeated headers. Traditional tools extract two separate tables with no indication they're related.
Nested tables. A table within a table -- common in forms and complex reports. Geometric analysis produces garbled output.
Libraries like Tabula, Camelot, and pdfplumber handle simple cases well. But the moment a table deviates from a clean grid, accuracy drops sharply. And real-world tables deviate constantly.
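The multi-line-cell failure is easy to reproduce. Below is a minimal sketch of the geometric approach: assign positioned words to a grid purely by coordinates. The word list simulates what a PDF parser like pdfplumber reports for a borderless invoice table (the coordinates and column edges are hypothetical); the description "Annual support contract" wraps onto a second baseline.

```python
from collections import defaultdict

words = [  # (text, x0, top) -- hypothetical positions from a PDF parser
    ("Description", 50, 100), ("Qty", 300, 100), ("Amount", 400, 100),
    ("Annual support", 50, 120), ("1", 300, 120), ("1200.00", 400, 120),
    ("contract", 50, 134),  # wrapped second line of the cell above
    ("USB cable", 50, 150), ("3", 300, 150), ("27.00", 400, 150),
]

COLUMN_EDGES = [0, 280, 380, 500]  # x boundaries inferred from layout

def column_of(x):
    for i in range(len(COLUMN_EDGES) - 1):
        if COLUMN_EDGES[i] <= x < COLUMN_EDGES[i + 1]:
            return i
    return None

# Group words into rows by vertical position (2pt buckets), then columns.
rows = defaultdict(dict)
for text, x0, top in words:
    rows[round(top / 2) * 2][column_of(x0)] = text

table = [[cells.get(c, "") for c in range(3)]
         for _, cells in sorted(rows.items())]
```

The wrapped word "contract" lands at its own vertical position, so the three-row table comes out as four rows, one of them with empty quantity and amount cells. This is exactly the "five rows extract as twelve" failure described above.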
How AI changes the equation
AI-based extraction doesn't rely on geometric analysis. It reads the table the way a human reads it -- understanding the visual structure, recognizing headers, and interpreting the relationship between labels and values.
When docrew's agent encounters a table in a PDF, the process is fundamentally different from traditional extraction:
Visual understanding. The agent processes the document's visual layout, recognizing table structures by their appearance rather than by detecting lines. A borderless table with aligned columns is recognized as a table because it looks like one.
Semantic parsing. The agent understands what the table contains. "Revenue" in a header means the column below contains revenue figures. "Q1 2026" means the values are for the first quarter. This semantic understanding enables accurate extraction even when the visual structure is ambiguous.
Context awareness. If a cell is empty, the agent can infer whether it should be null, zero, or the same value as the cell above (a common pattern in grouped tables). It uses the surrounding context to make intelligent decisions.
Structure reconstruction. For multi-page tables, the agent recognizes repeated headers and merges the data into a single table. For nested tables, it preserves the hierarchy. For merged cells, it correctly assigns span information.
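One of the inference rules above is easy to make concrete: in grouped tables, an empty cell in a category column usually means "same as the row above" (a forward-fill), while an empty numeric cell usually means null. A minimal sketch of that forward-fill, with illustrative data:

```python
def forward_fill_column(rows, col):
    """Fill blanks in one column with the last non-empty value above."""
    last = ""
    for row in rows:
        if row[col].strip():
            last = row[col]
        else:
            row[col] = last
    return rows

rows = [
    ["Hardware", "Laptop",  "1200"],
    ["",         "Monitor", "300"],   # blank category: inherit "Hardware"
    ["Software", "License", "99"],
]
forward_fill_column(rows, 0)
```

The point is not the code itself but the decision behind it: whether a blank means "repeat", "zero", or "unknown" depends on what the column represents, which is exactly the context a purely geometric extractor doesn't have.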
The result is extraction that works on tables traditional tools can't handle -- and that works more reliably on simple tables too.
Practical extraction with docrew
Here's how table extraction works in practice with docrew:
Scenario: Financial report with quarterly results.
The PDF contains a 30-row table with columns for line items, Q1 through Q4, and a full-year total. Some rows are subtotals with bold text. The table spans two pages. Several cells contain footnote markers.
With traditional extraction, you'd get two separate tables (one per page), lost formatting (subtotals indistinguishable from regular rows), and footnote markers mixed into the numbers.
With docrew, you tell the agent: "Extract the financial results table into a CSV. Preserve the hierarchy between line items and subtotals. Merge the data across pages. Remove footnote markers from numeric values."
The agent reads both pages, recognizes the table continuation, identifies subtotal rows by context, strips footnote markers, and produces a clean CSV with proper hierarchy.
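To make the cleanup steps concrete, here is a hedged sketch of two of them applied to extracted rows: merging two page fragments (dropping the repeated header) and stripping footnote markers like "(1)" or "*" from numeric cells. The rows and values are illustrative, not from any real report.

```python
import re

def merge_pages(page1, page2):
    """Concatenate two page fragments, dropping a repeated header row."""
    header = page1[0]
    body2 = page2[1:] if page2 and page2[0] == header else page2
    return page1 + body2

# A footnote marker is "(digits)" or asterisks directly after a digit.
FOOTNOTE = re.compile(r"(?<=\d)\s*(\(\d+\)|\*+)$")

def clean_number(cell):
    return FOOTNOTE.sub("", cell)

page1 = [["Line item", "Q1", "Q2"], ["Revenue", "410.2(1)", "455.0"]]
page2 = [["Line item", "Q1", "Q2"], ["COGS", "120.5*", "131.9"]]

table = merge_pages(page1, page2)
table = [[clean_number(c) for c in row] for row in table]
```

The lookbehind keeps the regex from touching text cells or plain headers like "Q1"; only values that end in a digit followed by a marker are cleaned.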
Scenario: Invoice with line items.
A vendor invoice has a line items table with columns for description, quantity, unit price, and amount. Some descriptions wrap to two lines. The table has no visible borders. Below the table, there's a subtotal, tax, and total.
Traditional tools might split wrapped descriptions into separate rows and miss the subtotal section entirely.
docrew's agent recognizes the wrapped text as a single cell, correctly associates quantities with multi-line descriptions, and extracts the summary fields below the table as separate data points.
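The wrapped-description repair can be sketched as a post-processing rule: a row whose only content is in the description column is a continuation of the previous row, so its text gets appended rather than kept as a new line item. The invoice rows below are illustrative.

```python
def merge_wrapped_rows(rows):
    """Fold continuation rows (text only in column 0) into the row above."""
    merged = []
    for row in rows:
        is_continuation = (merged and row[0].strip()
                           and not any(c.strip() for c in row[1:]))
        if is_continuation:
            merged[-1][0] += " " + row[0].strip()
        else:
            merged.append(list(row))
    return merged

raw = [
    ["Description", "Qty", "Unit price", "Amount"],
    ["Annual support", "1", "1200.00", "1200.00"],
    ["contract", "", "", ""],  # wrapped line, not a real row
    ["USB cable", "3", "9.00", "27.00"],
]
items = merge_wrapped_rows(raw)
```

This heuristic works on clean cases; the advantage of visual understanding is that it also handles the cases the heuristic gets wrong, such as a genuine row whose quantity cell happens to be blank.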
Scenario: Survey results in a complex layout.
A research PDF contains a table where some cells contain bar charts (rendered as images), some contain percentage values, and rows are grouped by category with merged header cells.
Traditional extraction produces unusable output. docrew reads the textual content of each cell, correctly handles the merged category headers, and produces a structured output with category groupings preserved.
Output formats for extracted tables
Once a table is extracted, you need it in a usable format. docrew supports several output modes:
CSV. One file per table. Good for spreadsheet import, database loading, or further processing. Flat structure -- merged cells are expanded, hierarchy is represented with indentation or a level column.
JSON. Preserves nested structure. Good for programmatic processing. Each row is an object with column names as keys. Grouped or hierarchical tables maintain their nesting.
Markdown tables. Good for documentation or quick review. Renders cleanly in any markdown viewer. Limited to simple structures -- complex tables may need simplification.
Excel-ready format. The agent can write directly to a format that preserves column types (numbers, dates, text) for clean Excel import without manual type conversion.
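To illustrate the difference between the flat and nested modes, here is a sketch of the same extracted rows written two ways: as CSV with an explicit level column (the flat representation described above) and as JSON that nests level-1 rows under their level-0 parent. The line items and amounts are hypothetical.

```python
import csv
import io
import json

rows = [  # (level, line_item, amount) from a hypothetical extraction
    (0, "Revenue", 865.2),
    (1, "Product", 610.0),
    (1, "Services", 255.2),
]

# CSV: flat structure, hierarchy carried by a "level" column.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["level", "line_item", "amount"])
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON: nest each level-1 row under the preceding level-0 row.
tree = []
for level, item, amount in rows:
    node = {"line_item": item, "amount": amount, "children": []}
    (tree if level == 0 else tree[-1]["children"]).append(node)
json_text = json.dumps(tree, indent=2)
```

CSV is the better target when the next stop is a spreadsheet or database load; JSON is better when downstream code needs to walk the hierarchy without reconstructing it from a level column.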
Accuracy and edge cases
AI-based table extraction is significantly more accurate than traditional methods, but it's not perfect. Knowing where it struggles helps you plan:
Very wide tables. Tables with 20+ columns sometimes have column alignment issues. Breaking them into logical groups before extraction improves results.
Tables embedded in dense text. When a table is surrounded by paragraphs with similar formatting, the agent occasionally includes surrounding text in the extraction. Explicit instructions ("extract only the table titled...") help.
Handwritten table entries. Scanned documents with handwritten values in printed table cells are harder. Accuracy depends on handwriting legibility.
Rotated or skewed tables. Tables in landscape orientation within a portrait page, or tables in scanned documents with slight rotation, are harder to read. The agent handles moderate skew, but extreme rotation can cause errors.
For most business documents -- financial tables, invoice line items, data reports -- AI extraction accuracy exceeds what you'd get from manual transcription. Humans make typos; the AI might misinterpret a layout but doesn't make transcription errors.
From extraction to analysis
Extracting tables is rarely the end goal. You extract tables to analyze the data.
docrew enables a workflow where extraction and analysis happen in the same session. Extract the quarterly results table, then immediately ask the agent to calculate year-over-year growth rates. Extract invoice line items from 50 invoices, then ask for a vendor spending analysis.
This is where the agent model shows its advantage over standalone extraction tools. A tool like Tabula gives you a CSV. You then import the CSV into Excel, write formulas, build charts. With docrew, the agent that extracted the data can also analyze it, compare it across documents, and produce the report you actually need.
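The follow-on analysis step is ordinary arithmetic once the table is structured. A minimal sketch of the year-over-year calculation mentioned above, using illustrative quarterly revenue figures (not from any real report):

```python
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = {  # hypothetical extracted values, in millions
    "2025": [100.0, 110.0, 120.0, 130.0],
    "2026": [112.0, 118.0, 138.0, 149.5],
}

# Year-over-year growth per quarter, as a percentage.
yoy = {q: round((cur - prev) / prev * 100, 1)
       for q, prev, cur in zip(quarters, revenue["2025"], revenue["2026"])}
# e.g. yoy["Q1"] -> 12.0
```

The arithmetic is trivial; the value of doing it in the same session is that the numbers feeding it came straight from the extraction, with no manual copy step in between.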
The entire workflow -- from PDF to analysis to report -- happens on your device. The tables are extracted locally. The analysis happens locally. The output is written locally. At no point do your financial reports or vendor invoices touch a cloud server.
Getting started with table extraction
If you're currently struggling with PDF tables -- copying values by hand, fighting with Tabula output, or paying for cloud extraction APIs -- try local AI extraction:
- Open docrew and point it at your PDF.
- Tell the agent which table you need extracted and in what format.
- For batch processing, point it at a folder of PDFs and describe the table structure you expect.
- Review the output, flag any issues, and let the agent refine its extraction.
The difference between fighting with PDF tables and having them cleanly extracted is one conversation with an AI agent. And unlike cloud extraction services, your documents never leave your machine.