How to Set Up Cross-Document Search with AI
Stop relying on keyword grep to find what matters. Learn how to use AI-powered cross-document search to locate every mention of a topic, clause, or entity across dozens of files -- including synonyms, paraphrases, and indirect references.
The problem with searching across documents
You have a folder of 30 documents. Contracts, memos, policy updates, vendor agreements. Somewhere in those files is every reference to indemnification -- but the word "indemnification" only appears in half of them. The rest use "hold harmless," "liability limitation," "damages cap," or describe the concept without naming it at all. One document buries the relevant clause inside a broader section on risk allocation without ever using the word you searched for.
Traditional search -- whether built into your OS, your document management system, or a simple grep -- generally matches the literal characters you typed. If your search term is "indemnification," it misses "indemnify," "hold harmless," "limitation of liability," and every sentence that describes the concept in plain language without using the legal term.
This gap between what you search for and what you need to find is the fundamental limitation of keyword-based document search.
Why keyword search fails on real document sets
Consider a legal operations team auditing 30 vendor agreements for indemnification exposure. A keyword search for "indemnif" (truncated to catch variations) returns hits in 18 of the 30 documents. The rest never surface, and some of the hits are misleading. The search breaks down in four ways:
Synonyms. Four documents use "hold harmless" instead of "indemnify." Same legal concept, different phrasing. A keyword search for one does not find the other unless you run a second search.
Paraphrases. Three documents describe the obligation in plain English: "Vendor shall be responsible for all costs arising from..." This is an indemnification clause in substance, but no keyword search finds it because it does not use any of the expected terms.
Indirect references. Two documents reference indemnification by section number in another agreement: "Subject to Section 8.3 of the Master Agreement." Keyword search within a single file cannot follow that trail.
Negations and exclusions. Three documents explicitly exclude indemnification: "Notwithstanding the foregoing, neither party shall be required to indemnify..." Because these clauses contain the search term, a keyword search returns them alongside actual indemnification obligations. The searcher must manually distinguish between clauses that grant indemnification and clauses that deny it.
Running multiple keyword searches -- indemnify, hold harmless, liability, damages -- produces a flood of results, most irrelevant, while still missing the paraphrased and indirect references.
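These failure modes are easy to reproduce. The following is a minimal sketch in Python that uses plain case-insensitive substring matching as a stand-in for grep-style search. The four one-line "documents", their file names, and the helper name keyword_hits are illustrative assumptions, not real agreements or part of any tool.

```python
# Hypothetical sample clauses; file names and wording are illustrative only.
DOCS = {
    "a.txt": "Vendor shall indemnify Client against third-party claims.",
    "b.txt": "Vendor shall hold harmless Client from all claims.",
    "c.txt": "Vendor shall be responsible for all costs arising from its negligence.",
    "d.txt": "Neither party shall be required to indemnify the other.",
}


def keyword_hits(term: str, docs: dict[str, str]) -> list[str]:
    """Return the names of documents containing `term` (case-insensitive substring)."""
    needle = term.lower()
    return sorted(name for name, text in docs.items() if needle in text.lower())


hits = keyword_hits("indemnif", DOCS)
missed = sorted(set(DOCS) - set(hits))

print("hits:  ", hits)    # a.txt (an obligation) and d.txt (an exclusion)
print("missed:", missed)  # b.txt (synonym) and c.txt (paraphrase)
```

The search cannot tell the obligation in a.txt from the exclusion in d.txt, and it never sees the synonym or the paraphrase. Adding "hold harmless" as a second term recovers b.txt, but no list of terms recovers c.txt.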
How AI search changes the approach
AI-powered document search operates on meaning, not string matching. When you ask the agent to find references to indemnification, it understands the concept and locates every passage that relates to it, regardless of the specific words used.
This is not a fancier keyword search with a thesaurus bolted on. The agent reads each document, comprehends the content, and evaluates whether a given passage relates to your search intent. It distinguishes between a clause that grants indemnification, one that limits it, and one that merely mentions it in passing. It follows cross-references. It identifies paraphrased obligations. It understands negations.
The result is a set of findings that maps to your actual question, not to a list of string matches.
Setting up cross-document search in docrew
Here is how to run an AI-powered search across a folder of documents using docrew.
Step 1: Organize your documents. Place all the files you want to search into a single folder on your machine. They can be PDFs, DOCX files, spreadsheets, or a mix. The agent reads every file in the folder, so include only the documents relevant to your search scope. If you have subfolders, the agent can traverse those too -- just mention it in your prompt.
Step 2: Describe your search. Tell the agent what you are looking for in plain language. Be specific about both the topic and what kind of information you want returned.
A strong search prompt: "Search all documents in the Vendor Agreements folder for references to indemnification, including any clauses that limit or expand liability. For each match, give me the document name, section or page number, the exact quote, and a brief note on whether the clause grants, limits, or excludes indemnification."
This prompt defines the concept, expands the scope (including liability limitations), specifies the output format (document name, location, quote, classification), and asks for an analytical layer (grant, limit, or exclude). The agent uses all of this to shape its search.
Step 3: Let the agent work. The agent reads every document in the folder, identifies passages that relate to your search concept, and extracts the relevant information. For 30 documents, this typically takes a few minutes depending on document length and complexity.
Step 4: Review the results. The agent returns structured findings. Each result includes the document name, specific section or page, exact text, and the agent's classification. A typical result:
"Meridian Supply Agreement (2024).pdf -- Section 7.2, page 12 -- 'Vendor shall indemnify and hold harmless Client from and against all claims, damages, and expenses arising from Vendor's negligence or willful misconduct.' -- Grants indemnification, limited to negligence and willful misconduct."
Another result might be:
"Apex Logistics Contract.docx -- Section 5, page 8 -- 'Each party shall be solely responsible for any and all costs, damages, or liabilities arising from its own performance under this Agreement.' -- Mutual responsibility clause, functions as reciprocal indemnification without using the term."
That second result would never surface in a keyword search. The agent found it because it understood the legal substance, not because it matched a string.
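If you run the same kind of search often, the prompt structure from Step 2 (concept, expanded scope, output fields, classification) can be assembled from its parts. The sketch below is a minimal illustration in Python. The helper names build_search_prompt and join_words are assumptions made for this example, and the code only builds text to paste into the agent; it does not call docrew.

```python
def join_words(items: list[str], conj: str = "and") -> str:
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 2:
        return f" {conj} ".join(items)
    return ", ".join(items[:-1]) + f", {conj} {items[-1]}"


def build_search_prompt(
    folder: str,
    concept: str,
    related: list[str],
    fields: list[str],
    labels: list[str],
) -> str:
    """Assemble a plain-language search prompt (hypothetical helper)."""
    scope = f"references to {concept}"
    if related:
        # Expand the scope beyond the headline concept.
        scope += ", including " + join_words(related)
    return (
        f"Search all documents in the {folder} folder for {scope}. "
        f"For each match, give me {join_words(fields)}. "
        f"Also note whether the clause {join_words(labels, 'or')} {concept}."
    )


prompt = build_search_prompt(
    folder="Vendor Agreements",
    concept="indemnification",
    related=["any clauses that limit or expand liability"],
    fields=["the document name", "the section or page number", "the exact quote"],
    labels=["grants", "limits", "excludes"],
)
print(prompt)
```

Keeping the four parts as separate arguments makes it harder to forget one of them, which is the usual reason a search prompt comes back too broad or in the wrong shape.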
Refining your search
The first pass gives you a broad view. From there, you can narrow based on what you find.
Narrow by clause type. "Show me only the documents where indemnification is mutual versus one-directional." The agent re-examines its findings and categorizes each clause by directionality.
Narrow by scope. "Which of these indemnification clauses include coverage for intellectual property infringement?" The agent filters results to show only clauses extending to IP-related claims.
Narrow by limitation. "Are there any caps on indemnification liability? List every document where the obligation is capped at a dollar amount or a multiple of fees paid." The agent identifies monetary caps, percentage-based limits, and time-based restrictions.
Compare across documents. "Create a comparison table showing the indemnification terms across all 30 agreements: who indemnifies whom, what's covered, what's excluded, and any caps." The agent produces a structured comparison that would take a paralegal days to compile manually.
Follow cross-references. "Document 14 references Section 8.3 of the Master Agreement. Read the Master Agreement and tell me what Section 8.3 says." If the Master Agreement is in the same folder, the agent follows the reference and reports back.
Each refinement builds on the previous search. The agent retains context from the initial scan, so follow-up queries execute faster.
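If you ask the agent to return its findings in a structured form, the same kinds of refinement can also be applied offline. The following is a minimal sketch in Python, assuming a hypothetical Finding record whose fields mirror the result format shown in Step 4. The classification labels and the mutual flag are illustrative choices for this example, not a docrew schema, and the labels assigned to the two sample rows are this sketch's own reading of the quoted clauses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One search result (hypothetical structure for illustration)."""
    document: str
    location: str
    quote: str
    classification: str  # assumed labels: "grants", "limits", "excludes"
    mutual: bool = False  # True when the obligation runs in both directions


# The two example results from this article, with assumed labels.
FINDINGS = [
    Finding(
        "Meridian Supply Agreement (2024).pdf",
        "Section 7.2, page 12",
        "Vendor shall indemnify and hold harmless Client from and against all "
        "claims, damages, and expenses arising from Vendor's negligence or "
        "willful misconduct.",
        "grants",
    ),
    Finding(
        "Apex Logistics Contract.docx",
        "Section 5, page 8",
        "Each party shall be solely responsible for any and all costs, damages, "
        "or liabilities arising from its own performance under this Agreement.",
        "grants",
        mutual=True,
    ),
]


def filter_by_classification(findings: list[Finding], label: str) -> list[Finding]:
    """Keep only the findings carrying the given classification label."""
    return [f for f in findings if f.classification == label]


def by_directionality(findings: list[Finding]) -> dict[str, list[str]]:
    """Group document names into 'mutual' and 'one-directional'."""
    groups: dict[str, list[str]] = defaultdict(list)
    for f in findings:
        groups["mutual" if f.mutual else "one-directional"].append(f.document)
    return dict(groups)


groups = by_directionality(FINDINGS)
print(groups)
```

This does not replace asking the agent follow-up questions; it only shows that once results carry a classification, narrowing by clause type becomes a simple filter.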
Handling different document types
Cross-document search works across formats because the agent adapts its reading approach to each file type.
PDF documents. The agent extracts text from native PDFs directly. For scanned PDFs (images of text), it applies optical character recognition to read the content before searching.
Word documents. The agent reads DOCX files natively, preserving headings, sections, and paragraph numbering, so results can reference specific sections by heading rather than just page numbers.
Spreadsheets. If your folder includes Excel files -- a schedule of contracts, a risk register, a compliance checklist -- the agent reads those too and can cross-reference findings against the spreadsheet data.
Mixed folders. A folder with PDFs, Word documents, and Excel files can be searched in a single pass. You do not need to separate files by format.
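Before starting a search, it can help to check what a folder actually contains. The sketch below uses only the Python standard library to count files by extension, including those in subfolders. It does not extract text or run OCR, and the function name inventory_by_format is an assumption made for this example. The demo builds a throwaway folder so the snippet runs anywhere.

```python
import tempfile
from collections import Counter
from pathlib import Path


def inventory_by_format(folder: str) -> Counter:
    """Count files under `folder` (recursively) by lowercase extension."""
    return Counter(
        p.suffix.lower() or "(none)"
        for p in Path(folder).rglob("*")
        if p.is_file()
    )


# Demo on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["a.pdf", "b.PDF", "c.docx", "sub/d.xlsx"]:
        (root / name).touch()
    counts = inventory_by_format(tmp)

for ext, n in sorted(counts.items()):
    print(f"{ext}: {n}")
```

An unexpected extension in the inventory is a cue to remove the file or mention it in the prompt before the search begins.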
When to use cross-document search
Regulatory audit. Verify that a specific requirement is addressed across all relevant policies and procedures, including indirect compliance through related controls.
Due diligence. Search a data room with dozens of documents for every reference to change-of-control provisions, assignment restrictions, or consent requirements, regardless of phrasing.
Contract portfolio analysis. Understand your aggregate exposure for a specific risk type across 50 vendor agreements, with a consolidated view of your position.
Knowledge management. Find everything your organization has written about a specific topic across hundreds of memos, reports, and analyses -- searched by meaning, not just keywords.
Compliance mapping. Map specific regulatory requirements to the documents and controls that satisfy them. The agent reads both the regulatory text and your internal documents, identifying where each requirement is addressed.
Tips for effective cross-document search
Be specific about what you want. "Find everything about liability" is too broad. "Find every clause that limits either party's total liability to a dollar amount or formula" delivers targeted results.
Specify your output format upfront. Telling the agent you want a table with document name, section, quote, and classification saves reformatting later.
Start broad, then narrow. Run the initial search across all documents, review the landscape, then drill into specific areas. This is faster than running multiple narrow searches that might miss something.
Include context documents. If your agreements reference a master agreement or terms and conditions, include those files in the folder so the agent can follow cross-references when the referenced documents are available.
Save the output. Ask the agent to write results to a CSV, spreadsheet, or structured report. This gives you a permanent record and a starting point for the next review cycle.
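If the agent returns findings as text rather than as a file, writing them to CSV yourself takes only a few lines. The following is a minimal sketch with Python's standard csv module. The column names follow the output format suggested in this article, the function name save_findings and the output file name are assumptions, and the sample row reuses the Meridian example.

```python
import csv
import os
import tempfile

FIELDS = ["document", "location", "quote", "classification"]


def save_findings(path: str, rows: list[dict[str, str]]) -> None:
    """Write findings to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


rows = [
    {
        "document": "Meridian Supply Agreement (2024).pdf",
        "location": "Section 7.2, page 12",
        "quote": "Vendor shall indemnify and hold harmless Client from and "
                 "against all claims, damages, and expenses arising from "
                 "Vendor's negligence or willful misconduct.",
        "classification": "Grants indemnification, limited to negligence "
                          "and willful misconduct.",
    },
]

# Hypothetical output location for the demo.
path = os.path.join(tempfile.gettempdir(), "indemnification_findings.csv")
save_findings(path, rows)

# Read the file back to confirm quotes containing commas survive the round trip.
with open(path, newline="", encoding="utf-8") as fh:
    loaded = list(csv.DictReader(fh))
print(loaded[0]["document"])
```

Because contract quotes almost always contain commas, it is safer to let a CSV library handle quoting than to join fields by hand.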
Cross-document search with docrew replaces the manual process of opening each file, running keyword searches, reading surrounding context, and compiling results. The agent does all of that in a single pass, finds results that keyword search would miss, and produces organized output ready for review. For anyone who regularly needs to search across collections of documents, this is the difference between spending hours on a search and getting a comprehensive answer in minutes.