Vector Search vs Keyword Search for Document Collections
A technical comparison of vector search and keyword search for document collections -- how each works, where each excels, and when hybrid search is the right answer.
Two ways to find what you need
You have a collection of documents. You need to find the ones that answer a specific question. There are two fundamentally different approaches to this problem.
Keyword search finds documents that contain the words you searched for. It is the technology behind nearly every search bar you have ever used. Type a word, get documents that contain that word.
Vector search finds documents that are semantically similar to your query. It converts both your query and your documents into mathematical representations and measures the distance between them. Documents that are "close" to your query are returned, even if they share no words in common.
Both approaches are mature, well-understood, and widely deployed. Neither is universally better. Each has failure modes that the other handles well. Understanding the mechanics of each -- and where they break down -- is essential for building effective search over document collections.
How keyword search works
Keyword search is built on a data structure called an inverted index. An inverted index maps every word in the collection to the list of documents that contain it. If the word "indemnification" appears in documents 3, 17, and 42, the index stores that mapping. When you search for "indemnification," the index returns documents 3, 17, and 42 in constant time.
Building the inverted index involves tokenizing documents into terms, normalizing those terms (stemming variants like "payments" and "paying" down to a common stem such as "pay"), and storing each term with its document locations. At query time, the same normalization is applied to the search terms.
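The core data structure fits in a few lines. This is a minimal sketch (toy tokenizer, no stemming or stopword removal, hypothetical document IDs) of the term-to-documents mapping described above:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    # Real systems add stemming, stopword removal, etc.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    # Map each term to the set of document IDs that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

docs = {
    3: "The indemnification clause survives termination.",
    17: "Indemnification obligations are capped at fees paid.",
    42: "Each party provides indemnification for third-party claims.",
    8: "Payment is due within 30 days.",
}
index = build_inverted_index(docs)
print(sorted(index["indemnification"]))  # [3, 17, 42]
```

The query path is a single dictionary lookup, which is what makes keyword search so fast: all the work happens at indexing time.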
Ranking determines the order of results. The two dominant algorithms are:
TF-IDF (Term Frequency - Inverse Document Frequency). A term's importance in a document is proportional to how frequently it appears in that document (TF) and inversely proportional to how many documents in the collection contain it (IDF). A word that appears 10 times in one document but in only 2 of 500 documents is highly relevant to that document. A word that appears in 450 of 500 documents (like "the") is not informative.
BM25 (Best Match 25). An evolution of TF-IDF that adds document length normalization and adjustable parameters. BM25 penalizes very long documents that might match many terms simply because they contain more text. It also has a saturation function for term frequency -- the 20th occurrence of a word does not increase relevance as much as the 2nd occurrence. BM25 is the default ranking algorithm in Elasticsearch, Solr, and most modern search engines.
Both algorithms are fast. An inverted index lookup is O(1) per term. Ranking a few hundred candidate documents takes microseconds. Keyword search can process millions of documents per second on modest hardware.
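To make the ranking formula concrete, here is a minimal BM25 scorer over pre-tokenized documents. It is a sketch, not a production implementation: the corpus, tokenization, and default parameters (k1=1.5, b=0.75) are illustrative.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len,
               k1=1.5, b=0.75):
    # Score one tokenized document against a tokenized query.
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        n = doc_freq.get(term, 0)
        if n == 0:
            continue  # term appears in no document
        # IDF: rare terms contribute more than common ones.
        idf = math.log((num_docs - n + 0.5) / (n + 0.5) + 1)
        freq = tf[term]
        # TF with saturation and document-length normalization.
        norm = freq * (k1 + 1) / (
            freq + k1 * (1 - b + b * len(doc_terms) / avg_len))
        score += idf * norm
    return score

docs = {
    "A": "late payment shall incur interest".split(),
    "B": "overdue amounts bear interest at the default rate".split(),
    "C": "payment is due within thirty days late payment incurs a penalty".split(),
}
num_docs = len(docs)
avg_len = sum(len(t) for t in docs.values()) / num_docs
doc_freq = Counter()
for terms in docs.values():
    doc_freq.update(set(terms))  # count documents, not occurrences

query = "late payment".split()
ranked = sorted(docs, key=lambda d: bm25_score(
    query, docs[d], doc_freq, num_docs, avg_len), reverse=True)
print(ranked)  # ['A', 'C', 'B']
```

Note how document B scores zero despite being about the same concept -- it shares no query terms. That gap is exactly what vector search addresses.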
How vector search works
Vector search operates on a completely different principle. Instead of matching words, it matches meaning.
The process starts with an embedding model -- a neural network that converts text into a fixed-length vector (array of floating-point numbers). A typical embedding model might produce a 768-dimensional vector for any input text, whether it is a single word or a full paragraph.
The critical property of these vectors is that semantically similar texts produce similar vectors. "The contract expires on December 31" and "This agreement terminates at year-end" will have vectors that are close together in the 768-dimensional space, even though they share almost no words.
Similarity between vectors is typically measured by cosine similarity: the cosine of the angle between two vectors. Vectors pointing in the same direction (cosine similarity close to 1.0) represent similar meaning. Orthogonal vectors (cosine similarity close to 0.0) represent unrelated meaning.
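The metric itself is simple. This sketch uses tiny hand-made 3-dimensional vectors in place of real embeddings (which would have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- real models output 768+ dimensions.
contract = [0.9, 0.1, 0.3]
agreement = [0.8, 0.2, 0.4]
recipe = [0.1, 0.9, 0.0]

print(cosine_similarity(contract, agreement))  # high: similar direction
print(cosine_similarity(contract, recipe))     # low: unrelated direction
```

In practice, document vectors are often normalized to unit length at indexing time, so cosine similarity reduces to a plain dot product.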
Indexing for vector search means storing document vectors in a structure that supports fast similarity lookup. Brute-force search works for small collections but does not scale, so production systems use Approximate Nearest Neighbor (ANN) algorithms, which provide near-exact results in a fraction of the time:
- HNSW graphs build a multi-layer navigable structure with sub-millisecond search on millions of vectors.
- IVF partitions the vector space into clusters, searching only the nearest ones.
- Product Quantization compresses vectors to reduce memory usage at the cost of some accuracy.
Vector search is more computationally expensive than keyword search. Embedding a query takes 5-50ms. ANN search adds 1-10ms. Fast in absolute terms, but an order of magnitude slower than keyword search.
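The baseline that ANN indexes approximate is exhaustive comparison against every document vector. This sketch shows brute-force cosine search with NumPy; the random vectors stand in for real embeddings, and the collection size is arbitrary:

```python
import numpy as np

def brute_force_search(doc_vectors, query_vector, k=3):
    # Normalize so that a dot product equals cosine similarity.
    docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    q = query_vector / np.linalg.norm(query_vector)
    sims = docs @ q                # one matrix-vector product: O(n * d)
    top = np.argsort(-sims)[:k]   # indices of the k most similar documents
    return top, sims[top]

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(10_000, 768)).astype(np.float32)
# A query derived from document 42 plus a little noise.
query = doc_vectors[42] + 0.01 * rng.normal(size=768).astype(np.float32)

indices, scores = brute_force_search(doc_vectors, query)
print(indices[0])  # 42 -- the document the query was derived from
```

On 10,000 documents this full scan is still fast; ANN structures like HNSW matter once collections reach millions of vectors, where scanning every vector per query becomes the bottleneck.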
What keyword search is good at
Keyword search excels in several scenarios that are common in document processing:
Exact term matching. When you know the exact term you are looking for -- a contract number, a party name, a specific clause reference like "Section 7.2(b)" -- keyword search finds it instantly and precisely. There is no ambiguity. The term is either in the document or it is not.
Known vocabulary queries. When users search using the same terminology that appears in the documents, keyword search works perfectly. A lawyer searching for "force majeure" in contracts will find every instance. The vocabulary match is exact.
Boolean queries. Keyword search naturally supports AND, OR, and NOT operations. "Find documents containing 'indemnification' AND 'limitation of liability' but NOT 'consequential damages'" is a straightforward Boolean query on an inverted index. Vector search does not natively support exclusion or precise Boolean logic.
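On an inverted index, Boolean operators reduce to set operations. A minimal sketch with a hypothetical index (note that multi-word phrases like "limitation of liability" would additionally require positional information, omitted here for simplicity):

```python
# Inverted index as term -> set of document IDs (toy data).
index = {
    "indemnification": {3, 17, 42},
    "liability": {17, 42, 51, 60},
    "consequential": {42, 60},
}

# 'indemnification' AND 'liability' NOT 'consequential'
result = (index["indemnification"] & index["liability"]) - index["consequential"]
print(sorted(result))  # [17]
```

The intersection, union, and difference operators map directly onto AND, OR, and NOT -- there is no equivalent operation on embedding vectors.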
Phrase matching. "Liquidated damages" as a phrase means something specific. Keyword search can match this exact phrase, distinguishing it from documents that mention "liquidated" and "damages" separately in unrelated contexts. Vector search treats the phrase as a bag of semantic meaning and may return documents about damages more broadly.
High-recall structured queries. When you need to find every document that mentions a specific term -- not the most relevant ones, but all of them -- keyword search provides exhaustive results. The inverted index is a complete inventory. If the word is indexed, it will be found.
Speed. Keyword search on an inverted index is fast enough that users perceive it as instant. Sub-millisecond response times on collections of millions of documents are routine. This matters for interactive applications where users type and expect immediate results.
Simplicity. An inverted index is straightforward to build, debug, and maintain. When search results are wrong, the debugging process is transparent: check what tokens were generated, check what the index contains, check the ranking scores. The system is deterministic and inspectable.
What vector search is good at
Vector search handles the cases where keyword search fails:
Semantic matching. The user asks about "early termination penalties" but the contract says "fees incurred upon premature cessation of this agreement." There is no keyword overlap. Vector search finds this because both expressions occupy similar regions in the embedding space. They mean the same thing, and the embedding model knows it.
Synonym handling. "Vendor," "supplier," "service provider," and "contractor" may all refer to the same party in different documents. A keyword search for "vendor" misses documents that use "supplier." A vector search for "vendor" finds all of them because the embedding model understands they are semantically related.
Conceptual queries. "Find clauses that limit our ability to work with competitors." This is not a keyword -- it is a concept. The relevant clauses might use terms like "non-compete," "exclusivity," "restrictive covenant," or "competitive restriction." Vector search can match the concept across all these phrasings.
Multilingual retrieval. Multilingual embedding models produce similar vectors for the same meaning expressed in different languages. A query in English can retrieve relevant documents in French, German, or Spanish. Keyword search requires the query and the document to share the same language (or explicit translation).
Fuzzy matching without configuration. Keyword search needs explicit configuration for fuzzy matching: edit distance thresholds, synonym dictionaries, stemming rules. Vector search gets approximate matching for free as a property of the embedding space. Misspellings, abbreviations, and variant phrasings are handled implicitly.
Natural language queries. Users can ask questions in natural language rather than constructing keyword queries. "What happens if we miss a payment deadline?" retrieves relevant clauses even though the documents might use formal legal language that shares no words with the question.
Where each breaks down
Neither approach is without failure modes. Understanding these failures is essential for choosing the right approach -- or combining both.
Keyword search fails on:
Paraphrases. If the user's words and the document's words do not overlap, keyword search returns nothing. This is the fundamental limitation: keyword search matches tokens, not meaning.
Implicit information. "Which contracts allow renewal?" requires understanding that absence of a non-renewal clause might imply auto-renewal. Keyword search cannot reason about what a document does not say.
Domain mismatch. When users and document authors use different vocabularies -- a business user asking about "penalties" when the contracts say "liquidated damages" -- keyword search misses the connection.
Vector search fails on:
Exact identifiers. Searching for contract number "MSA-2024-0847" with vector search may return documents with similar-looking but different contract numbers. The embedding model does not treat the string as a unique identifier -- it treats it as a sequence of subwords with semantic meaning.
Negation. "Contracts that do NOT contain an arbitration clause" is not a vector search query. Vector search finds documents similar to the query about arbitration, which is the opposite of what is needed. Negation requires Boolean logic, which is keyword search's strength.
Rare technical terms. Embedding models are trained on large text corpora. Terms that are rare in the training data -- highly specialized jargon, product-specific abbreviations, internal acronyms -- may not be embedded accurately. The model has not seen enough examples to learn their meaning.
Precision on specific terms. When the user searches for a specific word and means only that word, vector search may over-generalize. Searching for "LIBOR" might return documents about interest rates generally, diluting the results with content about prime rates, SOFR, and other benchmarks.
Short, ambiguous queries. "Damages" could mean property damage, monetary damages, liquidated damages, or consequential damages. An embedding model produces a single vector that represents some average of these meanings. Keyword search at least returns all documents containing the word, letting the user disambiguate.
Hybrid search: combining both approaches
The complementary failure modes of keyword and vector search make a strong case for combining them. Hybrid search runs both approaches on the same query and merges the results.
The most common merging strategy is Reciprocal Rank Fusion (RRF). Each search approach produces a ranked list of results. RRF combines the lists by scoring each document based on its rank in each list:
For each document, the RRF score is the sum of 1/(k + rank) across all lists, where k is a constant (typically 60). A document that ranks #1 in both lists gets the highest combined score. A document that ranks #1 in one list and does not appear in the other still gets a reasonable score. This fusion does not require the scores from each list to be comparable -- it works purely on ranks.
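RRF is only a few lines of code. This sketch uses hypothetical result lists; in practice they would come from a BM25 query and an ANN query over the same collection:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # Each input list is an ordered sequence of document IDs, best first.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["A", "C", "F"]  # e.g. a BM25 ranking
vector_results = ["B", "A", "D"]   # e.g. a cosine-similarity ranking

fused = reciprocal_rank_fusion([keyword_results, vector_results])
print(fused)  # A ranks first: it appears near the top of both lists
```

Document A wins because it earns a contribution from both lists, while every other document appears in only one -- the behavior the prose above describes, achieved without ever comparing BM25 scores to cosine similarities.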
Weighted scoring is an alternative. Normalize the scores from each approach to the same range and take a weighted average. This lets you tune the balance: 70% keyword, 30% vector, for example. The weights depend on the query type and the collection.
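A sketch of weighted fusion with min-max normalization, using made-up raw scores (BM25 scores are unbounded; cosine similarities fall in [-1, 1], which is why normalization is needed before averaging):

```python
def min_max_normalize(scores):
    # Map raw scores into [0, 1]; a constant list normalizes to all zeros.
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_fusion(keyword_scores, vector_scores, keyword_weight=0.7):
    kw = min_max_normalize(keyword_scores)
    vec = min_max_normalize(vector_scores)
    docs = set(kw) | set(vec)
    # A document missing from one list contributes 0 from that list.
    return {d: keyword_weight * kw.get(d, 0.0)
               + (1 - keyword_weight) * vec.get(d, 0.0)
            for d in docs}

keyword_scores = {"A": 12.4, "C": 9.1, "F": 3.2}   # raw BM25 scores
vector_scores = {"B": 0.91, "A": 0.88, "D": 0.75}  # cosine similarities

fused = weighted_fusion(keyword_scores, vector_scores)
best = max(fused, key=fused.get)
print(best)  # A
```

Unlike RRF, this approach is sensitive to how scores are distributed within each list, which is why the weights typically need tuning per collection.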
Query-dependent weighting goes further. Some queries are clearly keyword queries ("MSA-2024-0847") and some are clearly semantic queries ("what are our obligations if the supplier fails to deliver"). A classifier can detect the query type and adjust the weights accordingly: pure keyword for identifier lookups, pure vector for conceptual questions, balanced for everything in between.
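The classifier can start as a simple heuristic before graduating to a trained model. The patterns and thresholds below are purely illustrative, not tuned rules:

```python
import re

def classify_query(query):
    # Heuristic router: identifier-like or quoted queries go to keyword
    # search, question-like queries go to vector search.
    if re.search(r"\b[A-Z]{2,}-\d{4}-\d+\b", query):
        return "keyword"  # looks like a document identifier
    if '"' in query:
        return "keyword"  # quoted phrase: user wants exact matching
    if len(query.split()) >= 6 or query.rstrip().endswith("?"):
        return "vector"   # natural-language question
    return "hybrid"       # short ambiguous query: run both

print(classify_query("MSA-2024-0847"))                                # keyword
print(classify_query("what are our obligations if delivery fails?"))  # vector
print(classify_query("late payment"))                                 # hybrid
```

Even this crude routing prevents the worst failure mode: sending an exact identifier to an embedding model that will happily return similar-looking but wrong contract numbers.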
In published benchmarks, hybrid search typically outperforms either approach alone. The improvement is most significant on diverse query sets where some queries are keyword-friendly and others are semantic. The cost is complexity: you maintain two indexes, run two searches per query, and implement a fusion algorithm.
Document-specific challenges
Document collections present challenges that generic search benchmarks do not capture.
Domain vocabulary. Legal documents use terminology that embedding models may not have seen enough of during training. "Representations and warranties," "pari passu," "force majeure" -- these terms have precise legal meanings that a general-purpose embedding model may not fully capture. The vectors for "pari passu" might be close to general financial terms rather than to its specific legal meaning of equal treatment.
Abbreviations and acronyms. Terms like "SOW" (Statement of Work) and "LOI" (Letter of Intent) are meaningful within their document context but opaque to both search approaches without additional processing.
Document structure. Tables, headers, section numbers, and cross-references carry meaning that is lost when documents are flattened to text.
Multi-part documents. Exhibits and amendments live in separate files, and neither search approach naturally follows cross-document references.
Version confusion. Multiple drafts contain similar content with subtle differences.
Length variation. A one-page NDA and a 200-page merger agreement present very different search challenges.
These challenges require a combination of better preprocessing, structure-aware chunking, and domain-aware retrieval strategies.
Practical comparison: same query, different results
Consider the query: "What are the consequences of late payment?"
Keyword search results (BM25):
- Contract A, Section 8.3: "Late payment shall incur interest at a rate of 1.5% per month." -- Matches "late" and "payment" directly.
- Contract C, Section 12: "In the event of late payment, the Supplier may suspend services." -- Matches "late" and "payment" directly.
- Contract F, Section 5.1: "Payment is due within 30 days. Late submissions will be subject to a $500 penalty." -- Matches "late" and "payment" but in separate sentences.
Missed by keyword search:
- Contract B, Section 9: "Overdue amounts shall bear interest at the Default Rate." -- Uses "overdue" instead of "late payment."
- Contract D, Section 11.4: "Failure to remit amounts when due shall constitute a material breach." -- Describes the same concept with entirely different vocabulary.
Vector search results:
- Contract B, Section 9: "Overdue amounts shall bear interest at the Default Rate." -- Semantically similar to "late payment consequences."
- Contract A, Section 8.3: "Late payment shall incur interest at a rate of 1.5% per month." -- Both semantically and lexically relevant.
- Contract D, Section 11.4: "Failure to remit amounts when due shall constitute a material breach." -- Semantic match on the concept.
Missed by vector search:
- Contract E, Exhibit A, Fee Schedule: A table with a row "Late Payment Fee: $500 per incident." -- The table format and short text produce a weak embedding.
Hybrid search results (RRF):
Combines results from both lists. Contracts A, B, C, and D all appear. The table from Contract E might still be missed unless the chunking preserved enough context for the embedding to capture its relevance.
This example illustrates the complementary strengths. Keyword search finds exact matches. Vector search finds paraphrases. Hybrid search finds both. Neither finds everything -- the table remains a challenge for both approaches.
When to use what
The right search approach depends on your specific context:
Use keyword search when:
- Users search for exact terms, identifiers, or known phrases.
- The collection uses consistent vocabulary (single-author, single-domain).
- Boolean queries are important (AND, OR, NOT, phrase matching).
- You need exhaustive recall for specific terms (find every document that mentions "arbitration").
- Simplicity matters -- you want a system that is easy to build, debug, and maintain.
- The collection is very large and search speed is critical.
Use vector search when:
- Users ask questions in natural language rather than keyword queries.
- Documents use varied vocabulary to express similar concepts (multi-author, multi-source collections).
- Synonym and paraphrase matching is important.
- The collection spans multiple languages.
- Users do not know the exact terminology used in the documents.
Use hybrid search when:
- The query mix is diverse -- some keyword-style, some semantic.
- You cannot predict how users will search.
- Maximum recall is more important than implementation simplicity.
- You are building a production system where search quality directly affects user outcomes.
Cost and complexity
Keyword search is simpler and cheaper. An inverted index requires storing terms and document IDs. No embedding model, no GPU, no vector storage. BM25 runs on CPU in microseconds. The entire system can be implemented in a few hundred lines of code or by deploying SQLite with FTS5.
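To illustrate how little infrastructure keyword search needs, here is full-text search with BM25 ranking using only the standard library, assuming your Python build's bundled SQLite includes the FTS5 extension (most modern builds do):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("Contract A", "Late payment shall incur interest at 1.5% per month."),
        ("Contract B", "Overdue amounts shall bear interest at the Default Rate."),
        ("Contract C", "In the event of late payment, the Supplier may suspend services."),
    ],
)

# Phrase query with built-in BM25 ranking (lower bm25() = better match).
rows = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ('"late payment"',),
).fetchall()
print([r[0] for r in rows])  # Contracts A and C; B uses different words
```

Contract B is invisible to this query despite being about the same concept -- the recurring trade-off that motivates adding vector search only when you have measured that this kind of miss actually matters for your users.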
Vector search adds significant complexity: an embedding model to host, vector storage to manage, documents to chunk before embedding. The vectors themselves consume storage -- 768-dimensional float32 vectors at 3KB each add up. Hybrid search multiplies this further with two indexes, two search paths, and a fusion algorithm to tune.
The cost-quality trade-off is real. Keyword search is good enough for many use cases, especially when users know the terminology. Vector search is worth the complexity when semantic matching matters. For teams evaluating their options: start with keyword search, measure where it fails, and add vector search to address the specific failure modes you observe.
Where search fits in the broader picture
Search -- whether keyword, vector, or hybrid -- is a retrieval mechanism. It finds relevant content. But finding content is only the first step in document processing. The real work happens after retrieval: analysis, comparison, extraction, summarization.
This is where the interaction between search and language models becomes important. Search narrows the collection to the relevant subset. The language model reasons about that subset to produce the answer. The quality of the search determines whether the model has the right information to work with.
docrew takes this a step further by making the search itself part of the agent's reasoning loop. Instead of a fixed search pipeline that runs before the model sees anything, the agent uses file_grep as a tool -- a search operation it can invoke, evaluate, and refine as part of its reasoning process. The agent searches, reads the results, decides whether it has enough information, and searches again if needed. This iterative approach handles the ambiguity and complexity of real document collections better than a single-shot retrieval pipeline.
The distinction matters. A fixed pipeline commits to a search strategy before the model reasons. An agent-driven search adapts the search strategy based on the model's understanding of the task. When the first search does not return what is needed, the agent reformulates and tries again. This flexibility is particularly valuable for document processing, where queries are often ambiguous and the relevant content is not always where you expect it.
Search technology will continue to improve -- better embedding models, faster indexes, smarter fusion algorithms. But the fundamental tension between keyword precision and semantic recall will persist. The question is not which approach wins. The question is how to combine them effectively for the specific documents, queries, and accuracy requirements you face.