
Building Multi-Model Pipelines: Router, Worker, Synthesizer

How to architect multi-model AI pipelines using a router-worker-synthesizer pattern -- reducing costs by 5-10x while maintaining output quality through intelligent task routing.


One model does not fit all

The default approach to building an AI product is simple: pick the best model, send everything to it, done.

This works until you look at the bill.

Top-tier language models are expensive. Gemini Pro, GPT-4, Claude Opus -- these models deliver excellent results on complex tasks. They also deliver excellent results on trivial tasks, at the same per-token price. Using a Pro-tier model to classify a document as "invoice" or "contract" is like hiring a senior attorney to sort mail. The work gets done. The cost-to-value ratio is absurd.

The economics of AI processing create a natural pressure toward specialization. Not every task needs the most capable model. Most tasks, in fact, do not. The distribution in real-world workloads is heavily skewed: 70 to 80 percent of requests are simple -- classification, formatting, short extraction, straightforward summarization. 15 to 20 percent are moderate -- multi-step analysis, comparison, structured extraction from complex documents. Only 5 to 10 percent are genuinely hard -- multi-document synthesis, nuanced reasoning, complex analytical work.

If you send everything to Pro, you are paying Pro prices for the 80 percent of tasks that a model costing one-tenth as much handles equally well.

The multi-model pipeline solves this by matching model capability to task complexity. The pattern has three stages: a router that classifies the task, workers that execute it, and a synthesizer that combines the results.

The three-stage pattern

Router. The first stage evaluates the incoming task and decides which model should handle it. The router is a lightweight classifier -- the cheapest, fastest model available. It reads the user's request, estimates the complexity, and routes to the appropriate worker model. The router itself adds minimal cost and latency (a fraction of a second and a few hundred tokens).

Worker. The second stage executes the task using the model selected by the router. Simple tasks go to a fast, cheap model. Complex tasks go to a powerful, expensive model. The worker does the actual analysis, extraction, or generation. In multi-document scenarios, multiple workers can run in parallel, each handling one piece of the overall task.

Synthesizer. The third stage combines results when the task involved multiple workers. If a single worker handled the entire task, the synthesizer step is skipped -- the worker's output goes directly to the user. If multiple workers each analyzed a different document, the synthesizer merges their outputs into a coherent deliverable.

This is not a novel architectural pattern. It is the same dispatcher-worker-aggregator pattern that distributed systems have used for decades. What makes it interesting in the AI context is that the router itself is a language model, making routing decisions based on semantic understanding of the task rather than keyword matching or rule engines.
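In code, the three stages compose naturally. This is a minimal sketch of the pattern, with hypothetical stand-ins (`classifyTask`, `runWorker`, `synthesize`) where real model calls would go:

```typescript
// Sketch of the three-stage pipeline. classifyTask, runWorker, and
// synthesize are hypothetical stand-ins for real model calls.
type Tier = "light" | "medium" | "heavy";

interface WorkerResult {
  doc: string;
  analysis: string;
}

async function classifyTask(task: string): Promise<Tier> {
  // Stubbed router: a real implementation calls a lightweight model here.
  return task.length > 200 ? "heavy" : "light";
}

async function runWorker(tier: Tier, doc: string): Promise<WorkerResult> {
  // Stubbed worker: a real implementation calls the tier's model.
  return { doc, analysis: `[${tier}] analysis of ${doc}` };
}

function synthesize(results: WorkerResult[]): string {
  // A single-worker task skips synthesis entirely.
  if (results.length === 1) return results[0].analysis;
  return results.map((r) => r.analysis).join("\n");
}

async function pipeline(task: string, docs: string[]): Promise<string> {
  const tier = await classifyTask(task); // stage 1: route
  const results = await Promise.all(docs.map((d) => runWorker(tier, d))); // stage 2: workers in parallel
  return synthesize(results); // stage 3: combine
}
```

The key structural point is that the worker stage fans out over documents while the router and synthesizer remain single calls at the edges.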

Router design

The router is the most important component in the pipeline. If it misroutes, everything downstream suffers: a complex task sent to a cheap model produces poor results, and a simple task sent to an expensive model wastes money.

A good router answers one question: how complex is this task?

The classification does not need to be granular. Three to four tiers are sufficient for most workloads:

Light. Simple classification, formatting, short extraction. "What type of document is this?" "Reformat this text as bullet points." "What is the invoice total?" These tasks require minimal reasoning. A Flash Lite or equivalent model handles them in under a second at minimal cost.

Medium. Standard analysis requiring moderate reasoning. "Summarize this 10-page report." "Extract all key terms from this contract." "Compare these two versions of a document." These tasks need a capable model but not the most powerful one. A Flash-tier model handles them well.

Heavy. Complex analysis requiring deep reasoning, multi-step logic, or nuanced understanding. "Analyze this merger agreement and identify clauses that conflict with our standard terms." "Synthesize findings from these twelve research papers into a literature review." "Review this contract for non-standard provisions and assess their risk." These tasks need a Pro-tier model.

Image. Tasks that require visual processing -- scanned documents, charts, mixed-content pages. These go to a model with strong vision capabilities, regardless of text complexity.

The router model receives the user's request (and optionally the first few hundred tokens of the document) and classifies it into one of these tiers. The classification prompt is short and focused: "Given this task description, classify its complexity as light, medium, heavy, or image." The router model produces a one-word answer.

Because the router is a lightweight model processing a short prompt, its cost per classification is negligible -- a fraction of a cent. Its latency is typically 200 to 500 milliseconds. This overhead is invisible to the user and pays for itself many times over through correct routing.
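A router call can be as small as this sketch, where `generateText` is a hypothetical wrapper around whatever lightweight model serves as the classifier:

```typescript
// Sketch of a router call. generateText is a hypothetical wrapper around
// whatever lightweight model serves as the classifier.
const TIERS = ["light", "medium", "heavy", "image"] as const;
type Tier = (typeof TIERS)[number];

const routerPrompt = (task: string) =>
  "Given this task description, classify its complexity as " +
  `light, medium, heavy, or image. Answer with one word.\n\nTask: ${task}`;

async function routeTask(
  task: string,
  generateText: (prompt: string) => Promise<string>,
): Promise<Tier> {
  const raw = (await generateText(routerPrompt(task))).trim().toLowerCase();
  // Default to the mid tier on anything unexpected.
  return (TIERS as readonly string[]).includes(raw) ? (raw as Tier) : "medium";
}
```

Normalizing the model's answer and defaulting to the mid tier on malformed output keeps one flaky classification from derailing the request.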

Router accuracy matters more than router speed. A router that is 95 percent accurate saves far more money than a faster one that is only 80 percent accurate. The cost of misrouting a complex task to a cheap model (poor results, user frustration, potential rework) exceeds the cost of misrouting a simple task to an expensive model (correct results, slightly higher cost).

In practice, routing accuracy of 90 to 95 percent is achievable with a well-crafted classification prompt and a decent lightweight model. The remaining 5 to 10 percent of misrouted tasks divide roughly equally between over-routing (simple tasks sent to expensive models) and under-routing (complex tasks sent to cheap models). Over-routing wastes money. Under-routing wastes user trust. Most systems bias the router slightly toward over-routing because the cost of a correct-but-expensive result is lower than the cost of a wrong-but-cheap result.

Worker models: matching capability to complexity

Each tier in the routing scheme maps to a model with appropriate capability and cost characteristics.

Tier 1: Flash Lite (or equivalent ultra-fast models). Optimized for speed and cost. Handles classification, formatting, simple extraction, and short-form generation. Processes requests in under a second at fractions of a cent. Weak on multi-step logic and complex analysis.

Tier 2: Flash (or equivalent mid-tier models). Balances capability and cost. Handles standard document analysis, summarization, structured extraction, and moderate reasoning. The workhorse tier -- 3 to 5 times Flash Lite's cost per token, but covers a much wider range of tasks reliably.

Tier 3: Pro (or equivalent top-tier models). The most capable available. Handles complex reasoning, multi-document synthesis, nuanced analysis, and subtle implications. 10 to 30 times Flash Lite's cost per token, but produces results that cheaper models cannot.

The tier boundaries are not sharp. The router does not need to be perfect -- it needs to be right most of the time, and the system needs to handle misroutes gracefully.
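The tier-to-model mapping can live in one small table; the model identifiers below are illustrative, not exact API strings. Keeping it in one place is what makes later remapping to newer models a one-line change per tier:

```typescript
// Sketch of a tier-to-model map; the model IDs are illustrative,
// not exact API identifiers.
type Tier = "light" | "medium" | "heavy" | "image";

const MODEL_BY_TIER: Record<Tier, string> = {
  light: "gemini-flash-lite", // tier 1: fast, cheap
  medium: "gemini-flash",     // tier 2: the workhorse
  heavy: "gemini-pro",        // tier 3: deep reasoning
  image: "gemini-flash",      // vision-capable mid tier
};

function modelFor(tier: Tier): string {
  return MODEL_BY_TIER[tier];
}
```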

The synthesizer stage

When a task involves multiple workers -- analyzing ten documents, processing a batch of invoices, researching across several sources -- the synthesizer combines their outputs.

The synthesizer receives individual worker results and produces a unified output. This is more than concatenation. The synthesizer normalizes terminology across worker outputs, identifies conflicts or inconsistencies, fills gaps where one worker found information that others missed, and structures the combined output in a coherent format.

For document analysis, the synthesizer typically runs on a mid-tier or top-tier model, because cross-document reasoning requires capability that cheap models lack. The synthesizer sees all worker results and must reason across them: "Worker 3 found a liability cap of $5M, but Worker 7 found an indemnification clause that effectively uncaps liability for IP violations. These two findings need to be presented together."

The synthesizer can also be the point where quality control happens. If two workers analyzing the same document produce conflicting extractions, the synthesizer flags the conflict and can re-process the disputed section with a higher-tier model.

For simple aggregation tasks -- "extract the total from each of these 50 invoices" -- the synthesizer may just compile the results into a table. No model call needed. The synthesizer stage adapts its complexity to the task.
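That adaptive behavior can be expressed as a strategy choice, sketched here with a hypothetical `callModel` invocation and a `needsReasoning` flag standing in for whatever signal the pipeline uses to decide:

```typescript
// Sketch: the synthesizer picks a strategy based on the task's shape.
// callModel and needsReasoning are hypothetical.
interface WorkerOutput {
  doc: string;
  result: string;
}

async function synthesize(
  outputs: WorkerOutput[],
  needsReasoning: boolean,
  callModel: (prompt: string) => Promise<string>,
): Promise<string> {
  if (outputs.length === 1) return outputs[0].result; // single worker: no synthesis
  if (!needsReasoning) {
    // Simple aggregation: compile a table, no model call needed.
    return outputs.map((o) => `${o.doc}\t${o.result}`).join("\n");
  }
  // Cross-document reasoning: send all worker results to a capable model.
  const prompt =
    "Merge these analyses, normalize terminology, and flag conflicts:\n" +
    outputs.map((o) => `## ${o.doc}\n${o.result}`).join("\n");
  return callModel(prompt);
}
```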

Cost optimization in practice

The financial impact of multi-model routing is substantial.

Consider a workload of 1,000 document processing requests per day, with the typical distribution: 800 light, 150 medium, 50 heavy.

Single-model approach (Pro for everything): All 1,000 requests go to Pro. At an average cost of $0.05 per request, the daily cost is $50.

Multi-model approach (routed):

  • 800 light requests go to Flash Lite at $0.002 each: $1.60
  • 150 medium requests go to Flash at $0.01 each: $1.50
  • 50 heavy requests go to Pro at $0.05 each: $2.50
  • 1,000 routing calls at $0.001 each: $1.00
  • Total: $6.60

The multi-model approach costs 87 percent less. At scale -- 10,000 or 100,000 requests per day -- the absolute savings are significant.
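The arithmetic is simple enough to put in a function; the per-request prices below are the illustrative figures from this example, not published rates:

```typescript
// Reproduces the cost arithmetic above; the per-request prices are the
// illustrative figures from the text, not published rates.
const COST = { lite: 0.002, flash: 0.01, pro: 0.05, router: 0.001 };

function routedDailyCost(light: number, medium: number, heavy: number): number {
  const requests = light + medium + heavy;
  return (
    light * COST.lite +
    medium * COST.flash +
    heavy * COST.pro +
    requests * COST.router // every request pays the routing call
  );
}

const singleModel = 1000 * COST.pro;          // $50.00 for Pro-everything
const routed = routedDailyCost(800, 150, 50); // $6.60 routed
const savingsPct = Math.round((1 - routed / singleModel) * 100); // 87 percent
```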

Critically, the quality on the 50 heavy requests is identical: they still go to Pro. The quality on the 800 light requests is also identical: Flash Lite handles simple classification and extraction just as well as Pro. The only potential quality difference is on the 150 medium requests, and Flash-tier models handle standard analysis tasks reliably.

The router's cost is negligible. At $0.001 per classification, routing 1,000 requests costs $1. This is less than the cost of misrouting a single heavy request to Flash Lite (which would produce a poor result that the user rejects, requiring a re-run on Pro).

Latency management

Multi-model pipelines introduce additional latency from the routing step. The router must classify the task before the worker can begin. In a synchronous pipeline, this adds 200 to 500 milliseconds to every request.

Several techniques minimize the latency impact.

Speculative execution. Start the Flash-tier worker immediately while the router runs. If the router classifies the task as light or medium, the Flash result is already in progress. If it classifies heavy, abort Flash and start Pro. Since 80 percent of requests are light or medium, speculative execution produces the correct result most of the time with zero routing delay.

Parallel workers. For multi-document tasks, launch all workers simultaneously. Five workers processing five documents in parallel complete in the time of the slowest single analysis, not the sum of all five.

Streaming output. Return results as workers complete. The user sees progress immediately, even if other workers are still running. A system that shows "Analyzing document 3 of 10..." feels responsive, even if the full batch takes two minutes.
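Parallel workers and streamed results combine in a few lines; `analyze` and `onResult` here are hypothetical callbacks for the worker call and the user-facing stream:

```typescript
// Sketch of parallel workers with streamed results: each analysis is
// surfaced via onResult the moment it completes, not when the batch ends.
// analyze and onResult are hypothetical callbacks.
async function analyzeBatch(
  docs: string[],
  analyze: (doc: string) => Promise<string>,
  onResult: (doc: string, result: string) => void,
): Promise<void> {
  await Promise.all(
    docs.map(async (doc) => {
      const result = await analyze(doc);
      onResult(doc, result); // stream this result to the user immediately
    }),
  );
}
```

Wall-clock time is bounded by the slowest document, and the user sees the first result as soon as the fastest one finishes.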

System prompt caching. The system prompt is identical across all requests. Caching it eliminates redundant token processing. On Vertex AI, this is explicit context caching with a 60-minute TTL, warmed every 55 minutes to prevent expiration. The per-request savings in both cost and latency are significant.

Error handling across stages

Multi-stage pipelines have more failure points than single-model systems. Each stage can fail independently, and failure handling must be designed into the pipeline.

Router failure. If the router fails to classify the task (model error, timeout, malformed output), the system needs a default. The safest default is to route to the mid-tier model: it handles most tasks well, and the cost is moderate. Over-routing to the expensive model on router failure is safer (correct but expensive) than under-routing to the cheap model (potentially incorrect).

Worker failure. If a worker fails mid-task (timeout, model error, malformed output), the system has options: first retry with the same model, then escalate to a higher-tier model. If Flash fails on a medium task, retrying with Pro often succeeds because the additional capability overcomes whatever edge case caused the failure.
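A retry-then-escalate sketch, with a hypothetical `run` invocation and an escalation map mirroring the tier ladder:

```typescript
// Sketch of retry-then-escalate. run is a hypothetical model invocation;
// the escalation map mirrors the tier ladder described above.
const ESCALATE: Record<string, string | undefined> = {
  "flash-lite": "flash",
  flash: "pro",
  pro: undefined, // already at the top tier
};

async function runWithEscalation(
  model: string,
  task: string,
  run: (model: string, task: string) => Promise<string>,
): Promise<string> {
  try {
    return await run(model, task);
  } catch {
    try {
      return await run(model, task); // first: retry the same model
    } catch (err) {
      const next = ESCALATE[model];
      if (!next) throw err; // nowhere left to escalate
      return runWithEscalation(next, task, run); // second: escalate a tier
    }
  }
}
```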

Partial failure in batch processing. When ten workers process ten documents and one fails, the system must decide: present nine results with a note about the failure, retry automatically, or ask the user. A due diligence review needs all documents; a quick survey can tolerate one missing.

Synthesizer failure. Fall back to presenting individual results without synthesis. Less useful, but preserves all worker output.

Cascading failures. When multiple stages fail in sequence, the system needs a circuit breaker. After a threshold of failures, stop processing and report rather than spending tokens on a doomed request.
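A minimal circuit-breaker sketch (the threshold value and reset-on-success behavior are assumptions, not details from a specific implementation):

```typescript
// Minimal circuit breaker sketch: stop calling workers after a failure
// threshold instead of burning tokens on a doomed request.
class CircuitBreaker {
  private failures = 0;
  constructor(private readonly threshold: number) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold) {
      throw new Error("circuit open: aborting pipeline");
    }
    try {
      const result = await fn();
      this.failures = 0; // a success resets the count
      return result;
    } catch (err) {
      this.failures += 1;
      throw err;
    }
  }
}
```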

The general principle: fail gracefully, preserve work, communicate clearly.

Real-world example: processing 100 mixed documents

A legal team has a data room with 100 documents for a due diligence review. The documents include contracts, financial statements, corporate resolutions, property deeds, environmental reports, and correspondence. They need a structured summary of each document, flagged risk items, and a cross-document analysis of key terms.

Stage 1: Classification (Router). The router processes each document's first page and classifies it by type and complexity. Financial statements and correspondence are classified as light. Standard contracts are medium. Complex agreements (merger documents, multi-party contracts, documents with extensive schedules) are heavy. Documents with scanned pages or embedded images are flagged for vision processing.

Result: 45 light, 35 medium, 15 heavy, 5 image-requiring documents.

Stage 2: Analysis (Workers). Workers process documents in parallel, routed to the appropriate model tier.

The 45 light documents go to Flash Lite for structured summaries. Total time (10 in parallel): under 10 seconds. The 35 medium documents go to Flash for detailed clause-by-clause or line-item analysis. Total time (5 in parallel): roughly 2 minutes. The 15 heavy documents go to Pro for risk identification and non-standard clause flagging. Total time (3 in parallel): roughly 5 minutes. The 5 image documents go to a vision-capable model for scanned page processing.

Total wall-clock time: roughly 7 minutes. Sequential processing with Pro: roughly 50 minutes.

Stage 3: Synthesis. The synthesizer (running on Pro, because cross-document reasoning is complex) receives all 100 individual analyses. It produces:

  • A structured index of all documents by type
  • A risk summary flagging the highest-priority items across all documents
  • A cross-reference of key terms (e.g., "the liability cap in the purchase agreement conflicts with the indemnification provision in the services agreement")
  • A list of missing documents (expected document types not found in the data room)

Synthesis takes 60 to 90 seconds. Total pipeline time: under 9 minutes for 100 documents.

How docrew implements routing

docrew's Fly.io proxy implements this three-stage pattern in production.

The router lives in router.ts. Every incoming request passes through a Flash Lite classifier that reads the user's message and categorizes the task as light, medium, heavy, or image. The classification happens in the proxy before the request reaches the worker model, adding roughly 300 milliseconds of latency.

Light tasks route to Gemini Flash Lite -- the fastest, cheapest tier. These are simple questions, classifications, and short extractions. The vast majority of conversational turns fall here.

Medium tasks route to Gemini Flash -- the workhorse tier. Standard document analysis, summarization, and structured extraction.

Heavy tasks route to Gemini Pro -- the reasoning tier. Complex analysis, multi-step logic, nuanced document interpretation.

Image tasks route to Gemini Flash with vision capabilities, processing document images directly alongside text.

The routing is invisible to the user. They send a message and receive a response. They do not choose a model, configure routing rules, or manage tier selection. The proxy handles it automatically.

System prompt caching reduces per-request overhead. The static system prompt -- tool definitions, agent identity, behavioral rules -- is cached on Vertex AI with a 60-minute TTL. A keepalive function in the proxy warms the cache every 55 minutes to prevent expiration. Cache keys are scoped by region and model, so the US Flash cache is independent of the EU Pro cache. This is critical for the multi-model pattern: each worker tier has its own cached system prompt.
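The key scoping and warm interval can be sketched in a few lines; the 60/55-minute figures come from the description above, while the key format itself is an assumption for illustration:

```typescript
// Sketch of region- and model-scoped cache keys plus the warm interval.
// The 60/55-minute figures come from the text; the key format is assumed.
const CACHE_TTL_MINUTES = 60;
const WARM_INTERVAL_MINUTES = 55; // re-warm before the TTL expires

function cacheKey(region: string, model: string): string {
  return `${region}:${model}`; // the US Flash cache is independent of the EU Pro cache
}
```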

The subagent tool enables the worker parallelism stage. When the primary agent identifies a multi-document task, it spawns subagent workers -- each processing one document independently. The subagents run concurrently, and their results stream back via SSE as each one completes. The user sees progressive output: document analyses appearing one by one as workers finish, rather than waiting for all workers to complete before seeing anything.

Per-user regional routing adds another dimension. The proxy reads the user's region setting from their profile and routes to the appropriate Vertex AI endpoint -- us-east1 for US users, europe-west1 for EU users. This happens transparently alongside model routing: a heavy task from an EU user goes to Pro on the EU endpoint.

Design lessons

Building multi-model pipelines teaches several lessons that are not obvious in advance.

The router is cheap insurance. The cost of running a Flash Lite classifier on every request is negligible compared to the savings from correct routing. Do not skip the router to save a fraction of a cent per request.

Bias toward over-routing. A correct result from an expensive model is better than a wrong result from a cheap one. If the router is uncertain, route up, not down. The cost of over-routing is a few extra cents. The cost of under-routing is user trust.

Cache aggressively. System prompts, tool definitions, and model configurations are static across users and sessions. Caching them saves tokens, money, and latency on every request. The caching infrastructure pays for itself within days.

Stream everything. Users tolerate latency when they see progress. A multi-model pipeline that shows the routing decision, worker progress, and partial results as they arrive feels faster than a single-model system that goes silent for two minutes.

Measure routing accuracy. Log the router's classification alongside task outcomes. If medium-routed tasks consistently produce poor results, the classification prompt needs adjustment.

Plan for model evolution. Today's Pro is tomorrow's Flash. A well-designed routing system makes it easy to remap tiers to new models without changing the routing logic.

The multi-model pipeline is not complexity for its own sake. It is the natural engineering response to a simple economic reality: different tasks have different complexity, and paying the same price for all of them is wasteful. Route intelligently, execute appropriately, synthesize carefully. The result is a system that delivers top-tier quality on hard tasks and top-tier efficiency on easy ones -- without asking the user to make any decisions about models at all.
