How docrew's Agent Architecture Works: Models, Tools, Memory

A technical deep-dive into docrew's agent architecture -- how a custom Rust runtime, model routing, tool-use loops, and OS-level sandboxing work together to process documents locally.


Why document agents are not chatbots

A chatbot takes text in and produces text out. You paste a paragraph, it summarizes it. You ask a question, it answers. The interaction is stateless, single-turn, and bounded by what you put in the prompt.

A document agent operates differently. It receives a goal -- "extract all payment terms from these contracts" or "compare these two versions and list the changes" -- and then executes a multi-step workflow to accomplish it. It reads files from your disk. It decides which files are relevant. It processes them in sequence or in parallel. It writes output. It handles errors along the way.

The difference is not just capability. It is architectural. A chatbot needs a model and an API. A document agent needs a model, a tool system, a memory layer, an orchestration loop, safety constraints, and a sandboxed execution environment. Each of these components has its own engineering challenges, and the quality of the overall system depends on how well they work together.

This article explains how docrew's agent architecture works -- the actual technical decisions, the trade-offs, and why they matter for document processing.

Three pillars: model, tools, memory

Every agent system rests on three pillars. The model provides reasoning. The tools provide capability. The memory provides context. Remove any one and the system degrades.

The model is the reasoning engine. It reads the user's request, generates a plan, and decides which tools to invoke. It interprets tool results and decides what to do next. The model cannot act on its own -- it can only think and request actions.

The tools are the hands. They interact with the real world: reading files, writing files, searching directories, running code, parsing document formats. Without tools, the model is a text generator with no ability to touch your data.

The memory is the context. Conversation history, workspace state, persistent knowledge across sessions. Memory allows the agent to reason about its own progress, avoid redundant work, and maintain coherence across long tasks.

These pillars are not independent. The model uses memory to decide which tools to call. Tool results are added to memory. A weak link in any pillar degrades the entire system.

The tool-use loop

The core of docrew's agent is the tool-use loop. This is the orchestration mechanism that ties the three pillars together.

The user sends a message. The model receives it along with the current memory. It reasons about the task and generates a response that may include tool calls -- file_list to see what is in a folder, file_read to read a contract. The runtime executes the tool calls and appends results to memory. The model reasons again with the updated context. This cycle repeats until the model determines the task is complete and produces a final response without tool calls.

Every task, from a simple file read to a complex multi-document analysis, is executed through this cycle.
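
The cycle can be sketched in a few lines. This is a Python illustration of the control flow only; docrew's actual runtime is Rust, and `model_step` here is a stub standing in for a real model call:

```python
# Minimal sketch of the tool-use loop. The model "thinks" (model_step),
# the runtime executes any requested tool, and the result is appended
# to memory before the next turn.
def model_step(memory):
    """Stub for a model call: returns ('tool', name, args) or ('final', answer)."""
    if any("tool_result" in m for m in memory):
        return ("final", "done")
    return ("tool", "file_read", "contract.docx")

def execute_tool(name, args):
    # Stand-in for local tool execution by the runtime.
    return f"tool_result({name}, {args})"

def run_agent(user_message):
    memory = [user_message]
    while True:
        step = model_step(memory)
        if step[0] == "tool":
            _, name, args = step
            memory.append(execute_tool(name, args))  # result goes back into memory
        else:
            return step[1]  # final response without tool calls ends the loop
```

The loop terminates exactly as described above: the model keeps requesting tools until it produces a response with no tool calls.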

The loop is implemented in Rust as part of docrew's custom agent runtime, compiled directly into the Tauri desktop application. No Node.js process, no Python interpreter, no sidecar binary. Native code running at native speed.

Why Rust? Three reasons.

  • Performance: the loop needs to manage concurrent operations, stream data, and handle multiple sessions without blocking (Tokio async runtime).
  • Reliability: memory safety without garbage collection means no unexpected pauses during long-running tasks.
  • Integration: the agent runtime is part of the same binary as the desktop application, with direct access to the file system, system keychain, and OS-level sandboxing APIs.

What tools the agent has

The tools available to the agent define what it can actually do. docrew's tool set is designed for document work:

File operations form the foundation. file_read reads any file within the workspace. file_write creates or updates files. file_list shows directory contents with metadata. file_grep searches file contents across the workspace using pattern matching. These four tools give the agent the ability to navigate a file system, find relevant documents, and produce output.
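
A minimal Python sketch of how two of these tools could be implemented, including the workspace-containment check a runtime would enforce. The function names mirror the article; the real tools are native Rust code, and this containment logic is an assumption:

```python
import os

def _resolve(workspace, rel_path):
    # Resolve the path and refuse anything that escapes the workspace root.
    root = os.path.realpath(workspace)
    target = os.path.realpath(os.path.join(root, rel_path))
    if target != root and not target.startswith(root + os.sep):
        raise PermissionError("path escapes the workspace")
    return target

def file_list(workspace, rel_path="."):
    """Directory contents, sorted for determinism."""
    return sorted(os.listdir(_resolve(workspace, rel_path)))

def file_read(workspace, rel_path):
    """Contents of a file inside the workspace."""
    with open(_resolve(workspace, rel_path), encoding="utf-8") as f:
        return f.read()
```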

Document parsers handle structured formats. docx_read parses Microsoft Word documents, extracting the semantic structure: headings, paragraphs, tables, numbered lists, styles. It is not a text dump -- it preserves the document hierarchy so the model can reason about sections and structure. xlsx_read does the same for Excel spreadsheets: worksheets, cell values, formulas, shared strings.

Both parsers are implemented in Rust as part of the agent binary. They parse the OOXML format directly (DOCX and XLSX are ZIP archives containing XML), handling namespaces, styles, numbering, and rendering. There are no external dependencies, no Python libraries, no system commands. The parser is the tool.
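
The format itself is easy to demonstrate. The main part of a DOCX is word/document.xml in the WordprocessingML namespace; the sketch below extracts paragraph text from that XML. It is illustrative only: the real parser is Rust and also handles tables, styles, and numbering:

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def paragraphs(document_xml):
    """Return the text of each w:p paragraph, joining its w:t text runs."""
    root = ET.fromstring(document_xml)
    return ["".join(t.text or "" for t in p.iter(f"{W}t"))
            for p in root.iter(f"{W}p")]
```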

Code execution extends the agent's capabilities. shell runs shell commands in a sandboxed environment. python executes Python scripts in the same sandbox. These tools let the agent perform computations, data transformations, and format conversions that would be impractical in pure language model reasoning. Need to calculate the sum of a column in a spreadsheet? The agent writes a quick Python script and runs it.

The subagent tool enables parallel processing. The agent can spawn worker agents for independent subtasks. If the task is "summarize each of these ten documents," the orchestration layer can delegate each summary to a subagent, collect the results, and synthesize a final output. This is particularly valuable for large document sets where processing sequentially would be slow.

The connector tool integrates with external services through a connector framework that manages OAuth authentication and token lifecycle. The agent can interact with third-party tools without handling auth flows directly. An HTTP client allows direct API interactions for simpler cases.

Each tool has a defined input schema and output format. The model sees tool definitions as part of its system prompt and learns to use them from the descriptions and parameter specifications -- no fine-tuning required.
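
A tool definition in this style might look like the following. This is a hedged sketch using a generic function-declaration shape; the field names are illustrative, not docrew's exact schema:

```python
# Hypothetical declaration for the file_read tool, in the JSON-schema
# style commonly used for model function calling.
FILE_READ_TOOL = {
    "name": "file_read",
    "description": "Read a file from the workspace and return its contents.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Path relative to the workspace root",
            },
        },
        "required": ["path"],
    },
}
```

Definitions like this are rendered into the system prompt, which is how the model learns each tool's name, purpose, and parameters without fine-tuning.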

Model routing: right model for the right task

Not every request needs the same model. Classifying a document as "invoice" or "contract" is a lightweight task. Analyzing a 100-page merger agreement and identifying non-standard clauses is heavy. Using a top-tier model for classification wastes money. Using a lightweight model for complex analysis produces poor results.

docrew solves this with a model router. Every request passes through a Flash Lite classifier running on the Fly.io proxy. The classifier evaluates the user's request and the conversation context, then routes the request to the appropriate Gemini model:

  • Flash Lite handles light tasks: classification, formatting, simple questions, short summaries.
  • Flash handles standard work: document analysis, extraction, comparison, moderate reasoning.
  • Pro handles complex tasks: multi-document synthesis, nuanced analysis, strategic recommendations.

The routing decision adds a fraction of a second of latency but can reduce costs by 5x to 10x on mixed workloads. The user never sees the routing. They describe their task and get the appropriate model automatically.
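
The routing logic reduces to a classifier that maps each request to one of three tiers. A toy sketch of that mapping (the real classifier is itself a Flash Lite model call, not keyword rules, and the tier labels here are shorthand):

```python
# Illustrative three-tier router. Keyword sets stand in for a real
# model-based classifier.
LIGHT = {"classify", "label", "format"}
HEAVY = {"synthesize", "strategy", "merger"}

def route(request):
    words = set(request.lower().split())
    if words & HEAVY:
        return "pro"         # complex: multi-document synthesis, nuanced analysis
    if words & LIGHT:
        return "flash-lite"  # light: classification, formatting, short summaries
    return "flash"           # standard: analysis, extraction, comparison
```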

This routing is complemented by system prompt caching. The system prompt -- which includes all tool definitions, agent identity, behavioral rules, and model guidance -- is identical for every user and every session. Sending it fresh with every request wastes tokens. On Vertex AI, docrew uses explicit context caching: the static system prompt is stored as a cached resource with a 60-minute time-to-live. Requests reference the cached version instead of including the full prompt.

To keep the cache alive, a warmup function runs every 55 minutes, refreshing the cached content before it expires. This ensures that even during low-traffic periods, the first user request does not pay the full cache-write cost.

Regional routing adds another dimension. A European user's requests are processed by Vertex AI endpoints in Europe. A US user's requests go to US endpoints. The model layer supports per-user routing based on the user's data residency preference, stored in their profile. Cache keys include the region, so each region maintains its own cached system prompt.
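
Putting the caching pieces together: a sketch of a region-scoped cache key plus the warmup check, using the 60-minute TTL and 55-minute refresh interval described above. The key shape is an assumption for illustration:

```python
import time

TTL_SECONDS = 60 * 60          # explicit context cache time-to-live
WARMUP_INTERVAL = 55 * 60      # refresh before the TTL expires

def cache_key(region, model, prompt_version):
    # Each region keeps its own cached system prompt, so the region
    # is part of the key alongside the model and prompt version.
    return f"{region}:{model}:system-prompt:{prompt_version}"

def needs_refresh(created_at, now=None):
    # True once the warmup window has elapsed, so the cache is
    # rewritten before any request would pay the cache-write cost.
    now = time.time() if now is None else now
    return now - created_at >= WARMUP_INTERVAL
```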

Safety: cost ceilings, limits, loop detection

An autonomous agent that can read files, write files, and run code needs guardrails. Without them, a poorly specified task or an unexpected edge case could result in runaway costs, infinite loops, or unintended actions.

docrew's safety layer operates at three levels:

Cost ceiling limits the total spend on language model tokens per session. Every model call has a token cost. The safety layer tracks cumulative cost and terminates the session if it exceeds the ceiling. This prevents a runaway agent from consuming hundreds of dollars in tokens on a single task. The ceiling is set high enough for legitimate complex tasks but low enough to catch genuine runaway scenarios.

Maximum tool call limit caps the number of tool invocations per session. This is a separate constraint from cost because some tool calls are cheap (file reads) but could still indicate a problem if the agent is calling them hundreds of times. The limit catches the state where the agent keeps issuing tool calls without getting any closer to completing the task.

Loop detection identifies when the agent is repeating the same actions without making progress. If the agent reads the same file three times in a row, or generates the same tool call sequence repeatedly, the loop detector intervenes. This catches a common failure mode where the model gets stuck in a reasoning loop -- it calls a tool, gets a result, forgets it called the tool, and calls it again.

These three safety mechanisms work together. A legitimate complex task might use many tool calls (within the limit) and significant tokens (within the ceiling) without triggering loop detection. A runaway task will typically hit one of the three constraints before causing real damage.

The safety layer is implemented in Rust as part of the agent runtime. The constraints are enforced at the tool execution level -- before any tool call is executed, the safety layer checks whether it is permitted. This is not advisory. The tool call is blocked if the constraint is violated.
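
A compact sketch of the three guards as a single pre-execution check. Thresholds and return values are illustrative; the real enforcement lives in the Rust runtime:

```python
class SafetyGuard:
    """Cost ceiling, tool-call cap, and a simple repeat-based loop detector."""

    def __init__(self, cost_ceiling=10.0, max_tool_calls=200, max_repeats=3):
        self.cost_ceiling = cost_ceiling
        self.max_tool_calls = max_tool_calls
        self.max_repeats = max_repeats
        self.cost = 0.0
        self.calls = []

    def check(self, tool_call, call_cost):
        # Runs before every tool execution; a violation blocks the call.
        if self.cost + call_cost > self.cost_ceiling:
            return "blocked: cost ceiling"
        if len(self.calls) >= self.max_tool_calls:
            return "blocked: tool call limit"
        recent = self.calls[-(self.max_repeats - 1):]
        if len(recent) == self.max_repeats - 1 and all(c == tool_call for c in recent):
            return "blocked: loop detected"
        self.cost += call_cost
        self.calls.append(tool_call)
        return "ok"
```

With max_repeats=3, the third identical call in a row is blocked, which is the "reads the same file three times" case from above.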

Sandboxing: local execution done safely

An agent that can run shell commands and Python scripts on your machine needs to be sandboxed. The sandbox defines what the agent's code can and cannot do.

docrew uses OS-level sandboxing, not application-level restrictions. This is an important distinction. Application-level sandboxing (filtering commands, blocking certain system calls in user space) can be bypassed. OS-level sandboxing (using the operating system's own isolation mechanisms) cannot be circumvented by the sandboxed process.

On macOS, docrew uses Apple's Seatbelt framework. Seatbelt profiles define exactly which file paths the sandboxed process can access, whether it can use the network, and which system calls are permitted. The sandbox profile restricts access to the workspace directory and denies network access by default. A Python script spawned by the agent can read and write files within the project folder but cannot access the internet, read your home directory, or call system APIs outside the permitted set.

On Linux, docrew uses bubblewrap (bwrap), the same containerization tool used by Flatpak. bwrap creates a lightweight namespace container with a controlled filesystem view. The agent's processes see only the workspace directory and essential system libraries. Network access is denied by default. The isolation is at the kernel level.

Network isolation is the default state. The sandbox does not allow network access unless explicitly enabled via a configuration flag (DOCREW_SANDBOX_ALLOW_NETWORK). This means that even if the agent writes and executes a script that tries to exfiltrate data, the network request will fail at the OS level. The script has no way around it -- the sandbox is enforced by the kernel, not by the application.
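
On Linux, this posture maps directly onto bwrap flags. The sketch below builds the argument vector without executing anything; the flags are real bwrap options, but the exact set docrew uses is an assumption:

```python
def bwrap_argv(workspace, command):
    """Construct a bubblewrap invocation: workspace read-write,
    system libraries read-only, no network."""
    return [
        "bwrap",
        "--ro-bind", "/usr", "/usr",      # system libraries, read-only
        "--bind", workspace, workspace,   # workspace directory, read-write
        "--unshare-net",                  # network denied at the kernel level
        "--die-with-parent",              # sandbox dies with the agent process
        *command,
    ]
```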

This sandboxing approach is a direct consequence of the desktop architecture. A cloud-based agent runs code in a server-side container where the vendor controls the environment. A desktop agent runs code on the user's machine, where the consequences of a sandbox escape are far more serious. OS-level sandboxing is the only approach that provides genuine security guarantees for local code execution.

How streaming works

Document processing tasks are not instant. An agent analyzing a folder of contracts might run for 30 seconds to several minutes, making dozens of tool calls along the way. During that time, the user needs feedback. A silent application that freezes for two minutes is a bad experience.

docrew streams the agent's activity in real time using Server-Sent Events (SSE). The data flow works like this:

The desktop agent sends a request to the Fly.io proxy. The proxy forwards it to Vertex AI. Vertex AI streams the model's response token by token via SSE. The proxy relays the SSE stream back to the desktop client. The desktop client renders each token as it arrives.

When the model generates a tool call, the stream includes the tool call specification. The desktop runtime executes the tool locally and sends the result back to the proxy for the next model turn. The process repeats, with each model response streamed in real time.
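
On the wire, an SSE stream is just data: lines separated by blank lines. A minimal parser for that framing, as a client consuming the proxied stream might use (illustrative, not docrew's client code):

```python
def parse_sse(stream_text):
    """Split an SSE stream into events: consecutive 'data:' lines
    joined together, with a blank line terminating each event."""
    events, data_lines = [], []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            data_lines.append(line[len("data:"):].lstrip())
        elif line == "" and data_lines:
            events.append("\n".join(data_lines))
            data_lines = []
    if data_lines:  # stream ended without a trailing blank line
        events.append("\n".join(data_lines))
    return events
```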

From the user's perspective, they see the agent thinking in real time. They see it decide to read a file, see the file being read, see the agent reason about the contents, see it decide to read another file. This transparency builds trust and lets the user interrupt if the agent is going in the wrong direction.

The streaming architecture also handles multi-session scenarios. The desktop application supports multiple chat panes, each running its own agent session. Each session has its own SSE connection, its own state, and its own safety counters. Sessions are fully isolated -- one session's activity does not affect another's. A long-running analysis in one pane does not block a quick question in another pane.

Why desktop matters

There is a reason docrew runs as a desktop application with a native agent runtime, rather than a web application with a cloud backend.

Direct file system access. The agent reads your documents from disk. It navigates your folder structure. It writes output files where you want them. There is no upload step, no file size limit imposed by a web API, no conversion step where your document is transformed into a cloud-friendly format. The agent works with your files as they are, where they are.

Privacy by architecture. Your files never leave your machine. The agent sends only extracted text and reasoning to the language model. The documents themselves stay on your disk. This is not a policy promise -- it is an architectural guarantee. Native code reads the files locally, extracts text, and sends that text to the model. The original files are never transmitted.

Native performance. File operations, document parsing, and tool execution happen at native speed. No serialization overhead from crossing a network boundary. A file read takes microseconds, not the hundreds of milliseconds a cloud API requires.

OS integration. Sandbox profiles use the OS's own isolation mechanisms. Credentials are stored in the system keychain. The application runs as a native process with proper lifecycle management.

Offline capability. Except for language model calls, the agent's tool execution is entirely local. File reads, document parsing, and code execution all work without internet. If the network drops, tool execution continues -- only the model reasoning pauses.

These are not features you can replicate in a browser tab.

How the pieces fit together

A concrete example ties the architecture together. A user asks: "Compare the payment terms in contract-v1.docx and contract-v2.docx."

The request goes to the Fly.io proxy. The Flash Lite classifier evaluates the task as medium complexity and routes it to the Flash model. The system prompt is loaded from Vertex AI's cache. The model receives the request and the system prompt, including all tool definitions.

The model generates its first response: a file_read call for contract-v1.docx. The Rust runtime intercepts the tool call and executes it locally. Because the file is a DOCX, the runtime uses the built-in OOXML parser to extract the semantic content -- headings, paragraphs, tables, numbered lists -- and returns the structured text to the model.

The tool result is added to memory. The model reasons about the content, identifies the payment terms section, and generates a second tool call: file_read for contract-v2.docx. The same parsing process happens. Now the model has both documents in its context.

The model compares the payment terms, identifies differences (changed payment deadlines, new late payment penalties, modified discount terms), and generates a structured comparison. The safety layer has tracked two tool calls and the associated token cost -- well within limits. No loop detection triggered because each tool call was unique.

The entire interaction is streamed via SSE. The user sees the agent read the first file, then the second, then produce the comparison. Total time: 10 to 20 seconds, depending on document length and model latency.

This is a simple example. The architecture handles far more complex scenarios: processing entire folders, spawning subagents for parallel analysis, writing output files, running Python scripts for data transformation. The same loop, the same safety constraints, the same sandboxing. The complexity of the task scales within the same architectural framework.

Trade-offs and what they cost

Every architectural decision has trade-offs. docrew's architecture optimizes for privacy, performance, and document processing capability. Here is what that costs:

Desktop-only agent execution. The full agent runtime runs only on the desktop application. Mobile communicates with the agent via the proxy, but the heaviest processing happens on desktop.

Rust complexity. A Rust agent runtime is harder to develop and iterate on than Python or Node.js. Adding a new tool requires writing Rust code, handling error types, and ensuring memory safety. The payoff is performance and reliability, but the development velocity trade-off is real.

Model dependency. The agent's reasoning depends on Vertex AI availability. If Vertex AI is slow or down, the reasoning loop pauses. Local tool execution continues, but the agent cannot think without the model.

Cache management. System prompt caching saves significant cost but adds operational complexity. Cache keys must account for regions and models. A cache miss means higher latency and cost.

These are deliberate trade-offs. The architecture prioritizes file privacy, processing speed, format support, and execution safety -- and accepts costs in other areas.

Where this is heading

The agent architecture is not static. Richer tool sets, smarter routing, and improved memory will expand what the agent can handle. But the core architecture -- Rust runtime, tool-use loop, model routing, OS-level sandboxing, SSE streaming -- is designed to support these improvements without fundamental redesign. The pillars remain the same. The capabilities built on top of them grow.
