The AI Agent Stack: Models, Tools, Memory, and Orchestration
Every AI agent system has four layers: model, tools, memory, and orchestration. Understanding the stack helps you evaluate agent products critically.
Four layers, one system
Every AI agent -- from the simplest chatbot wrapper to the most sophisticated autonomous assistant -- is built on the same four-layer architecture. The layers are: model, tools, memory, and orchestration.
Some products expose all four layers. Most hide them. But they are always there, and understanding them is the best way to evaluate what an agent can actually do versus what its marketing says it can do.
The model layer is the intelligence. The tools layer is the capability. The memory layer is the context. The orchestration layer is the control system that ties it all together.
Each layer has distinct responsibilities, failure modes, and cost characteristics. An agent is only as strong as its weakest layer. A brilliant model with no tools is a chatbot. Powerful tools with no orchestration are a collection of scripts. Perfect memory with a weak model is an expensive filing cabinet.
Let's break each one down.
Layer 1: The model
The model is the reasoning engine. It reads input, understands intent, generates plans, and produces output. This is the part most people think of when they hear "AI agent."
But model selection is more nuanced than picking the smartest model available.
Not every task needs the best model. Classifying a document as "invoice" or "contract" does not require the same reasoning power as analyzing a 50-page merger agreement. Using a top-tier model for classification is like hiring a senior attorney to sort mail. It works, but the cost-to-value ratio is absurd.
Model routing is the practice of sending different tasks to different models based on complexity. A lightweight model handles simple tasks: classification, formatting, short summaries. A mid-tier model handles standard analysis: extraction, comparison, moderate reasoning. A top-tier model handles complex work: multi-document synthesis, nuanced legal analysis, strategic recommendations.
The router itself can be a lightweight model. It reads the user's request, estimates complexity, and routes to the appropriate model. This adds a small amount of latency (a fraction of a second) but can reduce costs by 5x to 10x on mixed workloads.
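Stripped down, a router is just a function from a request to a model tier. The sketch below uses keyword heuristics purely for illustration; a production router would use a small classifier model, and the tier names and keywords here are made up, not any product's actual rules.

```python
# Toy complexity router. Tier names and keyword heuristics are
# illustrative; a production router would use a small classifier model.
LITE, STANDARD, PRO = "model-lite", "model-standard", "model-pro"

def route(request: str, doc_count: int = 1) -> str:
    """Estimate task complexity and pick a model tier."""
    text = request.lower()
    if doc_count > 1 or any(k in text for k in ("synthesize", "compare", "negotiate")):
        return PRO       # multi-document or nuanced reasoning
    if any(k in text for k in ("extract", "summarize", "analyze")):
        return STANDARD  # standard single-document analysis
    return LITE          # classification, formatting, short answers

print(route("classify this file as invoice or contract"))  # model-lite
print(route("extract all payment terms"))                  # model-standard
print(route("compare these five drafts", doc_count=5))     # model-pro
```

The structure is what matters: a cheap decision up front that keeps expensive models reserved for the requests that need them.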
docrew implements this with a Flash Lite classifier in the Fly.io proxy. Every request passes through the router, which evaluates the task and selects the appropriate Gemini model -- Flash Lite for light tasks, Flash for standard work, Pro for complex analysis. The user does not see this. They get the right model for the job, every time, without paying top-tier prices for trivial tasks.
Cost optimization at the model layer also includes caching. The system prompt -- the instructions that tell the agent how to behave, what tools it has, and what rules to follow -- is identical for every user and every session. Sending it fresh with every request wastes tokens and money. Caching the system prompt and reusing it across requests reduces the per-request cost significantly.
On Vertex AI, this is explicit context caching: the system prompt is stored as a cached resource with a time-to-live, and requests reference the cached version instead of including the full prompt. The cache needs periodic warming to stay alive, but the cost savings are substantial -- especially for long system prompts with detailed tool definitions.
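The savings are easy to estimate back-of-envelope. Every number below is hypothetical, picked only to show the shape of the arithmetic; real prices and cached-token discounts vary by provider.

```python
# Back-of-envelope prompt-caching savings. All numbers are made up
# for illustration, not a real price sheet.
system_prompt_tokens = 8_000        # long prompt with tool definitions
requests_per_day = 5_000
price_per_million_tokens = 0.30     # USD per million input tokens
cached_token_discount = 0.75        # cached tokens billed at 25% of list

uncached = system_prompt_tokens * requests_per_day * price_per_million_tokens / 1e6
cached = uncached * (1 - cached_token_discount)
print(f"uncached: ${uncached:.2f}/day, cached: ${cached:.2f}/day")
# uncached: $12.00/day, cached: $3.00/day
```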
Regional routing adds another dimension. Data residency requirements mean that a European user's requests might need to be processed by a model endpoint in Europe, while a US user's requests go to a US endpoint. The model layer must support per-user routing based on regulatory requirements, not just performance optimization.
The model layer is not just "which model" -- it is which model, for which task, in which region, at what cost, with what caching strategy.
Layer 2: The tools
An AI agent without tools is a language model with a system prompt. It can reason, but it cannot act. The tools layer is what separates agents from chatbots.
Tools are functions the agent can call to interact with the world. Each tool has a defined interface: input parameters, expected output, and side effects. The agent decides which tools to call, in what order, with what parameters.
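Concretely, a tool interface is usually declared in the JSON-Schema style that most function-calling APIs share. The tool name and fields below are illustrative:

```python
# A tool interface in the JSON-Schema style common to function-calling
# APIs. The tool name and parameter names are illustrative.
read_file_tool = {
    "name": "read_file",
    "description": "Read a UTF-8 text file from the project workspace.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Path relative to the workspace root.",
            },
        },
        "required": ["path"],
    },
}
```

The model never executes anything itself: it emits a call such as `{"name": "read_file", "args": {"path": "..."}}`, and the orchestration layer runs the matching function and returns the result.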
The tools available to an agent define its capability surface. An agent with file read and write tools can process documents. Add a shell execution tool and it can run code. Add an HTTP client and it can interact with APIs. Add document parsers and it can understand complex file formats natively.
File operations are the foundation. Read, write, list, search. These seem basic, but they are what enable an agent to work with your actual data rather than text you paste into a chat window. An agent that can list a directory, read each file, and write output is fundamentally more capable than one that can only process text you provide.
Code execution is the multiplier. When an agent can write and run Python scripts in a sandboxed environment, its capability surface expands dramatically. Data transformation, statistical analysis, chart generation, format conversion -- anything expressible in code becomes something the agent can do. The sandbox is critical: the code runs in an isolated environment where it cannot access the network, modify the system, or read files outside the project folder.
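A toy version of the execution harness looks like the sketch below. It only limits wall-clock time and the working directory; a real sandbox needs OS-level isolation (containers, seccomp, or platform sandbox APIs) to actually block network and filesystem access.

```python
import os
import subprocess
import tempfile

# Toy execution harness. This only limits wall-clock time and the
# working directory; real sandboxing requires OS-level isolation.
def run_snippet(code: str, workdir: str, timeout_s: float = 10.0) -> str:
    result = subprocess.run(
        ["python3", "-c", code],
        cwd=workdir,                    # confine relative paths to the project
        capture_output=True,
        text=True,
        timeout=timeout_s,              # kill runaway scripts
        env={"PATH": os.environ.get("PATH", "")},  # minimal environment
    )
    return result.stdout

with tempfile.TemporaryDirectory() as project:
    print(run_snippet("print(2 + 2)", project), end="")  # 4
```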
Document parsers are specialized tools for structured file formats. A DOCX file is not plain text -- it is a ZIP archive containing XML files that define paragraphs, tables, styles, headers, footers, and embedded objects. A purpose-built DOCX parser extracts the semantic structure of the document: headings with their hierarchy, tables with their rows and columns, numbered lists with their nesting. Similarly, an XLSX parser understands worksheets, cell types, formulas, and shared strings.
These parsers are distinct from general-purpose file reading. Reading a DOCX file as raw bytes gives you XML noise. Parsing it with a proper OOXML implementation gives you the document as a human would read it, complete with structural information that the language model can reason about.
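The difference is easy to demonstrate with the standard library alone. This minimal reader recovers only paragraph text; a real OOXML parser also handles tables, styles, numbered lists, headers, and footers.

```python
import xml.etree.ElementTree as ET
import zipfile

# A DOCX is a ZIP archive; paragraph text lives in <w:t> runs inside
# word/document.xml. This minimal reader recovers paragraph text only.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(path: str) -> list[str]:
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{W}p"):                          # each <w:p> paragraph
        text = "".join(t.text or "" for t in p.iter(f"{W}t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Even this toy version returns readable paragraphs where raw byte reading returns ZIP noise; the gap only widens for tables and nested lists.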
HTTP clients let the agent interact with external services. API calls, webhook triggers, data retrieval from web endpoints. Combined with connector frameworks that manage authentication for third-party services, this extends the agent's reach beyond the local machine.
The quality of the tools layer determines what the agent can actually do. Weak tools -- file reading without parsing, code execution without sandboxing, no HTTP access -- limit the agent to simple tasks. Strong tools with proper security boundaries enable complex, multi-step workflows.
Layer 3: The memory
Memory is the context the agent brings to each interaction. Without memory, every conversation starts from zero. With memory, the agent builds understanding over time.
Memory operates at three scales:
Conversation context is the immediate memory. The messages exchanged in the current session. What the user asked, what the agent did, what the results were. This is the minimum viable memory -- even a simple chatbot has this.
But conversation context has limits. Language models have finite context windows. A conversation that includes twenty documents worth of extracted text, dozens of tool calls and their results, and extensive back-and-forth can exceed the context window. When this happens, early parts of the conversation are truncated or summarized, and the agent loses access to information discussed earlier in the session.
Managing conversation context effectively means being intelligent about what stays in context and what gets summarized or dropped. Not every tool call result needs to remain in full. Not every intermediate step matters for the next turn. Context management is a quiet but critical engineering challenge.
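One deliberately naive strategy: pin the system prompt and keep the newest turns that fit a token budget. The chars/4 token estimate below is a rough stand-in for a real tokenizer, and real systems summarize what they drop rather than discarding it outright.

```python
# Naive context trimming: pin the system prompt, keep the newest turns
# that fit the budget. chars/4 is a crude token estimate.
def estimate_tokens(message: dict) -> int:
    return len(message["content"]) // 4

def trim_context(messages: list[dict], budget: int) -> list[dict]:
    system, turns = messages[0], messages[1:]
    kept, used = [], estimate_tokens(system)
    for msg in reversed(turns):          # walk newest-first
        used += estimate_tokens(msg)
        if used > budget:
            break                        # everything older is dropped
        kept.append(msg)
    return [system] + list(reversed(kept))
```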
Project knowledge is the medium-term memory. Information about the user's workspace: which files exist, what the project structure looks like, what was analyzed previously. This helps the agent orient itself without re-reading every file at the start of each session.
In docrew, project knowledge lives in memory files within the workspace. The agent can read and update these files across sessions, building a persistent understanding of the project. When you start a new conversation about the same set of contracts, the agent already knows the file structure, the document types, and the key findings from previous analyses.
Cross-session memory is the long-term memory. Patterns and preferences that persist across projects and conversations. The user prefers tables over prose. The user works with European contracts that use specific legal terminology. The user always wants dates in ISO format.
This type of memory is the least developed across the industry. Most agent products treat each conversation as independent. The few that implement cross-session memory often do it poorly -- storing everything indiscriminately, which makes the memory noisy and eventually counterproductive.
Good long-term memory is selective. It captures preferences, patterns, and corrections. It does not store every fact from every conversation. The challenge is deciding what is worth remembering, and that decision itself requires intelligence.
Layer 4: The orchestration
Orchestration is the control system. It decides what happens, in what order, and with what constraints. If the model is the brain, the tools are the hands, and the memory is the filing cabinet, then orchestration is the nervous system that connects them.
The core of orchestration is the tool-use loop. The agent receives input, reasons about it, decides to call a tool, receives the tool's output, reasons again, possibly calls another tool, and continues until the task is complete or a stopping condition is met.
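Stripped to its skeleton, the loop looks like this; `model` and `tools` are placeholders for a real LLM client and a registry of callable tools.

```python
# Skeleton of the tool-use loop. `model` and `tools` stand in for a
# real LLM client and a registry of callable tools.
def agent_loop(model, tools, messages, max_tool_calls=25):
    for _ in range(max_tool_calls):              # hard stop on runaways
        reply = model(messages)                  # reason over current context
        if reply["tool"] is None:
            return reply["text"]                 # stopping condition: final answer
        result = tools[reply["tool"]](**reply["args"])              # act
        messages.append({"role": "tool", "content": str(result)})   # observe
    raise RuntimeError("maximum tool calls reached")
```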
This loop sounds simple. It is not.
Error handling. Tools fail. Files are missing. Code throws exceptions. API calls return errors. The agent needs to handle failures gracefully: retry where appropriate, adapt the plan when a tool is unavailable, and communicate clearly when a task cannot be completed.
Safety constraints prevent the agent from running up costs or getting stuck. A cost ceiling limits the total spend on language model tokens per session. A maximum tool call limit prevents infinite loops. Loop detection identifies when the agent is repeating the same action without making progress.
These constraints are essential. Without a cost ceiling, a runaway agent analyzing a massive document set could consume hundreds of dollars in model tokens before anyone notices. Without loop detection, an agent that encounters a persistent error might retry the same failed operation indefinitely. Without a maximum tool call limit, a poorly defined task could generate thousands of intermediate steps.
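All three constraints fit in one small guard object that the loop consults after every tool call. The thresholds below are made up for illustration.

```python
# Illustrative safety guard; all thresholds are made up.
class SafetyGuard:
    def __init__(self, max_cost_usd=5.0, max_tool_calls=50, repeat_window=3):
        self.cost_usd = 0.0
        self.tool_calls = 0
        self.recent = []                 # signatures of the most recent calls
        self.max_cost_usd = max_cost_usd
        self.max_tool_calls = max_tool_calls
        self.repeat_window = repeat_window

    def record(self, call_signature: str, cost_usd: float) -> None:
        self.cost_usd += cost_usd
        self.tool_calls += 1
        self.recent = (self.recent + [call_signature])[-self.repeat_window:]
        if self.cost_usd > self.max_cost_usd:
            raise RuntimeError("cost ceiling exceeded")
        if self.tool_calls > self.max_tool_calls:
            raise RuntimeError("tool-call limit exceeded")
        if len(self.recent) == self.repeat_window and len(set(self.recent)) == 1:
            raise RuntimeError("loop detected: identical call repeated")
```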
Coordination becomes important when the agent manages multiple parallel tasks. The subagent pattern -- spawning worker agents for independent subtasks -- requires the orchestration layer to manage concurrency, collect results, and handle partial failures. If one subagent fails while four others succeed, the orchestrator must decide whether to retry, skip, or abort.
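The fan-out/fan-in shape can be sketched with a thread pool. `run_subagent` here stands in for a real worker agent; the point is that failures are collected rather than aborting the whole batch, leaving the retry-skip-abort decision to the orchestrator.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Fan-out/fan-in sketch for subagents. `run_subagent` stands in for a
# real worker agent.
def delegate(run_subagent, tasks, max_workers=4):
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_subagent, task): task for task in tasks}
        for future in as_completed(futures):
            task = futures[future]
            try:
                results[task] = future.result()
            except Exception as exc:     # collect failures, keep the batch going
                failures[task] = str(exc)
    return results, failures             # caller retries, skips, or aborts
```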
Streaming is the user-facing aspect of orchestration. The agent's reasoning and tool calls happen in real time, and the user needs to see progress. Streaming the agent's thought process, tool invocations, and partial results provides transparency and builds trust. A silent agent that disappears for two minutes and then dumps a wall of text is a worse experience than one that shows you each step as it happens.
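The wire format itself is simple; what matters is emitting events as they happen rather than at the end. The event types and payloads below are illustrative, not any product's actual schema.

```python
import json

# Server-sent-events framing: one `data:` line per event, terminated
# by a blank line. Event types and payloads are illustrative.
def sse_frames(events):
    for event in events:
        yield f"data: {json.dumps(event)}\n\n"

for frame in sse_frames([
    {"type": "thought", "text": "Scanning the contracts folder"},
    {"type": "tool_call", "name": "list_files", "args": {"path": "."}},
    {"type": "tool_result", "name": "list_files", "preview": "12 files"},
]):
    print(frame, end="")
```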
docrew's orchestration layer is a Rust implementation compiled directly into the desktop application. The tool-use loop runs natively, with per-session state management, SSE streaming from the language model, and safety constraints enforced at the runtime level. The choice of Rust is deliberate: the orchestration layer needs to be fast, reliable, and memory-efficient, especially when managing multiple concurrent sessions.
How the layers interact
The four layers are not independent. They form a feedback loop.
The model reasons about the task and decides which tools to call. The tools execute and return results. The results are added to memory (conversation context). The model reasons about the updated context and decides the next action. The orchestration layer manages this cycle, enforcing constraints and handling errors.
The quality of each layer affects the others. A weak model makes poor tool selection decisions, which produces bad results, which pollutes the memory, which leads to worse decisions. A strong model with unreliable tools produces inconsistent results that erode trust. Perfect tools with no memory make the agent repeat work because it forgets what it already did.
This interdependence is why evaluating an agent product requires looking at all four layers, not just the model. "We use GPT-4" or "powered by Gemini" tells you about one layer. It tells you nothing about the tools, the memory architecture, or the orchestration quality.
Evaluating agent products
When you assess an AI agent product, ask these questions about each layer:
Model layer:
- Does it use a single model for everything, or does it route tasks to appropriate models?
- Is there a caching strategy for the system prompt?
- What are the per-request costs, and are they proportional to task complexity?
- Does it support regional routing for data residency?
Tools layer:
- What can the agent actually do beyond generating text?
- Can it read and write files? Which formats?
- Can it execute code? Is the execution sandboxed?
- Can it interact with external services?
- Are the tools well-implemented, or do they fail on edge cases?
Memory layer:
- Does the agent maintain context within a conversation?
- Does it remember anything across sessions?
- How does it handle context window limits?
- Is the memory selective or does it store everything indiscriminately?
Orchestration layer:
- What safety constraints exist? (Cost ceiling, tool call limits, loop detection)
- How does it handle errors? Does it retry, adapt, or just fail?
- Does it support parallel execution?
- Does it stream progress, or does it go silent until completion?
- Is the orchestration a thin wrapper around API calls, or a purpose-built runtime?
Products that excel at one layer but neglect others will feel impressive in demos and frustrating in practice. A powerful model behind a janky orchestration layer produces unreliable results. Excellent tools without proper safety constraints produce expensive surprises. Strong memory without good tooling means the agent remembers everything but can do nothing.
The stack in practice
Here is how docrew implements each layer, as a concrete reference point:
Model: Gemini via Vertex AI, with three tiers (Flash Lite, Flash, Pro) and an automatic router. Explicit context caching on the system prompt. Per-user regional routing (US or EU endpoints) based on the user's profile setting. The user never chooses a model manually -- the router handles it.
Tools: Local file operations (read, write, list, search), Python execution in an OS-level sandbox, DOCX and XLSX parsers built in Rust for zero-dependency document understanding, an HTTP client for API interactions, a subagent tool for parallel task delegation, and connectors for third-party service integration. All tools run on the user's machine. Files never leave the device.
Memory: Per-session conversation context with context management as conversations grow. Project memory files that persist across sessions within a workspace. Cross-session state for patterns the agent observes over time.
Orchestration: A custom Rust runtime compiled into the desktop application. The tool-use loop runs natively with SSE streaming. Safety includes cost ceilings, maximum tool call limits, and loop detection. The runtime supports multiple concurrent sessions (parallel chat panes) with per-session isolation. Each session has its own state, its own tool call history, and its own safety counters.
This is not the only way to build an agent stack. But it is a complete one, and it illustrates what each layer looks like when it is purpose-built rather than bolted on.
Why the stack matters
The AI agent space is crowded with products that wrap a language model API, add a file upload feature, and call it an agent. These products work for simple tasks. They break on complex ones.
Understanding the four-layer stack gives you a framework to see through the marketing. When a product says "AI agent," ask what tools it has. When it says "intelligent assistant," ask about its memory architecture. When it says "autonomous," ask about its safety constraints.
The stack is not theoretical. It is the actual engineering that determines whether an agent can reliably process your twenty contracts, execute a multi-step analysis, remember your preferences, and do it all without running up a four-figure bill on model tokens.
Every serious agent product is making decisions at each layer. The question is whether those decisions are deliberate and well-engineered, or accidental and fragile.
The model gets the headlines. The orchestration does the work. The tools define the capability. The memory shapes the experience. All four layers matter, and now you know how to evaluate them.