From Single Prompt to Multi-Step Workflow: How AI Agents Work
How AI agents decompose tasks into steps, the tool-use loop that drives execution, and why real work requires multi-step workflows.
The anatomy of a single prompt
Open any AI chat interface. Type a prompt. Get a response. The entire interaction fits on one screen.
Input goes in. Output comes out. The model considers your prompt, generates text token by token, and delivers a response. This is the interaction model that hundreds of millions of people have learned over the past few years. It is intuitive, immediate, and deeply limited.
A single prompt works when the task can be expressed as a question and the answer can be expressed as text. "What is the capital of France?" works. "Explain the differences between GAAP and IFRS" works. "Write a cover letter for this job description" works. The input is self-contained, the output is self-contained, and no intermediate steps are required.
But consider a different kind of task: "Look at the invoices in my project folder, extract the line items, calculate the total by category, and create a summary spreadsheet."
This is not a question with an answer. This is a task with steps. And a single prompt, no matter how well-crafted, cannot accomplish it in one pass.
Why real tasks need multiple steps
The work that fills your day is not made of questions. It is made of procedures.
Reviewing a contract is not one mental operation. It is reading the document, identifying relevant clauses, comparing them to your standard terms, noting deviations, checking defined terms for consistency, and drafting a summary of issues. Each of those steps depends on the output of the previous one. You cannot summarize issues you have not identified. You cannot identify deviations without first reading the clauses.
Processing invoices is not one calculation. It is opening each file, parsing the format (which varies by vendor), extracting the relevant fields (which are in different locations in each invoice), normalizing the data, performing the calculations, and writing the output. A failure at any step -- a PDF that will not parse, a field that is missing, a format that is unexpected -- requires adjustment before proceeding.
Preparing a report from raw data is not one generation step. It is reading the source data, cleaning it, transforming it, calculating derived metrics, choosing visualizations, generating the report, and reviewing it for errors.
The common thread: real work decomposes into steps, and each step produces output that feeds the next. A single prompt captures the goal. But the goal and the path to the goal are very different things.
This is why AI chat interfaces, despite housing increasingly powerful language models, plateau in their usefulness for actual work. The model is smart enough to understand the task. The interface is too constrained to execute it.
The tool-use loop: plan, execute, observe, decide
AI agents bridge the gap between understanding a task and executing it through a mechanism called the tool-use loop. This is the core architecture that separates an agent from an assistant, and understanding it explains both the capabilities and the limitations of agent-based AI.
The loop has four phases.
Plan. The agent receives your task and reasons about what needs to happen first. This is the same language model reasoning that powers a chat response, but instead of generating a final answer, the model generates a plan -- what tool to use, what arguments to pass, what information it needs.
Execute. The agent invokes the chosen tool. This might be reading a file, listing a directory, writing a script, running code, or searching for content. The tool performs a real action with real side effects. A file is read. Code is executed. Output is produced.
Observe. The tool returns its result to the agent. The agent now has new information -- the contents of the file, the output of the script, the search results. This information was not available when the agent started. It is fresh data that changes the agent's understanding of the task.
Decide. Based on the new information, the agent decides what to do next. Maybe the file contents reveal that a different approach is needed. Maybe the script output contains an error that needs fixing. Maybe the search results answer one question but raise another. The agent reasons about its updated state and chooses the next action.
Then the loop repeats. Plan, execute, observe, decide. Plan, execute, observe, decide. Each iteration brings the agent closer to the goal, with each step informed by everything the agent has learned in previous steps.
This loop is not a fixed recipe. It is adaptive. The agent does not follow a predetermined script. It makes decisions at each step based on what it has observed. If a file is in an unexpected format, the agent adapts. If a calculation produces an anomalous result, the agent investigates. If the task turns out to be more complex than it first appeared, the agent adjusts its approach.
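The four phases can be sketched as a short loop. This is a minimal illustration, not docrew's actual runtime: `plan_next_action` is a hypothetical stand-in for the language-model call, and `tools` maps tool names to ordinary callables.

```python
# Minimal sketch of the plan-execute-observe-decide loop.
# `plan_next_action` stands in for a language-model call; `tools`
# maps tool names to plain Python callables.

def run_agent(task, tools, plan_next_action, max_steps=20):
    history = [("task", task)]  # everything the agent knows so far
    for _ in range(max_steps):
        # Plan: choose the next tool call (or finish) from the history.
        action = plan_next_action(history)
        if action["tool"] == "finish":
            return action["result"]
        # Execute: invoke the chosen tool; this is the real side effect.
        observation = tools[action["tool"]](**action["args"])
        # Observe: fold the fresh result back into the agent's state.
        history.append((action["tool"], observation))
        # Decide: the next iteration plans against the updated history.
    raise RuntimeError("step budget exhausted before the task finished")
```

The `max_steps` budget is the one non-obvious design choice: an adaptive loop with no cap can wander indefinitely, so practical runtimes bound it.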
Error recovery and iteration
The tool-use loop gives agents a capability that single-prompt interactions fundamentally lack: the ability to recover from errors.
In a chat interface, if the model's response is wrong, you -- the human -- detect the error, formulate a correction, and submit a new prompt. You are the error-detection and error-correction layer. Between your messages the model is effectively inert: it cannot run its own output, see that it failed, or decide on its own to try a different approach.
An agent handles errors internally.
Consider what happens when an agent writes a Python script to parse an invoice and the script throws an exception. In the chat model, you would see the script, copy it to your terminal, run it, see the error, paste the error back into the chat, ask for a fix, get a new script, copy it, run it, and repeat. You are the execution environment and the feedback loop.
The agent sees the error immediately. It reads the traceback. It identifies the problem -- maybe the PDF parser returned binary data instead of text, or a date field used an unexpected format. It rewrites the relevant section of the script. It runs the corrected version. If the correction introduces a new error, it iterates again. This cycle happens in seconds, without human intervention.
This is not artificial intelligence in some abstract sense. It is a practical loop: try, fail, analyze, adjust, retry. The same loop that a competent developer follows when debugging. The same loop that a careful analyst follows when a spreadsheet formula produces unexpected results.
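That cycle is, structurally, a few lines of control flow. A sketch, with `execute` and `fix` as hypothetical stand-ins for the sandboxed script runner and the model's rewrite step:

```python
# Try-fail-analyze-adjust-retry, as a plain loop. `execute` runs a
# script in a sandbox; `fix` asks the model to rewrite it given the
# error. Both are hypothetical placeholders here.

def run_with_repair(script, execute, fix, max_attempts=5):
    for _ in range(max_attempts):
        try:
            return execute(script)       # try
        except Exception as err:         # fail: the traceback comes back
            script = fix(script, err)    # analyze + adjust, then retry
    raise RuntimeError("could not repair the script")
```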
The speed of this loop is what makes agents practical for real tasks. A task that would require seven rounds of copy-paste-debug between a human and a chat interface takes the agent fifteen seconds of autonomous iteration. The human's time is not spent on mechanical debugging. It is spent on reviewing the final output.
Example: "analyze these invoices" in eight steps
Abstract descriptions of the tool-use loop are useful for understanding the architecture. Concrete examples are useful for understanding why it matters. Let us trace what happens when you give an agent a realistic task.
You tell the agent: "Look at the invoices in the Henderson project folder. Extract all line items, categorize them, calculate totals by category, and create a summary report."
Step 1: List the directory. The agent uses its file-listing tool to see what is in the Henderson project folder. It finds 14 PDF files, 3 Excel files, and 2 Word documents. The agent now knows the scope of the task.
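In code terms, this first step is just an inventory pass. A sketch using the standard library (the file counts in the example are what such a pass might return; the function name is illustrative):

```python
from pathlib import Path

def scope_folder(folder):
    # Count files by extension so the agent knows the scope of the
    # task before reading anything (e.g. {".pdf": 14, ".xlsx": 3}).
    counts = {}
    for entry in Path(folder).iterdir():
        if entry.is_file():
            suffix = entry.suffix.lower()
            counts[suffix] = counts.get(suffix, 0) + 1
    return counts
```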
Step 2: Read a sample. The agent reads the first PDF to understand the invoice structure -- where line items are located, how amounts are formatted, what categories exist. It discovers that this vendor uses a three-column table: description, quantity, unit price.
Step 3: Read another sample. The agent reads an Excel file and finds a different structure -- a flat list with columns for item code, description, total amount, and tax. Two different formats for the same kind of data.
Step 4: Write an extraction script. Based on its observations, the agent writes a Python script that handles both formats. It uses a PDF parser for the PDFs and a spreadsheet reader for the Excel files. The script extracts line items from each file and normalizes them into a common structure: description, category, amount.
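The normalization at the heart of that script can be sketched like this. The two input shapes are illustrative stand-ins for the vendor formats described above, not real invoice layouts:

```python
# Normalize two illustrative invoice layouts into one structure.

def normalize_pdf_rows(rows):
    # Three-column PDF table: (description, quantity, unit price).
    return [{"description": desc, "amount": qty * unit_price}
            for desc, qty, unit_price in rows]

def normalize_excel_rows(rows):
    # Flat spreadsheet rows keyed by column name.
    return [{"description": row["description"], "amount": row["total"]}
            for row in rows]
```

Everything downstream -- categorization, totals, the report -- works on the common structure, which is why the agent normalizes before it calculates.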
Step 5: Execute the script. The agent runs the script in a sandbox. 16 of 19 files process successfully. Three PDFs fail because they are scanned images, not text PDFs, and the text parser returns empty results.
Step 6: Handle the failures. The agent reads the error output, identifies the three problematic files, and adjusts its approach. It adds OCR-based text extraction as a fallback for the image PDFs, reruns the extraction for those three files, and succeeds.
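The recovery amounts to a fallback branch in the extraction path. `parse_text_layer` and `ocr_pages` are hypothetical placeholders for a text-layer PDF parser and an OCR tool:

```python
def extract_pdf_text(path, parse_text_layer, ocr_pages):
    # Try the text-layer parser first; a scanned-image PDF has no
    # text layer, so an empty result triggers the OCR fallback.
    text = parse_text_layer(path)
    if not text.strip():
        text = ocr_pages(path)
    return text
```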
Step 7: Categorize and calculate. With all line items extracted, the agent writes a second script to categorize items (using the language model for ambiguous descriptions) and calculate totals by category. It runs this script and produces a structured summary.
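The calculation in this step reduces to grouping normalized line items by category. A sketch; `categorize` stands in for the rules-plus-model step described above:

```python
from collections import defaultdict

def totals_by_category(items, categorize):
    # Sum line-item amounts per category; `categorize` may consult
    # the language model for ambiguous descriptions.
    totals = defaultdict(float)
    for item in items:
        totals[categorize(item["description"])] += item["amount"]
    return dict(totals)
```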
Step 8: Write the report. The agent writes the summary report as both a formatted text file and a CSV that can be opened in a spreadsheet application. It saves both to the Henderson project folder and tells you the task is complete.
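The machine-readable half of that output is a few lines with the standard `csv` module. A sketch that renders the category totals as CSV text:

```python
import csv
import io

def summary_csv(totals):
    # Render category totals as CSV text that any spreadsheet
    # application can open.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["category", "total"])
    for category, amount in sorted(totals.items()):
        writer.writerow([category, f"{amount:.2f}"])
    return buf.getvalue()
```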
Eight steps. Four tool types used (file listing, file reading, code execution, file writing). One error encountered and handled. The human involvement was typing the initial instruction and reviewing the output.
In a chat interface, each of these steps would require human action. You would paste the contents of files, copy scripts, run them locally, paste back errors, request modifications, and manually assemble the final output. The same task, but with you as the execution engine.
Orchestration patterns
Not all multi-step workflows follow a simple linear sequence. The tool-use loop supports several orchestration patterns, and agents use different patterns for different tasks.
Sequential. Steps happen one after another. Read file A, then process it, then write the result. This is the simplest pattern and the most common for tasks with clear dependencies. You must read before you can process, and you must process before you can write.
Iterative. The agent repeats a set of steps multiple times, each time with different input or refined parameters. Processing 19 invoices involves the same extraction logic applied 19 times. The loop structure handles this naturally -- the agent executes the same tool with different arguments each time.
Branching. The agent's decision at a step depends on what it observes. If the file is a PDF, use the PDF parser. If it is an Excel file, use the spreadsheet reader. If the extraction fails, try OCR. Branching is where the agent's reasoning capability matters most -- it must evaluate the situation and choose the appropriate path.
Error-recovery. A specialized form of branching where the agent detects a failure and switches to an alternative approach. The three scanned PDFs in our invoice example triggered an error-recovery branch. The agent did not stop and ask for help. It identified the problem, chose a different tool, and continued.
Aggregation. After processing multiple items in parallel or sequence, the agent combines the results into a unified output. All 19 invoice extractions are aggregated into a single categorized summary. This step often involves additional reasoning -- deduplicating entries, resolving inconsistencies, calculating totals.
These patterns compose. A realistic agent workflow might involve sequential steps that each contain iterative loops with branching and error-recovery, followed by an aggregation phase. The agent manages this complexity internally. You see the input and the output. The orchestration is invisible.
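The composition can be sketched in miniature: sequential phases, an iterative loop over files, branching on file type, an error-recovery fallback, and a final aggregation. All the helpers here are hypothetical placeholders:

```python
def run_workflow(paths, extractors, fallback, aggregate):
    # Sequential: extract everything first, then aggregate.
    all_items = []
    for path in paths:                               # iterative
        suffix = path.rsplit(".", 1)[-1].lower()
        extract = extractors.get(suffix)             # branching
        items = extract(path) if extract else None
        if not items:                                # error-recovery
            items = fallback(path)
        all_items.extend(items)
    return aggregate(all_items)                      # aggregation
```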
Why this is invisible to the user
The multi-step workflow is the agent's mechanism, not the user's concern.
When you ask a colleague to "analyze the Henderson invoices and give me a summary," you do not specify the steps. You do not say "first, list the files, then read the first one, then write a script." You state the goal. Your colleague figures out the path.
The same principle applies to agents. The tool-use loop is an implementation detail. The user says what needs to happen. The agent determines how to make it happen, step by step, adapting as it goes.
This is why the agent's chat interface looks deceptively simple. You type a message. After a few seconds or a few minutes, results appear. Maybe you see brief status updates -- "reading files," "processing data," "writing report." Maybe you see nothing until the final output lands.
The simplicity is not a limitation. It is a design choice. The value of the agent is precisely that you do not have to think about the steps. You do not have to sequence the operations, handle the errors, manage the intermediate state, or orchestrate the workflow. You define the outcome. The agent delivers it.
This does not mean the process is opaque. A well-designed agent shows its work -- which files it read, which tools it used, what code it wrote. You can review the process after the fact. But you do not have to direct the process in real time. The difference between directing and reviewing is the difference between doing the work yourself and delegating it.
The gap between one step and many
The AI industry spent the last few years making single-step interactions better. Larger context windows. Faster inference. Better reasoning. Multimodal input. These improvements are real and valuable.
But they are improvements within the single-step paradigm. A better response to a single prompt is still a single response. A larger context window means you can paste more text before asking your question, but you still ask one question and get one answer. Faster inference means the answer arrives sooner, but it is still one answer to one question.
The transition from single-step to multi-step is not a quantitative improvement. It is a qualitative shift. It moves AI from a tool you interact with -- turn by turn, prompt by prompt -- to a system that executes on your behalf.
docrew is built on the multi-step paradigm. The Rust agent runtime implements the tool-use loop natively -- plan, execute, observe, decide -- with tools for reading files, writing files, executing code, searching content, and parsing documents. Each task you describe becomes a workflow that the agent executes autonomously, adapting to errors, handling unexpected formats, and delivering finished output.
The chat window is still there. It is where you describe the task and where you receive the result. But between your input and the output, the agent is working -- reading, writing, executing, iterating, recovering, synthesizing. The steps are real. The work is real. The output is not a description of what could be done. It is the done thing.
That gap -- between describing work and delivering work, between one step and many -- is where the next generation of AI tools lives. The models are ready. The intelligence is there. What was missing was the architecture to turn intelligence into execution, one step at a time, until the job is finished.