Zero-Trust Document Processing: What It Means in Practice
Zero trust isn't just a network concept. Applied to document processing, it means verifying every access, minimizing exposure, and never trusting the pipeline by default.
Zero trust beyond the network
Zero trust started as a network security concept: never trust, always verify. Don't assume that traffic inside the corporate perimeter is safe. Authenticate and authorize every request, regardless of where it comes from.
The concept is now widely adopted in network security. But it is rarely applied with the same rigor to one of the most sensitive data flows in modern organizations: document processing with AI.
When you upload a document to a cloud AI service, you're implicitly trusting:
- The upload endpoint is who it claims to be
- The provider's storage is secure
- The provider's processing pipeline hasn't been compromised
- The provider's employees won't access your data
- The provider's sub-processors are equally trustworthy
- The provider's logging doesn't expose your content
- The provider's retention policies are enforced correctly
- The provider's backup systems are secured
That's a lot of trust for an organization that claims to practice "zero trust security."
Zero-trust document processing means applying the same skepticism to your AI tools that you apply to your network. Here's what that looks like in practice.
The five principles of zero-trust document processing
1. Minimize data exposure
A central zero-trust practice: expose the minimum data necessary for the task.
For document processing, this means:
Don't send the file when you can send the text. A PDF can contain text, fonts, images, metadata, revision history, digital signatures, and format-specific structures. The AI needs the text. Sending the entire file exposes everything else unnecessarily.
docrew implements this by parsing files locally and sending only extracted text to the language model. The raw file -- with all its metadata and embedded content -- never leaves the device.
Don't send the whole document when you can send the relevant section. If the task is "extract the payment terms from this contract," the model doesn't need the entire 50-page contract. It needs the section discussing payment terms.
Smart context management can extract relevant sections before sending to the model, further minimizing the data that crosses any trust boundary.
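As a rough illustration of section-level minimization, the sketch below keeps only the paragraphs that mention a task-related keyword. The function name, the keyword matching, and the sample contract text are all made up for this example; a real implementation would likely use something more robust than substring matching.

```python
import re


def select_relevant_sections(text: str, keywords: list[str]) -> list[str]:
    """Return only the paragraphs that mention at least one keyword."""
    # Treat blank lines as paragraph boundaries in the locally extracted text.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    lowered = [k.lower() for k in keywords]
    return [p for p in paragraphs if any(k in p.lower() for k in lowered)]


# Illustrative sample text, not taken from any real contract.
contract = (
    "1. Parties\nThis agreement is between Acme and Beta.\n\n"
    "2. Payment Terms\nInvoices are due within 30 days of receipt.\n\n"
    "3. Confidentiality\nEach party keeps the other's information secret."
)

# Only this excerpt would cross the trust boundary, not the full document.
excerpt = select_relevant_sections(contract, ["payment", "invoice"])
print(excerpt)
```

Keyword filtering can miss relevant text that uses different wording, so it trades some recall for lower exposure.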
Don't persist what can be transient. If the model only needs to see the text during analysis, don't store it anywhere else. Transient processing means the data exists only for the duration of the API call.
2. Verify every component
In a zero-trust network, every device and user is authenticated. In zero-trust document processing, every component in the pipeline should be verified.
Application binary. The software processing your documents should be code-signed by a verified developer. docrew's binary is signed with a developer certificate and notarized by Apple (on macOS) or signed with an Authenticode certificate (on Windows). You can verify the signature before trusting the application with your files.
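Model API endpoint. The language model API should be accessed over TLS with certificate verification. Encryption alone is not enough: the client should validate the server's certificate chain and hostname, so that it knows who it is talking to and not only that the channel is encrypted.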
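A minimal sketch of what endpoint verification can look like with Python's standard library. `ssl.create_default_context()` enables certificate chain validation and hostname checking by default. The helper names and the idea of refusing non-HTTPS URLs are assumptions made for this example, not a description of any particular product's HTTP client.

```python
import ssl
import urllib.request


def make_verifying_context() -> ssl.SSLContext:
    """Build a TLS context that validates the server certificate and hostname."""
    ctx = ssl.create_default_context()  # CERT_REQUIRED and check_hostname=True
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocol versions
    return ctx


def post_text(url: str, body: bytes, ctx: ssl.SSLContext) -> bytes:
    """Send extracted text to the (hypothetical) model API over verified TLS."""
    if not url.startswith("https://"):
        raise ValueError("refusing to send document text over a non-TLS URL")
    req = urllib.request.Request(url, data=body, method="POST")
    # urlopen raises an error if certificate or hostname verification fails.
    with urllib.request.urlopen(req, context=ctx, timeout=30) as resp:
        return resp.read()


ctx = make_verifying_context()
```

The point of the sketch is the default: verification stays on unless someone deliberately disables it, and a failed check aborts the request instead of falling back.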
Sandbox integrity. The execution sandbox should be enforced at the OS level, not just promised by the application. With kernel-enforced mechanisms (Seatbelt on macOS, bubblewrap/bwrap on Linux), the restrictions are applied outside the sandboxed process, so sandboxed code cannot simply opt out of them.
Update channel. Application updates should be signed and verified. An attacker who compromises the update channel can deliver malicious code that looks like a legitimate update.
3. Enforce least privilege
Every component should have the minimum access necessary for its function.
File access. The agent should access only files in the designated workspace folder. Not the home directory, not the system files, not other applications' data. docrew's agent is scoped to the project folder the user selects.
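A workspace-scoped file access layer can be sketched as a path check that runs before every read or write. The names `WorkspaceError` and `resolve_in_workspace` are hypothetical, and this sketch assumes Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


class WorkspaceError(PermissionError):
    """Raised when a requested path falls outside the workspace."""


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Resolve a requested path and reject anything outside the workspace."""
    root = workspace.resolve()
    # resolve() collapses ".." segments and follows symlinks, so traversal
    # tricks are evaluated against the real target location.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise WorkspaceError(f"{requested!r} is outside the workspace")
    return candidate


workspace = Path("workspace")  # illustrative folder name
print(resolve_in_workspace(workspace, "contracts/a.txt"))
```

An application-level check like this is one layer only. It does not replace OS-level sandboxing, since a path can change between the check and the actual file operation.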
Code execution. Sandboxed code should have no network access, no access to files outside the workspace, and no ability to modify the system. The sandbox enforces this at the kernel level.
Model API access. The API call should include only the text needed for the current task. No background data collection, no telemetry that includes document content, no persistent context that accumulates across sessions.
User authentication. Each user should be authenticated and authorized independently. The agent should use the user's credentials, not shared API keys that grant broader access.
4. Assume compromise
Zero trust assumes that any component can be compromised. Design the system so that a compromise of one component doesn't compromise everything.
If the model API is compromised: The attacker gains access to text content sent during the compromise window, but not to the raw files (which never left the device), not to historical documents (which weren't stored at the API), and not to the local file system.
If the endpoint device is compromised: This is the worst case -- the attacker has access to everything the user has access to. But this risk exists regardless of whether you use local or cloud AI. The key difference: with local AI, the compromise is contained to one device. With cloud AI, the compromise of one user's credentials may grant access to their cloud-stored documents across all devices.
If the application is compromised (via update): Code signing verification prevents unauthorized updates. If that verification is somehow bypassed, the sandbox limits what the compromised application can do -- it can't access files outside the workspace or make network connections beyond the model API.
If the sandbox is bypassed: This would typically require a vulnerability in the sandbox mechanism itself (a Seatbelt or bwrap escape) or a flaw in how the sandbox is configured. Escapes of this kind are rare and high-value exploits. Even if the sandbox is escaped, the attacker is on the local device -- they don't automatically gain access to a cloud service containing thousands of other users' documents.
5. Log and audit everything
Zero trust requires visibility. You can't verify what you can't see.
Local audit logs. Every agent action -- file reads, tool executions, model API calls, file writes -- should be logged locally. These logs enable post-incident analysis and ongoing monitoring.
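One way to log agent actions without copying document content into the log is to record a hash and a size instead of the text. The sketch below writes JSON lines to any text stream. The field names and the two helpers are assumptions for illustration.

```python
import hashlib
import io
import json
import time
from typing import Optional, TextIO


def audit_record(action: str, target: str, content: Optional[bytes] = None) -> dict:
    """Describe an agent action without storing the document text itself."""
    record = {"ts": time.time(), "action": action, "target": target}
    if content is not None:
        # A digest and a length identify what was touched without exposing it.
        record["sha256"] = hashlib.sha256(content).hexdigest()
        record["bytes"] = len(content)
    return record


def append_audit(stream: TextIO, record: dict) -> None:
    """Append one record as a single JSON line."""
    stream.write(json.dumps(record, sort_keys=True) + "\n")


log = io.StringIO()  # stands in for a local, append-only log file
rec = audit_record("file_read", "contracts/a.txt", b"payment due in 30 days")
append_audit(log, rec)
print(log.getvalue())
```

Note that file names can themselves be sensitive, so a stricter variant might hash the target path as well.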
Network traffic monitoring. The only outbound connections should be to known endpoints (model API, authentication, updates). Any unexpected outbound connection is an anomaly worth investigating.
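The allowlist idea can be sketched as a simple check over observed destinations. The host names below are placeholders, not real endpoints, and how the observed URLs are collected (proxy logs, firewall logs, and so on) is outside this example.

```python
from urllib.parse import urlsplit

# Hypothetical endpoints for model API, authentication, and updates.
ALLOWED_HOSTS = frozenset({"api.model.example", "auth.example", "updates.example"})


def is_expected_destination(url: str, allowed: frozenset = ALLOWED_HOSTS) -> bool:
    """True only for HTTPS connections to an exactly matching known host."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in allowed


def unexpected_connections(observed: list[str], allowed: frozenset = ALLOWED_HOSTS) -> list[str]:
    """Return the observed destinations that are not on the allowlist."""
    return [u for u in observed if not is_expected_destination(u, allowed)]


observed = [
    "https://api.model.example/v1/generate",
    "https://api.model.example.evil.test/v1/generate",  # lookalike host
    "http://updates.example/latest",  # known host, but not TLS
]
print(unexpected_connections(observed))
```

Exact host matching is deliberate here: suffix or substring matching would let lookalike domains through.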
API usage logs. The model API should log usage metadata (timestamp, token count, model used) without logging document content. These logs support billing reconciliation and usage anomaly detection.
File access tracking. The major operating systems offer file access auditing, though it usually has to be enabled and configured. Where it is turned on, the agent's file reads and writes are visible in system audit logs. This provides an independent record of what documents the agent accessed.
Implementing zero-trust document processing
Step 1: Audit your current pipeline
Map every system that touches your documents during AI processing. For each system, answer:
- What data does it receive? (raw file, extracted text, metadata)
- How long does it retain the data?
- Who has access?
- How is it protected?
- How would you know if it was compromised?
If you're using cloud AI, this map likely includes five to ten or more systems. If you can't answer these questions for every system, you're trusting by default -- not verifying.
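The audit questions above can be captured as a small inventory, so that unanswered questions show up explicitly. This is only a sketch: the class, field names, and the two example systems are invented for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PipelineSystem:
    """One system that touches documents, with one field per audit question."""
    name: str
    data_received: Optional[str] = None         # raw file, extracted text, metadata
    retention: Optional[str] = None             # how long data is kept
    who_has_access: Optional[str] = None
    protection: Optional[str] = None
    compromise_detection: Optional[str] = None  # how you would notice a breach


def unanswered(system: PipelineSystem) -> list[str]:
    """List the audit questions that have no answer yet for this system."""
    return [
        f.name for f in fields(system)
        if f.name != "name" and not getattr(system, f.name)
    ]


# Hypothetical entries in a pipeline map.
local_parser = PipelineSystem(
    "local parser", "raw file", "none (in memory)", "device user",
    "OS sandbox", "local audit log",
)
cloud_ocr = PipelineSystem("cloud OCR", data_received="raw file", protection="TLS")

for system in (local_parser, cloud_ocr):
    print(system.name, "->", unanswered(system) or "fully answered")
```

Any system with a non-empty list is one you are currently trusting by default.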
Step 2: Minimize the pipeline
Reduce the number of systems that handle your documents. The ideal pipeline has three components:
- Local device (reads files, extracts text)
- Secure transport (TLS to model API)
- Model API (processes text, returns results)
docrew implements this minimal pipeline. No intermediate storage, no file upload servers, no processing queues, no logging databases that accumulate document content.
Step 3: Enforce boundaries
Implement technical controls at each trust boundary:
- Device to network: Only text content crosses this boundary. Enforce via local parsing (files never transmitted).
- Network to model API: TLS with certificate verification. Enforce via the application's HTTP client.
- Agent to file system: Workspace-scoped access. Enforce via the agent's file access layer.
- Agent to code execution: Sandboxed with no network, restricted file access. Enforce via OS-level sandbox.
Step 4: Monitor continuously
Set up monitoring for:
- Unexpected outbound network connections
- Unusual file access patterns (agent reading files outside normal scope)
- API usage anomalies (unusually large requests, unusual hours)
- Sandbox violations (if the OS reports sandbox breach attempts)
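As a sketch of the API usage check in the list above, the code below flags requests that are unusually large or made outside expected hours. The token threshold and the working-hours window are arbitrary assumptions. Real values would come from your own usage baseline.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UsageEvent:
    """Usage metadata only: no document content is needed for this check."""
    timestamp: datetime
    token_count: int


def flag_anomalies(events, max_tokens=50_000, work_hours=range(7, 20)):
    """Return (event, reasons) pairs for events that look unusual."""
    flagged = []
    for event in events:
        reasons = []
        if event.token_count > max_tokens:
            reasons.append("large_request")
        if event.timestamp.hour not in work_hours:
            reasons.append("unusual_hour")
        if reasons:
            flagged.append((event, reasons))
    return flagged


# Illustrative events, not real usage data.
events = [
    UsageEvent(datetime(2024, 1, 15, 10, 0), 1_200),
    UsageEvent(datetime(2024, 1, 15, 3, 0), 900),
    UsageEvent(datetime(2024, 1, 15, 11, 0), 80_000),
]
for event, reasons in flag_anomalies(events):
    print(event.timestamp.isoformat(), reasons)
```

Fixed thresholds are the simplest option. A per-user baseline would produce fewer false alarms but needs more history.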
Step 5: Respond to violations
Define responses for detected violations:
- Unexpected network connection: Isolate the device, investigate
- Unusual file access: Review agent logs, verify user activity
- API anomaly: Rotate credentials, review usage logs
- Sandbox violation: Treat as potential compromise, escalate to security team
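Writing the responses down as data makes it harder for a violation type to exist without a defined response. The sketch below mirrors the list above. The enum and function names are made up for this example.

```python
from enum import Enum


class Violation(Enum):
    UNEXPECTED_CONNECTION = "unexpected_connection"
    UNUSUAL_FILE_ACCESS = "unusual_file_access"
    API_ANOMALY = "api_anomaly"
    SANDBOX_VIOLATION = "sandbox_violation"


# Each violation type maps to the ordered steps from the response plan.
RESPONSES: dict[Violation, list[str]] = {
    Violation.UNEXPECTED_CONNECTION: ["isolate the device", "investigate"],
    Violation.UNUSUAL_FILE_ACCESS: ["review agent logs", "verify user activity"],
    Violation.API_ANOMALY: ["rotate credentials", "review usage logs"],
    Violation.SANDBOX_VIOLATION: [
        "treat as potential compromise",
        "escalate to security team",
    ],
}


def respond(violation: Violation) -> list[str]:
    """Look up the defined response steps for a detected violation."""
    return RESPONSES[violation]


print(respond(Violation.API_ANOMALY))
```

A startup check such as `set(RESPONSES) == set(Violation)` would catch a newly added violation type that has no response yet.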
Zero trust and compliance
Zero-trust document processing aligns naturally with compliance requirements:
GDPR (data minimization): Sending only text, not files, to the model is data minimization in practice.
HIPAA (minimum necessary standard): Processing only the information needed for the task, with the rest staying local.
EU AI Act (cybersecurity): Minimized attack surface, verified components, and audit logging support the cybersecurity requirements for AI systems.
SOC 2 (access controls): Least-privilege access, authenticated components, and continuous monitoring support the access control criteria.
The compliance work is simpler because the architecture is simpler. Fewer systems, fewer data flows, fewer trust relationships to document and manage.
The trust inversion
Traditional cloud AI asks you to trust the provider with your documents. Zero-trust document processing inverts this: trust nothing, verify everything, expose the minimum.
The documents stay on your device -- not because you trust your device more than the cloud, but because keeping them local minimizes the number of things you need to trust. One device is easier to secure, monitor, and audit than a distributed pipeline of third-party services.
Zero trust isn't about paranoia. It's about architecture. An architecture where every component is verified, every boundary is enforced, and every access is justified is inherently more secure than one that trusts components by default.
Your documents deserve the same zero-trust treatment that your network already gets. The architecture to deliver it exists today.