Designing Codebases for AI Agents
Codebase architecture is the real bottleneck on AI capability. Notes from Matt Pocock and David Ondrej on maximizing agent experience.
- ai-tools
- workflow
- engineering
Source: David Ondrej interview with Matt Pocock (full video).
AI coding agents are getting smarter every month. The bottleneck is no longer the model. It is the codebase the model is asked to read, change, and reason about.
Matt Pocock’s interview with David Ondrej covers architectural strategies for designing software so agents can work inside it effectively. The conversation is about treating agent experience (AX) as a first-class engineering constraint.
Agent Experience as a design constraint
Codebase design is the ultimate constraint on AI capability. Raw model intelligence has improved fast enough that the weak link is now how your repository interfaces with the agent.
The practical consequence is that you should spend less time chasing day-one model releases and more time on the architecture that wraps the model. Precise scoping, clean interfaces between modules, decoupled system design, and deterministic test scenarios matter more than whichever model shipped last Tuesday.
This is not an abstract principle. It has a direct effect on token cost and hallucination rates.
A clean, modular codebase lets you use smaller, cheaper, dumber models because the codebase boundaries act as guardrails. The module interfaces constrain what the model can do wrong. A messy ball of tightly coupled files forces even the best model to waste tokens navigating relationships that should not exist.
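To make the guardrail idea concrete, here is a minimal Python sketch of a module boundary. The names (Invoice, TaxPolicy, FlatRateTax, total_cents) are invented for illustration and do not come from the interview. The point is only that callers depend on a narrow typed interface, so an agent editing one implementation has a small, checkable surface to get right.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Invoice:
    """Hypothetical domain type used only for this illustration."""
    subtotal_cents: int
    country: str


class TaxPolicy(Protocol):
    """The only surface other modules (and agents) are meant to depend on."""

    def tax_cents(self, invoice: Invoice) -> int: ...


class FlatRateTax:
    """One implementation behind the interface; its internals can change freely."""

    def __init__(self, rate_percent: int) -> None:
        if not 0 <= rate_percent <= 100:
            raise ValueError("rate_percent must be between 0 and 100")
        self._rate_percent = rate_percent

    def tax_cents(self, invoice: Invoice) -> int:
        # Integer arithmetic keeps the result identical on every run.
        return invoice.subtotal_cents * self._rate_percent // 100


def total_cents(invoice: Invoice, policy: TaxPolicy) -> int:
    """Callers see the Protocol, never the concrete class."""
    return invoice.subtotal_cents + policy.tax_cents(invoice)
```

A task scoped to "change how FlatRateTax rounds" touches one class and is verified by one deterministic test, which is the kind of boundary a smaller model can work within.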
Stateful tooling writes to the repo
Most AI tasks are stateless. You ask, the model answers, the conversation ends.
Complex tasks need local state that survives across runs. Pocock’s skills CLI shows how. His tool writes deterministic tracker files directly into the repository workspace: mission.md, learning records, HTML output. Files in the repo itself.
This is deliberately different from cramming messy chat logs into the model’s transient context window. The files persist. They are readable by the agent on the next run, by other agents in the same workspace, and by humans reviewing what happened. The repository becomes the system’s long-term memory, not the context window.
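The idea can be sketched in a few lines of Python. This is not how Pocock’s skills CLI is implemented; it is a minimal illustration that assumes a hypothetical .agent-state directory and a learnings.jsonl file, with only mission.md taken from the description above.

```python
import json
import tempfile
from pathlib import Path

STATE_DIR = ".agent-state"  # hypothetical directory name, not from the interview


def write_mission(repo_root: Path, goal: str, steps: list[str]) -> Path:
    """Write (or overwrite) the mission file that later runs read first."""
    state_dir = repo_root / STATE_DIR
    state_dir.mkdir(parents=True, exist_ok=True)
    body = f"# Mission\n\n{goal}\n\n" + "".join(f"- [ ] {s}\n" for s in steps)
    path = state_dir / "mission.md"
    path.write_text(body, encoding="utf-8")
    return path


def record_learning(repo_root: Path, run_id: str, note: str) -> Path:
    """Append one learning record as a JSON line inside the repo."""
    state_dir = repo_root / STATE_DIR
    state_dir.mkdir(parents=True, exist_ok=True)
    log_path = state_dir / "learnings.jsonl"  # hypothetical file name
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"run": run_id, "note": note}, sort_keys=True) + "\n")
    return log_path


def load_learnings(repo_root: Path) -> list[dict]:
    """Read every record left by earlier runs, oldest first."""
    log_path = repo_root / STATE_DIR / "learnings.jsonl"
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing is written to a real repo.
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        write_mission(repo, "Add search to the docs site", ["index pages", "build query API"])
        record_learning(repo, "run-1", "integration tests need a seeded database")
        print(load_learnings(repo))
```

Because the records are plain files, they show up in diffs and code review like any other change, which is what makes them readable by humans as well as by the next agent run.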
AFK: queues over loops
The interview dismantles the hype around infinite autonomous agentic loops.
Pocock describes the “Ralph loop” problem: a persistent while loop that keeps piping the agent’s self-evaluation back into the model. In his account, the loop burns tokens, produces low-yield output, and is marketed primarily to sell compute credits.
A production-grade alternative treats agents as parallelized, stateless worker nodes pulling tasks from a traditional project backlog. The system uses GitHub labels like agent-implement to trigger single-purpose workflow runs via CI/CD pipelines. No infinite loops. No agents deciding what to do next on their own. A labeled issue triggers a bounded, deterministic execution.
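As a rough illustration of the queue model, the Python sketch below processes an in-memory backlog instead of calling the GitHub API. The Issue type, the agent-done label, and the max_runs cap are assumptions made for the example; only the agent-implement label comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Callable

TRIGGER_LABEL = "agent-implement"  # label named in the article
DONE_LABEL = "agent-done"          # hypothetical label for finished work


@dataclass
class Issue:
    """Stand-in for a backlog item; a real system would load these from the tracker."""
    number: int
    title: str
    labels: set[str] = field(default_factory=set)


def run_labeled_issues(
    backlog: list[Issue],
    worker: Callable[[Issue], str],
    max_runs: int = 10,
) -> dict[int, str]:
    """Run the worker once per labeled issue, then stop.

    There is no while loop and no self-evaluation step: the function makes one
    pass over the backlog and is capped at max_runs executions.
    """
    results: dict[int, str] = {}
    for issue in backlog:
        if len(results) >= max_runs:
            break
        if TRIGGER_LABEL not in issue.labels:
            continue  # unlabeled work is never picked up by the agent
        results[issue.number] = worker(issue)
        # Swap labels so a second pass does not redo the same issue.
        issue.labels.discard(TRIGGER_LABEL)
        issue.labels.add(DONE_LABEL)
    return results
```

The worker here is any function that takes one issue and returns a result, so each run stays stateless. What gets worked on next is decided by whoever applies the label, not by the agent.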
True AFK execution requires sandboxed runtime environments. Running agents natively on a development host exposes the machine to unvetted command execution and credential exfiltration. The architecture needs isolated containers: Docker, Podman, or temporary remote instances via tooling like Pocock’s sandcastle.
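One way to picture the isolation is as a locked-down docker run invocation. The Python sketch below only builds the argument list and does not execute it. The image name, resource limits, and mount path are placeholders, and this is not how sandcastle works internally. Note that --network none blocks all traffic, so a real agent that calls a hosted model would need a narrower egress rule instead.

```python
import shlex
from pathlib import Path


def sandbox_command(workspace: Path, image: str, agent_cmd: list[str]) -> list[str]:
    """Build a `docker run` argv that confines an agent to one mounted workspace."""
    return [
        "docker", "run",
        "--rm",                      # discard the container when the run ends
        "--network", "none",         # no network access at all (see note above)
        "--cap-drop", "ALL",         # drop every Linux capability
        "--read-only",               # root filesystem is read-only
        "--tmpfs", "/tmp",           # scratch space that vanishes with the container
        "--memory", "2g",            # placeholder resource limits
        "--pids-limit", "256",
        "--volume", f"{workspace.resolve()}:/workspace",  # the only writable host path
        "--workdir", "/workspace",
        image,
        *agent_cmd,
    ]


if __name__ == "__main__":
    cmd = sandbox_command(Path("."), "python:3.12-slim", ["python", "-m", "pytest"])
    # Print the command for inspection rather than running it.
    print(shlex.join(cmd))
```

Host environment variables are not forwarded by docker run unless passed explicitly, so credentials on the development machine stay out of the container by default.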
Human review checkpoints remain standard, but the friction of evaluation has to be engineered down. Pocock notes that advanced teams run front-end agents that record video of their own workspace changes, overlay text-to-speech commentary describing what changed, and attach the media file directly to the pull request for single-click verification.
Procedures over autonomous abilities
Standard agent setups broadcast descriptions of every available tool into the system prompt. When you scale to dozens of custom tools, those descriptions bloat the context, dilute instruction adherence, and drive up token overhead.
The fix is to configure custom tools as user-invoked procedures rather than model-invoked abilities. Setting disable_model_invocation: true on a tool configuration hides the tool description from the ambient context window until a human manually triggers it. The model cannot see or invoke tools it does not need for the current task, so the context stays lean.
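The effect of that flag can be modeled with a toy registry in Python. This is not any real agent framework’s API: the Tool class and both functions are hypothetical, and only the disable_model_invocation name is taken from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool record; field names mirror the flag discussed above."""
    name: str
    description: str
    disable_model_invocation: bool = False


def ambient_tool_prompt(tools: list[Tool]) -> str:
    """Describe only model-invocable tools in the always-on system prompt."""
    visible = [t for t in tools if not t.disable_model_invocation]
    return "\n".join(f"- {t.name}: {t.description}" for t in visible)


def invoke_by_user(tools: list[Tool], name: str) -> str:
    """Load a procedure's text only when a human asks for it by name."""
    for tool in tools:
        if tool.name == name:
            return tool.description
    raise KeyError(f"no tool named {name!r}")
```

With twenty procedures flagged this way, the ambient prompt carries zero lines for them, and each one costs tokens only in the conversations where someone actually triggers it.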
Pocock’s “grill me” pattern is a popular example of a procedural prompt. It is a lean 4-5 sentence structure that turns the model into an aggressive adversarial interviewer. Instead of jumping to one-shot code generation, the pattern forces the human to defend their architectural design, smoke-test edge cases, and establish explicit requirements before any code is written.
The pattern works because it inverts the normal flow. Most agent setups let the model generate first and correct second. Grill me forces requirements discovery before generation, which produces better output and reduces the number of corrective loops.
The pragmatic de-bloating blueprint
First, delete every custom instruction file, system prompt, .cursorrules, CLAUDE.md, plugin config, and unused MCP server. Most active setups are choked by conflicting, bloated, or outdated multi-page constraints that degrade modern model performance.
Second, run your development agent on a completely blank slate to evaluate its baseline reasoning. You may find the stock model is better than the stack of rules you layered on top of it.
Third, gradually layer back single-purpose, human-triggered procedural tools only as you identify repetitive, programmatic bottlenecks in your workflow. Add nothing by default. Add only what a real gap demands.
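Before deleting anything, it can help to see what has accumulated. The Python sketch below is a small read-only audit that lists instruction files and their line counts. The file names in CANDIDATE_FILES are an assumed starting list based on common tool conventions, not something specified in the interview, so adjust them to your own setup.

```python
import tempfile
from pathlib import Path

# Assumed list of common agent instruction and config file names; edit to taste.
CANDIDATE_FILES = (".cursorrules", "CLAUDE.md", "AGENTS.md", ".mcp.json")


def audit_instruction_files(repo_root: Path) -> list[tuple[str, int]]:
    """Return (relative path, line count) for each candidate file, largest first.

    Nothing is modified or deleted; the output is a review list for a human.
    """
    found: list[tuple[str, int]] = []
    for path in repo_root.rglob("*"):
        if path.is_file() and path.name in CANDIDATE_FILES:
            text = path.read_text(encoding="utf-8", errors="replace")
            found.append((path.relative_to(repo_root).as_posix(), len(text.splitlines())))
    # Sort by size descending, then by path so the order is stable.
    return sorted(found, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Demo against a throwaway directory rather than a real repository.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / ".cursorrules").write_text("rule one\nrule two\n", encoding="utf-8")
        for rel_path, line_count in audit_instruction_files(root):
            print(f"{line_count:5d}  {rel_path}")
```

The longest files are usually the first candidates to cut, since multi-page rule sets are where conflicting and outdated instructions tend to pile up.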
Watch the full interview for more detail on each topic at youtu.be/nQwJVHCtDDY.