Agents have a shape: what a production-ready AI agent actually looks like
A demo is not a system. Here is the anatomy of an AI agent that can survive contact with production, and how we build them.
Inteeka · 16 June 2026 · 6 min read

Almost anyone can build an AI agent that demos well. Wire a capable model to a couple of tools, give it a confident prompt, and within an afternoon you have something that looks remarkable on a screen. The hard part is everything that happens after the applause. At Vercel Ship London the recurring theme across sessions (“Agents have a shape”, the launch of Vercel Agent, and the CTO panel on agents in production) was the same uncomfortable truth: a demo is a moment, and a production agent is a system. This piece is about the difference, and about the structure a real agent needs to do work people can depend on.
The gap between a demo and a system
A demo only has to work once, on a path you have already walked. A production agent has to work on inputs nobody anticipated, for users who do not care how it was built, on a Tuesday when a downstream API is slow. The gap between those two things shows up in five concrete places.
- Reliability: the agent must behave sensibly when a tool fails, a model returns nonsense, or a step times out, rather than confidently inventing a result.
- Cost: a chatty multi-step agent can quietly cost orders of magnitude more per task than the prototype that charmed the room.
- Latency: a reply that takes thirty seconds in a demo is a curiosity; in a support queue it is a churned customer.
- Security: once an agent can act, every tool it holds is a permission someone could abuse through it.
- Observability: when something goes wrong at scale, you need to see what the agent decided and why, not guess.
None of these is visible in the demo. All of them decide whether the agent is worth running.
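The reliability point can be made concrete with a minimal sketch. It assumes a hypothetical `call_tool_safely` wrapper and a made-up flaky lookup tool, neither of which comes from any particular framework. The idea is that a tool call is retried a bounded number of times and, if it still fails, the failure is returned explicitly so the agent cannot paper over it with an invented result.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolResult:
    ok: bool
    value: Optional[str] = None
    error: Optional[str] = None


def call_tool_safely(tool: Callable[[], str], max_attempts: int = 3,
                     backoff_seconds: float = 0.0) -> ToolResult:
    """Try a tool a bounded number of times; report failure explicitly."""
    last_error = "tool was never called"
    for attempt in range(1, max_attempts + 1):
        try:
            return ToolResult(ok=True, value=tool())
        except Exception as exc:  # a real system would catch narrower errors
            last_error = f"attempt {attempt} failed: {exc}"
            time.sleep(backoff_seconds * attempt)
    # No fabricated value: the caller has to handle the failure.
    return ToolResult(ok=False, error=last_error)


def make_flaky_lookup(failures_before_success: int) -> Callable[[], str]:
    """Hypothetical tool that fails a set number of times, then succeeds."""
    state = {"calls": 0}

    def lookup() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TimeoutError("downstream API is slow")
        return "order 1042: shipped"

    return lookup


recovered = call_tool_safely(make_flaky_lookup(2))  # succeeds on attempt 3
gave_up = call_tool_safely(make_flaky_lookup(5))    # still failing after 3
```

Here `recovered.ok` is true and `gave_up.ok` is false with no value attached, which leaves the control loop to decide whether to retry later or hand the task to a person.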
The shape of a real agent
The phrase “agents have a shape” is a useful corrective. An agent is not a single clever prompt; it is an assembly of parts, each of which earns its place. When we sketch one before building, the same components appear every time.
- A model chosen for the job: capable enough to reason, cheap and fast enough to run at volume.
- Well-scoped tools: a small set of functions that do exactly what is needed and nothing more, each with narrow, validated inputs.
- Memory and context: the right information assembled for the task, and a clear record of what has happened so far.
- A control loop: the logic that decides when to call a tool, when to ask for help, and when the job is done.
- Guardrails: limits on what the agent may do, with validation and approval where the stakes are high.
- Evals and observability: measurement of quality before release and visibility into behaviour after it.
- Infrastructure to run it: durable execution, retries, queues and sandboxes so the agent survives failure rather than disappearing into it.
Drop any one of these and the system becomes fragile in a predictable way. The shape is not decoration; it is what makes the agent dependable.
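As an illustration of how a few of these parts fit together, here is a small sketch of a control loop with a tool allow-list, input validation and a step budget. Everything in it is an assumption made for the example: `scripted_model` stands in for a real model call, and `get_order_status` is a hypothetical tool.

```python
from typing import Callable, Dict, List, Tuple


def get_order_status(order_id: str) -> str:
    """Hypothetical tool with a narrow, validated input."""
    if not order_id.isdigit():
        raise ValueError("order_id must be numeric")
    return f"order {order_id}: shipped"


# The allow-list: the agent can call these tools and nothing else.
TOOLS: Dict[str, Callable[[str], str]] = {"get_order_status": get_order_status}

# An action is (kind, name_or_text, argument).
# kind is "tool", "ask_human" or "done".
Action = Tuple[str, str, str]


def scripted_model(history: List[str]) -> Action:
    """Stand-in for a model call: picks the next action from the history."""
    if not history:
        return ("tool", "get_order_status", "1042")
    return ("done", history[-1], "")


def run_agent(decide: Callable[[List[str]], Action], max_steps: int = 5) -> str:
    history: List[str] = []
    for _ in range(max_steps):
        kind, name, argument = decide(history)
        if kind == "done":
            return name
        if kind == "ask_human":
            return f"needs human: {name}"
        if kind != "tool" or name not in TOOLS:
            # Guardrail: anything outside the allow-list is escalated.
            return f"needs human: action {kind}:{name} is not permitted"
        try:
            history.append(TOOLS[name](argument))
        except ValueError as exc:
            # Record the failure so the next decision can see it.
            history.append(f"tool error: {exc}")
    # The loop always ends: running out of steps is an explicit outcome.
    return "needs human: step budget exhausted"


answer = run_agent(scripted_model)
```

The loop has three ways to stop (done, escalate, or budget exhausted), so a confused model cannot spin indefinitely or reach a tool it was never given.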
Why fire-and-forget is the wrong model
The mental model from the demo is a single prompt: ask, wait, receive. Real work does not fit that shape. A meaningful task (reconciling an order, triaging a ticket, drafting and sending a report) unfolds over many steps, and any step can fail. If the whole thing lives inside one request that either succeeds or vanishes, you have built something that cannot be trusted with anything that matters.
This is why agents need durable, cancellable execution. Durable means a run can pause, resume after a restart, retry a failed step and pick up exactly where it left off. Cancellable means a person can stop a run in flight without leaving things half-done. And the most consequential actions should sit behind a human-in-the-loop checkpoint: the agent prepares the work and a person approves it before money moves or an email goes out. That is not a lack of ambition; it is how you earn the right to automate more over time.
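A rough sketch of those three properties, under simplifying assumptions: the run state lives in an in-memory object, where a real system would persist it, and the names `RunState`, `Step` and `advance` are invented for the example. Each call to `advance` picks up from the last completed step, checks for cancellation between steps, and stops at any step that needs approval.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunState:
    """Everything needed to resume a run; a real system would persist this."""
    next_step: int = 0
    log: List[str] = field(default_factory=list)
    cancelled: bool = False
    approved: bool = False


@dataclass
class Step:
    name: str
    action: Callable[[], str]
    needs_approval: bool = False


def advance(steps: List[Step], state: RunState) -> str:
    """Run steps from the saved position and return the run's status."""
    while state.next_step < len(steps):
        if state.cancelled:
            return "cancelled"  # checked between steps, never mid-step
        step = steps[state.next_step]
        if step.needs_approval and not state.approved:
            return "waiting_for_approval"
        try:
            state.log.append(step.action())
        except Exception:
            return "failed_retryable"  # position unchanged, so retry is safe
        state.next_step += 1  # checkpoint only after the step succeeds
    return "done"


attempts = {"draft": 0}


def draft_report() -> str:
    """Hypothetical step that fails once, then succeeds."""
    attempts["draft"] += 1
    if attempts["draft"] == 1:
        raise ConnectionError("transient failure")
    return "report drafted"


steps = [
    Step("gather", lambda: "figures gathered"),
    Step("draft", draft_report),
    Step("send", lambda: "email sent", needs_approval=True),
]

state = RunState()
statuses = [advance(steps, state)]      # draft fails on the first attempt
statuses.append(advance(steps, state))  # draft retried; stops before send
state.approved = True                   # a person signs off
statuses.append(advance(steps, state))  # send runs and the run completes
```

The gather step runs only once even though the run was entered three times, and nothing is sent until `approved` is set. A single approval flag for the whole run is a simplification; a fuller design would probably record approval per step.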
Eval-driven development
The single biggest difference between teams whose agents improve and teams whose agents drift is whether they measure. You cannot improve what you do not measure, and “it looked good when I tried it” is not measurement. Evals (a representative set of real tasks with checks on whether the agent got them right) turn vibes into evidence.
The order matters: build the evals before you scale, not after the first incident. With a suite in place, every prompt tweak, model swap or new tool can be judged against a stable baseline, so you ship changes that genuinely help and catch the ones that quietly make things worse. Without it you are tuning blind, and an agent that touches production is an expensive thing to tune blind.
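A minimal sketch of that baseline comparison, assuming a toy suite of four cases and two stand-in agents written as plain functions. The tasks, checks and agents are all invented for illustration; a real suite would draw on recorded production tasks and richer checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    task: str
    check: Callable[[str], bool]  # did the agent's output pass?


def pass_rate(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    passed = sum(1 for case in cases if case.check(agent(case.task)))
    return passed / len(cases)


CASES = [
    EvalCase("refund status for order 1042", lambda out: "1042" in out),
    EvalCase("cancel order 77", lambda out: "cancel" in out.lower()),
    EvalCase("what is your returns policy?", lambda out: "30 days" in out),
    EvalCase("track order 9", lambda out: "9" in out and "track" in out.lower()),
]


def baseline_agent(task: str) -> str:
    """Stand-in for the agent currently in production."""
    return f"Handled: {task}"


def candidate_agent(task: str) -> str:
    """Stand-in for a prompt tweak that shortens replies and drops detail."""
    return "Handled your request."


def safe_to_ship(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """Block any change that scores below the baseline, minus a tolerance."""
    return candidate >= baseline - tolerance


baseline_score = pass_rate(baseline_agent, CASES)    # 0.75
candidate_score = pass_rate(candidate_agent, CASES)  # 0.0
ship = safe_to_ship(baseline_score, candidate_score)  # False
```

The candidate's shorter replies might look tidier in a quick manual try, yet it fails every check in the suite, and that kind of quiet regression is what the baseline is there to catch.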
How Inteeka builds
Our approach follows the shape. We start by scoping the job-to-be-done: one clearly defined task with a measurable outcome, rather than a vague aspiration to “add AI”. We instrument it with evals early, so we know what good looks like before we widen the remit. We deploy on Vercel’s agentic infrastructure, building with the AI SDK and frontier models like Anthropic’s Claude, so durable execution, retries and sandboxing are part of the foundation rather than an afterthought. And we monitor in production, watching cost, latency and quality, because an agent is a living system that needs tending, not a feature you ship once.
The takeaway
The lesson from Vercel Ship London is an encouraging one. The path from impressive demo to dependable system is no longer a mystery. It has a recognisable shape, and the infrastructure to support it has matured. The work is real, but it is known work: scope tightly, give the agent the right tools and limits, run it on durable infrastructure, and measure relentlessly. Do that, and an agent stops being a party trick and starts being a colleague.