22 APR 2026 · 11 MIN READ
Agentic AI · LLMs · MCP · Evals · Production

Practical vs Theoretical Agentic Systems

Most production wins called "agents" are workflows with one tool-calling node. A field report on what actually ships, where the loops break, and why MCP — not the planner — was the real unlock.

NIRBHAY AGRAWAL · 22 APR 2026

For most of 2023 and 2024, the word “agent” meant whatever the speaker wanted it to mean. Simon Willison got tired enough of this that he crowdsourced 211 working definitions and asked Gemini to cluster them. He eventually settled on something workable: an LLM that runs tools in a loop to achieve a goal. That’s the version most working engineers use now. It’s also the version that, when you actually apply it, makes a lot of so-called production agents look like something else.

I’ve put a few of these into production. There’s an HTSUS classifier I built last year that labels 900,000 line items overnight at roughly 96% accuracy. There’s a domain chatbot on a commerce funnel that pushed conversion from 27.6% to 32.1%, and a PR-review pipeline that uses retrieval to flag drift in code contributions. The marketing copy on all three calls them agents. By Willison’s definition, one of them barely qualifies. The other two are workflows with a tool-calling LLM at one step, and honestly, that’s the version that ships.

The definition was the bug

The most useful piece of writing on this is Anthropic’s Building Effective Agents from December 2024. The line that did the most damage to the ecosystem: “The most successful implementations weren’t using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.” Anthropic splits the world in two. Workflows are predefined code paths with an LLM at one of the steps. Agents direct their own processes dynamically. The implicit punch line, which most readers skipped past, is that almost everything you should build is a workflow. Workflows look boring in a demo and behave well in production. Agents demo great, then surprise you on the bill.

The reason matters. The second half of the loop, where the model decides what to do next, is where your costs and failure modes live. If you can write that loop in code, you keep nearly all the controllability you’d otherwise lose to the planner.
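To make the distinction concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `call_llm` stands in for whatever completion client you use, and the tool functions are placeholders. The shape is the point. In the workflow, code decides the path; in the agent, the model does, inside a loop you can only bound, not read.

```python
def call_llm(prompt: str) -> str: ...        # hypothetical: returns the model's text
def lookup_invoice(ticket: str) -> str: ...  # hypothetical: deterministic code, no model
def draft_reply(ticket: str) -> str: ...     # hypothetical: another fixed step

def workflow(ticket: str) -> str:
    """Workflow: a predefined code path with an LLM at one of the steps."""
    category = call_llm(f"Classify as billing/tech/other, one word:\n{ticket}")
    return lookup_invoice(ticket) if category.strip() == "billing" else draft_reply(ticket)

def agent(goal: str, tools: dict, max_steps: int = 10) -> str:
    """Agent: the model decides what happens next; code only bounds the loop."""
    history = [goal]
    for _ in range(max_steps):
        action = call_llm("Pick a tool as `name: input`, or say FINISH:\n" + "\n".join(history))
        if action.startswith("FINISH"):
            return action
        name, arg = action.split(":", 1)
        history.append(f"{name} -> {tools[name](arg.strip())}")  # result feeds the next decision
    return "step budget exhausted"
```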

What actually ships

Eugene Yan keeps an old but well-maintained list called Patterns for Building LLM-based Systems & Products. Seven entries: evals, RAG, fine-tuning, caching, guardrails, defensive UX, feedback collection. None of them is “build an agent.” If you trust Yan’s production taste, and most of us do, that’s a real signal about what carries weight in production.

The conversion-lift chatbot is the cleanest example I have. It doesn’t plan. It does retrieval over a structured product catalogue plus a stash of policy and tone documents, behaves like a sales associate (because the prompt told it to), validates output against a schema, and gets graded against a fixture set of golden conversations. Six of Yan’s seven patterns, in code paths I can read top to bottom in an afternoon. We tried the agent-shaped version. It was harder to debug, three times the token budget, and never beat the workflow on conversion. I’ve stopped pitching it.
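A sketch of that shape, under loud assumptions: `retrieve`, `call_llm`, `judge`, and `SALES_PROMPT` are hypothetical stand-ins, not the real pipeline. What matters is that every step is plain code, the output passes through a schema gate, and the whole thing is graded against fixtures.

```python
from pydantic import BaseModel, ValidationError

class Reply(BaseModel):
    message: str
    recommended_sku: str | None = None   # structured output the funnel can act on

def answer(question: str) -> Reply:
    docs = retrieve(question, k=8)                                   # catalogue + policy/tone docs
    raw = call_llm(SALES_PROMPT.format(docs="\n".join(docs), question=question))
    try:
        return Reply.model_validate_json(raw)                        # schema gate, not vibes
    except ValidationError:
        return Reply(message="Let me check with a teammate and get back to you.")

def grade(goldens: list[dict]) -> float:
    """Fraction of golden conversations the current build still passes."""
    return sum(judge(answer(g["question"]), g["expected"]) for g in goldens) / len(goldens)
```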

The HTSUS classifier is closer to an agent. It does actually iterate: query a tariff index, evaluate candidate codes against rules, ask a smaller model for a tiebreak when confidence dips, emit a code with a justification. Even there, the loop is bounded at three steps, and the orchestrator is a switch statement, not a planning prompt. A real plan-and-execute architecture would have cost us two extra months of calibration and probably 4x the tokens. We chose not to find out for sure.
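Roughly what that looks like, with hypothetical stand-ins for the tariff index, the rule scorer, and the tiebreak model. The point is that the control flow is legible: a match statement and a hard step cap, not a planning prompt.

```python
MAX_STEPS = 3   # the loop is bounded in code, not by the model's judgment

def classify(line_item: str) -> dict:
    state, scored = "search", []
    for _ in range(MAX_STEPS):
        match state:                                   # the orchestrator is literally a switch
            case "search":
                candidates = tariff_index.search(line_item, k=5)
                state = "evaluate"
            case "evaluate":
                scored = score_against_rules(line_item, candidates)   # one LLM call
                if scored[0].confidence >= 0.9:
                    break                                             # confident enough, emit
                state = "tiebreak"
            case "tiebreak":
                scored = small_model_tiebreak(line_item, scored[:2])  # cheaper model breaks the tie
                break
    best = scored[0]
    return {"hts_code": best.code, "justification": best.rationale}
```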

The multi-agent schism

On June 12, 2025, Cognition published Don’t Build Multi-Agents. The argument: parallel sub-agents diverge because each one makes implicit decisions the others can’t see. Their illustrative example is a Flappy Bird clone where one sub-agent rendered a Mario-style background while another rendered a bird that looked like it came from a different game. Neither knew what the other had chosen. Their recommendation was single-threaded linear agents with the model compressing context between steps.

Twenty-four hours later, Anthropic published How we built our multi-agent research system. Their orchestrator-with-subagents pattern beat single-agent by 90.2% on an internal research eval. It also used about 15x the tokens of a chat session, and 4x a single-task agent. They were honest about it. “Tasks must have a value high enough to pay for the increased performance.”

People treat these two posts as opposing arguments. They’re not. They’re saying the same thing from different sides. Multi-agent is a token-budget decision dressed up as an architecture decision. If you can’t justify a 15x spend to your CFO, you don’t have a multi-agent problem. You have a single-agent calibration problem.

MCP quietly won the year

The biggest agent-architecture event of the last two years wasn’t a smarter planner. It was MCP, released by Anthropic in November 2024. By late 2025 it was at roughly 97 million monthly SDK downloads and 10,000+ active servers, and the protocol was donated to the Linux Foundation’s Agentic AI Foundation. Every major model provider now speaks it.

MCP didn’t make planners smarter. It made tools coherent. The actual bottleneck for production agents, in my experience, was almost never the reasoning step. It was that every team rolled their own tool surface, half OpenAPI, half scraping utility, half internal RPC nobody documented, and the model burned a third of its budget just reverse-engineering the contract. With a stable protocol you stop paying that tax. The same agent that couldn’t close a flight booking in 2024 closes it in 2025, not because it got smarter, but because the booking tool can describe itself in a way the model can’t misread.
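For anyone who hasn't looked at the protocol, this is roughly what "describing itself" means. A minimal sketch using the MCP Python SDK's FastMCP helper; the booking backend (`fare_api`) is a hypothetical stand-in. The tool's name, typed signature, and docstring become a schema any MCP client can discover, which is the part the model used to have to guess.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("flight-booking")

@mcp.tool()
def search_flights(origin: str, destination: str, date: str) -> list[dict]:
    """Return candidate flights between two IATA airports on a YYYY-MM-DD date."""
    return fare_api.search(origin, destination, date)   # hypothetical backend

@mcp.tool()
def book_flight(flight_id: str, passenger_name: str) -> dict:
    """Book a specific flight and return the confirmation record."""
    return fare_api.book(flight_id, passenger_name)     # hypothetical backend

if __name__ == "__main__":
    mcp.run()   # any MCP-speaking client can now list and call these tools
```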

This is, weirdly, Pat Helland’s argument from twenty years ago. Data on the outside is immutable, versioned, in the past. Tools in MCP behave like data on the outside. They have schemas, versions, and stable identities. That, more than any chain-of-thought trick, is why agents in 2026 work in places they didn’t in 2024.

Where it breaks

Cognition’s SWE-bench breakdown is still the cleanest artifact I’ve seen of where reliability falls off. Of Devin’s failed tasks, 95 required changes to more than one file and 230 required more than fifteen lines of code. The medians on successful tasks were eleven lines and one file. The full data’s in their public results repo, and the 2025 update has the merge rate climbing from 34% to 67%, which is real progress. But the cliff is real too: agent reliability collapses at the same place a human reviewer’s attention does, somewhere past fifteen lines or two files.

The other failure modes are less photogenic. Planner drift, where multi-step plans degrade because the prompt becomes a transcript of every previous step. Hallucinated tool calls, where the model invents an endpoint, gets a 404, decides the failure was transient, and retries. Context-window blowups, where a workflow that fit in 16k at design time hits the 128k cap in production because real users are long-winded. Because each turn resends the whole transcript, cumulative prompt cost grows roughly quadratically with turn count, so “just truncate” isn’t the answer it sounds like.
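The hallucinated-endpoint loop in particular has a cheap, boring mitigation: validate the model's tool call against the declared registry before executing anything, and return an error that names the real problem instead of one that looks transient. A sketch, with the registry as a plain dict of callables:

```python
def dispatch(tool_call: dict, registry: dict) -> str:
    """Execute a model-proposed tool call only if the tool actually exists."""
    name, args = tool_call.get("name"), tool_call.get("arguments", {})
    if name not in registry:
        # A 404-style reply reads as transient and invites a retry; name the real failure.
        return f"error: no tool named {name!r}. Available: {', '.join(sorted(registry))}"
    try:
        return registry[name](**args)
    except TypeError as exc:                  # invented or missing parameters
        return f"error: bad arguments for {name!r}: {exc}"
```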

And the one that kills more agents than anything else: eval debt. Shankar et al. published a paper called Who Validates the Validators? that names the problem nicely. Criteria drift. Users can’t define what makes an output good until they’ve graded a few, but the grading depends on the criteria, so the two co-evolve. The eval suite that justified the demo is already wrong by the time real traffic hits. Hamel Husain has been writing the canonical version of this argument for two years; his Your AI Product Needs Evals is the post I keep linking people to. The HTSUS classifier worked because we spent three weeks building the eval harness before we spent two weeks on the classifier itself. We didn’t plan that ratio. It just happened to be what the work demanded.
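The harness itself was unglamorous. Something like the sketch below, where the fixture format and the `classify` call are illustrative rather than the real suite: goldens in a file, misses printed so a human can reread them (and revise the criteria, because they will drift), and a hard floor the build has to clear before anything ships.

```python
import json

def run_evals(fixture_path: str, threshold: float = 0.96) -> None:
    goldens = [json.loads(line) for line in open(fixture_path)]      # one golden case per line
    failures = []
    for case in goldens:
        got = classify(case["line_item"])["hts_code"]
        if got != case["expected_code"]:
            failures.append((case["line_item"], case["expected_code"], got))
    accuracy = 1 - len(failures) / len(goldens)
    for item, want, got in failures[:20]:     # read the misses; this is where criteria drift shows up
        print(f"MISS {item!r}: wanted {want}, got {got}")
    print(f"accuracy {accuracy:.3f} over {len(goldens)} goldens")
    assert accuracy >= threshold, "below the floor; do not ship"
```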

The bill comes due

The senior-engineer instinct that gets vindicated in this space is cost. A single-completion price quote is almost always misleading for an agent. A real agent run is a hundred completions wide and ten deep. Anthropic’s 15x and 4x numbers are the floor for systems that are well-built. The Stevens Online piece on the hidden economics of AI agents walks through a $50/month POC that becomes $2.5M/month at production volume. The math isn’t exotic. It’s quadratic context growth, multiplied by per-completion overhead, multiplied by real users using the thing in ways you didn’t test for.
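The shape of that multiplication is easy to sanity-check yourself. The numbers below are made up (not the Stevens figures, not any particular vendor's pricing), but the structure is the one that bites: per-turn context growth, times completions per turn, times sessions per month.

```python
price_per_1k_tokens  = 0.01    # assumed blended price, USD; substitute your own
tokens_first_turn    = 2_000   # system prompt + retrieved context
growth_per_turn      = 1_500   # transcript each later turn has to resend
turns_per_session    = 10
completions_per_turn = 5       # tool calls, retries, validators

tokens_per_session = sum(
    (tokens_first_turn + growth_per_turn * turn) * completions_per_turn
    for turn in range(turns_per_session)
)   # resending the transcript makes this grow quadratically with turn count

for sessions_per_month in (10, 10_000, 1_000_000):
    cost = tokens_per_session / 1_000 * price_per_1k_tokens * sessions_per_month
    print(f"{sessions_per_month:>9,} sessions/month -> ${cost:,.0f}/month")
```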

So the honest version of “agents are getting better” looks something like this. Models got smarter. MCP made tools coherent. The patterns that ship are mostly workflows with one tool-calling node. The cost curve still bites every team that doesn’t plan for it. If a workflow gets you eighty percent of the result at ten percent of the spend, you don’t have an agent problem. You have a procurement problem.

Build workflows. Promote one to an agent only when you can defend the loop in writing, defend the eval suite to a skeptic, and defend the bill to whoever pays it. Until then, “agentic” is a marketing word, not an architecture.