Operators trying AI for the first time tend to follow the same arc. They hear about the technology. They watch demos that look magical. They ask their team to wire up an agent for a workflow they actually run. The agent ships. The agent works on the test cases. The agent is celebrated.
Then the agent meets the real work. It hallucinates a customer's account number. It writes a confident summary of a document it never read. It triggers an email to a vendor based on a comment in a Slack thread it misread. It books a meeting on a holiday. It cites a policy that doesn't exist.
Within ninety days, the operator quietly turns the agent off and the team goes back to the spreadsheet.
This is not a story about model quality. The model is fine. The story is about everything around the model that an operating business needs in place before an agent can be trusted with real work — and that almost no one outside a narrow specialist community knows how to build.
The 97.5% number
The figure that gets cited most often comes from product leader Nate B. Jones, who runs the daily AI strategy show AI News & Strategy Daily. His framing on the agent failure rate is blunt:
"Your AI agent fails 97.5% of real work. The fix isn't coding."
— Nate B. Jones, AI News & Strategy Daily

The number is provocative on purpose. Different benchmarks land in different places — some agent benchmarks now hit 30 to 40 percent task completion on long-horizon work, others hover near zero on multi-step organizational tasks. The point is not the precise digit. The point is the order of magnitude. If a benchmark designed for agents finds them failing the majority of tasks, an untrusted agent inside your operation will do worse. Your operation is harder than the benchmark, because your operation has context the benchmark doesn't model.
That context is exactly the missing thing.
What "context infrastructure" actually means
When an experienced operator handles a customer ticket, they aren't just reading the words. They're applying:
- Knowledge of the customer's account history with this company
- Understanding of which exceptions the company allows and which it doesn't
- Awareness of which other team members have touched this account recently
- Domain knowledge about how problems in this product category usually resolve
- An internal sense of when a situation requires escalation
- The current week's promotional rules, vendor outage status, shipping delays
None of that is in the customer's message. None of it is in the language model. All of it lives in the operator's head, in the company's institutional memory, and in scattered systems that no one has structured for retrieval.
An agent missing this context will, by default, fabricate it. That is not a bug — it is the model doing exactly what it was trained to do, which is produce plausible language. Plausible without grounding is hallucination. The fix is not a better model. The fix is to build the system that gives the agent the context the experienced operator has in their head.
Four things every reliable deployment has
The teams whose agents do work in production have, almost without exception, installed four pieces of infrastructure before they trust the agent with anything:
One: a structured retrieval layer. The customer history, account state, policy text, and current operational status are all queryable by the agent in real time, in a structured format the model can rely on. Not "we feed it our wiki." A purpose-built retrieval surface scoped to the workflow.
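A minimal sketch of what "purpose-built retrieval surface scoped to the workflow" can mean in practice. Everything here is illustrative — the store names (`tickets`, `crm`, `policies`, `ops`) and the `TicketContext` schema are assumptions, not a prescribed design. The point the sketch makes is that every field the agent sees is fetched from a system of record, not left for the model to guess:

```python
from dataclasses import dataclass

@dataclass
class TicketContext:
    account_history: list   # recent orders and past tickets for this account
    active_policies: list   # policy text scoped to this ticket's category
    ops_status: dict        # current operational state, e.g. shipping delays

def build_context(ticket_id, stores):
    """Assemble structured, queryable context for one ticket.

    Hypothetical store layout: each key in `stores` is a system of record.
    Nothing in the returned context is generated by a model.
    """
    ticket = stores["tickets"][ticket_id]
    return TicketContext(
        account_history=stores["crm"].get(ticket["account"], []),
        active_policies=[p["text"] for p in stores["policies"]
                         if ticket["category"] in p["applies_to"]],
        ops_status=stores["ops"],
    )
```

The agent's prompt is then built from this structure, which is what makes its answers checkable: every claim in the output should trace back to a field in the context.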
Two: an evaluation harness that mirrors real work. A library of past examples — real tickets, real deals, real cases — with the correct answer recorded. Every change to the prompt, the model, or the retrieval layer is run against the harness before it touches production. This is the single thing teams skip most often, and skipping it is why their agent regresses silently.
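The harness itself can be very small. This sketch assumes `agent` is any callable that maps a case's input to an answer, and that each recorded case carries the correct answer from real historical work; the case schema and the baseline threshold are illustrative:

```python
def run_harness(agent, cases):
    """Replay every recorded case through the agent; report pass rate and misses."""
    failures = []
    for case in cases:
        got = agent(case["input"])
        if got != case["expected"]:
            failures.append({"id": case["id"], "got": got,
                             "expected": case["expected"]})
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures

def gate(agent, cases, baseline=0.95):
    """Block any prompt, model, or retrieval change that drops below baseline."""
    rate, failures = run_harness(agent, cases)
    return rate >= baseline, rate, failures
```

The discipline matters more than the code: `gate` runs on every change, before production, and a failing gate stops the deploy. That is what makes regressions loud instead of silent.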
Three: human-in-the-loop on the high-stakes path. The agent does the bulk work. A human reviews specific decisions before they go out. The review interface is fast enough that one human can review work that previously required ten people to produce.
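The routing decision that puts a human on the high-stakes path can be this simple. The action names below are hypothetical — the real set is whatever your workflow treats as irreversible or customer-facing:

```python
# Hypothetical high-stakes actions; the real list comes from the workflow mapping.
HIGH_STAKES = {"refund_over_limit", "vendor_email", "contract_change"}

def route(decision):
    """Ship routine agent output directly; hold high-stakes actions for review."""
    return "review_queue" if decision["action"] in HIGH_STAKES else "auto_send"
```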
Four: a feedback loop that converts corrections into evaluation data. When the human catches a mistake, the system records it as a new test case. The harness grows. The agent gets observably better month over month, which is the only way operators can trust expanding the agent's scope over time.
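One hedged sketch of the loop, reusing the same case schema as the harness above would use. The function and field names are assumptions; the mechanism is the point — a reviewer's correction is appended as a permanent regression test, so the same mistake can never ship silently again:

```python
def record_correction(harness_cases, case_input, agent_answer, human_answer):
    """Turn a reviewer's correction into a new harness case.

    If the human agreed with the agent, there is nothing to record;
    if they disagreed, the human's answer becomes ground truth.
    """
    if agent_answer != human_answer:
        harness_cases.append({
            "id": f"case-{len(harness_cases) + 1}",
            "input": case_input,
            "expected": human_answer,  # reviewer's answer is ground truth
            "source": "review-correction",
        })
    return harness_cases
```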
None of those four are tools you buy. They are systems you build — and the build skill is unevenly distributed.
Why solo deployment fails
The skills to build context infrastructure live in a relatively small community of practitioners who have shipped multiple AI deployments and learned which evaluation patterns actually catch regressions, which retrieval architectures hold up under real load, and which review interfaces operators actually use without complaining.
An operating business hiring its first AI engineer does not get this skill set. They get someone who can wire an API call — which is the easy part — and who has not yet been on the other side of three deployments that failed silently because the eval harness was thin. The cost of that learning curve, paid by your business, is the deployment that gets quietly turned off in ninety days.
"The humans who invest in contextual stewardship and evaluation design will become the most valuable people in their organizations."
— Nate B. Jones, AI News & Strategy Daily

Jones is right that contextual stewardship is the new high-leverage skill. He is also right that it is not a coding skill. It is a systems-design skill that requires having shipped, watched fail, debugged, and shipped again — the kind of judgment that doesn't transfer through documentation. The companies catching up to the leaders are the ones importing that judgment from people who already have it, not the ones trying to build it from scratch by trial and error.
What this looks like inside an operating business
The version of this that works for a small or mid-market operating business looks roughly like this. An external practitioner team comes in for four to twelve weeks. The first week is interviewing operators and mapping the workflow. The second week is building the retrieval layer for the most common cases. The third week is building the evaluation harness from real historical cases. By week four, an agent is doing the bulk work and a human is reviewing on the exceptions interface. By week eight, the harness has caught two or three regressions before they reached production, and the team trusts the system enough to expand its scope.
By week twelve, the practitioners leave. The runbooks, the eval harness, the retrieval code, the review interface, and the playbook for adding the next workflow all stay. The business now has the context infrastructure it could not have built alone — and a team that can extend it.
Build the infrastructure once. Compound on it forever.
We embed in your business for 4–12 weeks, build the context infrastructure your AI workflows need to actually work, and leave the runbooks and evaluation harness behind. Costs less than a senior hire. Compounds forever.
Sources & Further Reading
- Nate B. Jones, "Your AI Agent Fails 97.5% of Real Work. The Fix Isn't Coding." AI News & Strategy Daily (~Dec 2025). Source for the 97.5% framing and the "fix isn't coding" argument.
- Nate B. Jones, "The Real Problem With AI Agents Nobody's Talking About." YouTube (Apr 15, 2026). Source for the "memory wall" / contextual stewardship framing.
- S&P Global, 2024 enterprise AI survey. 42% of organizations abandoned the majority of their AI initiatives. Cited in Nate B. Jones, The AI Agent Playbook (Substack guide).
- 8bitconcepts internal engagement data, n=11 embedded deployments (2025–2026). Pre-engagement audits routinely find 0–1 of the four context-infrastructure components present.