Most companies think they have an AI agent problem. They have a process maturity problem.

Two years of working with engineering teams across manufacturing, healthcare, finance, and technology reveals a consistent pattern. The teams that are shipping reliable, production-grade agentic systems are not doing it because they have better models. They are doing it because they have moved through a specific engineering maturity progression that most teams have not discovered yet -- and, more importantly, do not know exists.

The teams that are struggling are stuck at the same level, prompting harder at the same problems, convinced that the next model release will close the gap. It will not. The gap is not in the models. It is in how the teams are using them.

3-5x -- development velocity improvement at L4-L5 vs. L1-L2 teams (internal observation, 2025)

L3 -- the level where most teams plateau, and where the highest-leverage intervention lives

6% -- of organizations reach systematic validation (L6), the rarest and most durable maturity level

The Six Levels

The maturity model below is descriptive, not prescriptive. Teams do not progress through it by following a roadmap. They progress by solving the real problems in front of them and discovering that each solution creates the foundation for the next. The model is useful because it tells you where you are, what the ceiling looks like from your current level, and what the next transition requires.

L1 Conversational Prompting

Most companies are here. They have bought access to a frontier model. They are having conversations with it. Results vary significantly based on how questions are phrased, which engineer wrote the prompt, how much context was included, and which direction the wind is blowing. This is the "prompt-craft lottery" phase: some inputs produce excellent results, similar inputs produce poor ones, and the team has no systematic explanation for the difference.

Teams at L1 are constantly surprised by inconsistency. The same prompt produces different outputs across sessions. Output quality correlates with individual skill at phrasing rather than with any reproducible system. The mental model at L1 is that the model is a conversational partner: you talk to it and it responds. The better you are at conversation, the better the results. This mental model is not wrong -- it is just insufficient for anything that needs to be reliable, repeatable, and maintainable at production scale.

L1 is adequate for personal productivity tasks where quality variance is acceptable and every use is one-off. It is not adequate for anything that runs repeatedly, touches external systems, or produces outputs that downstream processes depend on.

L2 Planning First

The L1-to-L2 transition is conceptual. It requires adopting a specific mental shift: the conversation with the model is the input to the work, not the work itself. Before invoking an agent, you write a structured specification -- clear task definition, explicit constraints, defined output format, success criteria. The specification is the work. The agent invocation is execution against a spec.

Teams at L2 see immediate quality improvements because they are giving the agent deterministic instructions rather than conversational hope. The output variance drops dramatically when the input is a structured specification rather than an open-ended request. Engineers who make this transition often describe it as "obvious in retrospect" -- the model is better at executing clear instructions than at inferring what you want from conversational fragments.

L2 is achievable by most teams within weeks of deciding to try. It does not require new tools, new infrastructure, or new team members. It requires a decision to invest 20 minutes in writing a proper specification before invoking the agent instead of iterating conversationally after the fact. Most teams that make this transition report that L1 starts to feel visibly wasteful -- the time spent in conversational iteration was always greater than the time spent on upfront specification.
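What "a structured specification" looks like in practice varies by team. One minimal sketch, assuming no particular agent framework (the `TaskSpec` name, fields, and example content are all illustrative), is a small dataclass that forces the four elements named above -- task, constraints, output format, success criteria -- to be written down before any agent is invoked:

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    """A structured spec, written before the agent is invoked."""
    task: str                     # clear task definition
    constraints: list[str]        # explicit constraints the agent must honor
    output_format: str            # defined output format
    success_criteria: list[str]   # how a reviewer decides the output is done

    def render(self) -> str:
        """Flatten the spec into a deterministic prompt string."""
        lines = [f"Task: {self.task}", "Constraints:"]
        lines += [f"- {c}" for c in self.constraints]
        lines += [f"Output format: {self.output_format}", "Success criteria:"]
        lines += [f"- {s}" for s in self.success_criteria]
        return "\n".join(lines)

spec = TaskSpec(
    task="Summarize the Q3 incident report for the on-call handoff",
    constraints=["Max 200 words", "No speculation beyond the report"],
    output_format="Bullet list",
    success_criteria=["Every open action item is listed",
                      "Root cause stated in one sentence"],
)
prompt = spec.render()
```

The point is not the dataclass; it is that the same spec produces the same prompt every time, which is what collapses the output variance described above.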

L3 Skills

L3 is where most teams plateau. The technical step is straightforward: encoding expertise into reusable, versioned, composable units rather than one-off prompt chains that live in a single engineer's head or a shared Notion document that nobody trusts. Skills are the difference between "ask Sarah how to write the prompt for this" and "here is the versioned, tested, documented skill for this task." One scales. The other does not.

The blocker at L3 is not technical. It is organizational. Building a skill library requires the team to agree on standards: what belongs in a skill, how skills are documented, how they are tested, how they are versioned, who can modify them and under what review process. That is a cultural shift. It requires someone to own skill quality -- not a committee, because committees diffuse accountability, but one person whose job it is to maintain the library's reliability and coverage.

Teams that plateau at L3 are typically stuck because they started building skills without the organizational infrastructure to maintain them. Skills get created, used once, and never updated when the underlying task changes. The library grows but becomes untrustworthy. Engineers stop using it. The pattern reverts to ad-hoc prompting. The fix is not technical: it is assigning a DRI, establishing a review process, and treating the skill library as a product with users and owners rather than a documentation project that nobody is responsible for.
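The "versioned, tested, documented skill" described above can be sketched as a small registry. This is an assumption-laden illustration, not any real library's API: the `Skill` fields, the `SkillLibrary` class, and the names used are all hypothetical, but the registry's one rule -- no silent overwrites, re-registration must bump the version -- is the technical core of keeping a library trustworthy:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    name: str
    version: str   # bumped only through the DRI's review process
    owner: str     # the named person accountable for this skill
    template: str  # the prompt template the skill encodes

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

class SkillLibrary:
    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        # Refuse silent overwrites: same name + same version is an error,
        # so every change to a skill is a visible, reviewable version bump.
        existing = self._skills.get(skill.name)
        if existing and existing.version == skill.version:
            raise ValueError(f"{skill.name}@{skill.version} already registered")
        self._skills[skill.name] = skill

    def get(self, name: str) -> Skill:
        return self._skills[name]

library = SkillLibrary()
library.register(Skill(
    name="release-notes",
    version="1.2.0",
    owner="sarah",
    template="Write release notes for {service} covering commits {commit_range}.",
))
prompt = library.get("release-notes").render(service="billing",
                                             commit_range="v1.4..v1.5")
```

Everything organizational in the paragraphs above -- the DRI, the review process, the bar for shipping -- lives outside this code; the code only makes violations of the process visible.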

L4 Agents as Infrastructure

At L4, agents are deployed, monitored, and maintained like any other production service. They have owners. They have on-call rotations. They have defined SLAs. Failure modes are documented before they are discovered in production. Rollback procedures exist and have been tested. This sounds like table stakes for any software system -- because it is. The surprising thing is how few AI deployments reach it.

The L3-to-L4 transition requires treating agents as infrastructure rather than features. A feature is owned by the team that built it. Infrastructure is owned by the organization. That shift changes how it is funded, how it is maintained, how failures are handled, and how improvements are prioritized. Features get deprecated. Infrastructure gets upgraded. The teams that succeed at L4 made this distinction explicitly, early -- usually after their first significant production incident, which forced the question of who owned the broken system and what the remediation process was.

L4 also requires standardizing the observability and monitoring that makes infrastructure manageable. Which metrics matter for agent quality? What constitutes a degraded state versus a normal variance? What are the alerting thresholds? These questions have to be answered before the incident, not during it. Teams at L4 have answered them. Teams below L4 typically discover the questions exist when they are already in an incident.
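Answering those questions before the incident can be as simple as writing the thresholds down in code. A minimal sketch, with entirely illustrative metric names and threshold values (real values have to come from a team's own baselines), of what "degraded versus normal variance" looks like once it is made explicit:

```python
from dataclasses import dataclass

@dataclass
class AgentHealth:
    success_rate: float     # fraction of runs passing validation
    p95_latency_s: float    # 95th-percentile end-to-end latency
    escalation_rate: float  # fraction of runs handed back to a human

# Illustrative thresholds, decided before any incident -- not during one.
THRESHOLDS = {
    "success_rate_min": 0.95,
    "p95_latency_s_max": 30.0,
    "escalation_rate_max": 0.10,
}

def classify(h: AgentHealth) -> str:
    """Map current metrics to an explicit health state."""
    breaches = []
    if h.success_rate < THRESHOLDS["success_rate_min"]:
        breaches.append("success_rate")
    if h.p95_latency_s > THRESHOLDS["p95_latency_s_max"]:
        breaches.append("p95_latency")
    if h.escalation_rate > THRESHOLDS["escalation_rate_max"]:
        breaches.append("escalation_rate")
    if not breaches:
        return "healthy"
    # One breached threshold is degraded; multiple breaches page someone.
    return "degraded" if len(breaches) == 1 else "alerting"
```

Whether one breach pages someone or merely files a ticket is a policy choice; the value of the sketch is that the choice is written down where the on-call rotation can read it.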

L5 Orchestration

A single context window is finite. A fleet of specialized agents is effectively infinite. L5 teams have decomposed complex tasks into parallel workstreams, built specialized agents that handle specific domains, and implemented a coordinating layer that assembles results into coherent outputs. The velocity improvement at L5 compared to L1 is significant -- tasks that took hours at L1 complete in minutes at L5, not primarily because the models are faster but because parallelism eliminates the sequential bottleneck of a single context window handling everything.

L5 requires both the technical and the organizational infrastructure that the earlier levels build. Without versioned skills (L3), parallel agent fleets cannot coordinate reliably -- each agent makes different decisions in equivalent situations. Without production-grade monitoring (L4), orchestration failures are opaque and hard to diagnose. L5 is not a shortcut to skip the earlier levels. It is the natural destination for teams that have worked through them.

The engineering pattern at L5 is decomposition: identifying which parts of a complex task are independent (can run in parallel), which are sequential (must wait for upstream outputs), and which require coordination (must reconcile conflicting outputs from multiple agents). That decomposition is domain-specific. It cannot be templated across industries. The teams that do it well have developed deep familiarity with both the task domain and the failure modes of agentic systems -- which only comes from time at L3 and L4.
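The three-way split above -- independent, sequential, coordinated -- maps naturally onto ordinary async code. A minimal sketch, in which `run_agent` is a stand-in for a real agent call (all agent names and the pipeline shape are invented for illustration):

```python
import asyncio

async def run_agent(name: str, upstream: str = "") -> str:
    """Stand-in for a specialized agent invocation; replace with a real client."""
    await asyncio.sleep(0)  # simulates I/O-bound model latency
    return f"{name}({upstream})" if upstream else f"{name}()"

async def pipeline(doc: str) -> str:
    # Independent subtasks fan out in parallel (no shared inputs)...
    extract, classify = await asyncio.gather(
        run_agent("extract", doc),
        run_agent("classify", doc),
    )
    # ...a coordination step reconciles their possibly conflicting outputs...
    merged = await run_agent("reconcile", f"{extract}+{classify}")
    # ...and a sequential step must wait on the reconciled result.
    return await run_agent("summarize", merged)

result = asyncio.run(pipeline("q3-report"))
```

The hard part, as the paragraph above says, is not the `asyncio.gather` call; it is knowing, for a specific domain, which subtasks actually are independent and which conflicts the reconcile step must be able to resolve.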

L6 Systematic Validation

L6 is the rarest maturity level. Every agent output is evaluated against a defined quality benchmark before it reaches the downstream consumer or triggers a downstream action. This is not periodic sampling. It is systematic: every output, every time, evaluated against criteria that were defined precisely enough to be benchmarkable.

L6 is rare not because it is technically hard but because it requires something culturally difficult: treating agentic engineering as a discipline with standards. Standards require defining what "correct" means precisely enough that a benchmark can evaluate it. Most teams have not done this work because it is uncomfortable -- it forces explicit answers to questions like "what does good actually look like, and how would we know if we were not getting it?" These questions are harder to answer than they sound, especially for novel tasks where human judgment has been the de facto quality standard.

The teams that reach L6 typically arrive there after a production failure that made the absence of systematic validation very expensive. The failure reveals that they had implicit quality standards they had never made explicit, and that the implicit standards were not reliably enforced. L6 is the response to that recognition. The goal is to make quality enforcement explicit, automated, and non-negotiable -- not as a constraint on velocity, but as the infrastructure that makes sustainable velocity possible.
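"Every output, every time, evaluated against benchmarkable criteria" reduces to a gate that sits between the agent and everything downstream. A minimal sketch -- the three checks here are deliberately trivial placeholders for a team's real, domain-specific criteria:

```python
from typing import Callable

Check = Callable[[str], bool]

# Each check is an explicit, automatable answer to "what does good look
# like?" These three are illustrative stand-ins for real quality criteria.
CHECKS: dict[str, Check] = {
    "non_empty": lambda out: bool(out.strip()),
    "under_limit": lambda out: len(out.split()) <= 200,
    "no_placeholder": lambda out: "TODO" not in out,
}

def gate(output: str) -> tuple[bool, list[str]]:
    """Run every check on every output; return pass/fail plus failures."""
    failures = [name for name, check in CHECKS.items() if not check(output)]
    return (not failures, failures)

def deliver(output: str, downstream: Callable[[str], None]) -> None:
    # The gate is non-negotiable: nothing reaches the downstream consumer
    # or triggers a downstream action without passing validation.
    ok, failures = gate(output)
    if not ok:
        raise ValueError(f"output blocked by validation gate: {failures}")
    downstream(output)
```

The structural point is that `deliver` is the only path downstream, so the gate cannot be skipped under deadline pressure -- which is what "explicit, automated, and non-negotiable" means in practice.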

The Highest-Leverage Transition: L2 to L3

Every transition in this ladder matters, but the L2-to-L3 transition is where we see the most friction and the most value locked up. It is the moment where individual expertise has to become organizational expertise -- where "ask the expert" has to become "consult the library."

The reason this transition is hard is that it requires moving from implicit knowledge to explicit knowledge. The expert who writes great prompts has deep intuitions about what works. Those intuitions are not easily written down. They resist documentation. The first attempt to systematize them produces documentation that is too general to be useful. The second attempt produces something more specific but quickly goes stale. The third attempt, with a dedicated owner and a process for keeping skills current, starts to work.

Most teams that plateau at L3 are not failing at the technical work. They are failing at the organizational work. They built skills without building skill ownership. The fix is always the same: one named person, clear accountability, a review process, and the authority to keep the library trustworthy.

The organizations that clear this transition tend to do it the same way: they appoint a single DRI for skill quality, not as a full-time role initially but as an explicit accountability. That person owns the skill library the way a platform team owns a platform -- with standards, with a review process, and with the authority to say "this skill doesn't meet the bar and won't ship until it does."

What the Velocity Gap Actually Looks Like

Teams operating at L4-L5 are seeing 3-5x improvements in development velocity relative to their L1-L2 starting point. This is consistent with GitHub's analysis of Copilot adoption at scale: the teams seeing the largest gains are not the ones who adopted AI tools earliest. They are the ones whose organizations adapted to support the tools -- which maps directly to the maturity progression described here.

The velocity improvement is not primarily from faster task completion. It is from elimination of rework, faster debugging through observability, parallel execution that eliminates sequential bottlenecks, and systematic quality gates that catch problems before they reach production. The model is not faster at L5 than at L1. The system around the model is more efficient.

This distinction matters for setting expectations. Teams that expect a 5x velocity improvement from better prompting alone will be disappointed. Teams that expect it from moving through the maturity ladder -- with the organizational investment that requires -- will find it achievable. The difference is where the effort goes: into the model or into the system around the model.

Level | Key Capability              | Primary Blocker                               | Typical Timeline
L1    | Conversational prompting    | Inconsistent results, prompt-dependency       | Day 1
L2    | Structured specifications   | Mental model shift; habit change              | 1-4 weeks
L3    | Versioned skill library     | Organizational discipline; DRI assignment     | 1-3 months
L4    | Production-grade monitoring | Treating agents as infrastructure             | 2-4 months
L5    | Parallel orchestration      | Task decomposition expertise                  | 4-8 months
L6    | Systematic validation       | Defining "correct" precisely; benchmark build | 6-12 months

The Compounding Problem

Teams at L4-L5 are not just more productive today. They are compounding. Every skill they add to the library accelerates future work. Every observability investment makes the next incident cheaper to resolve. Every evaluation benchmark they build makes the next quality improvement faster to validate and safer to deploy. The infrastructure accumulates.

Teams at L1 are not compounding. They are accumulating prompts -- in Notion documents, in Slack threads, in engineers' heads -- that do not transfer, do not compose, and do not survive team turnover. The gap between these two trajectories is not linear. It is exponential. The teams that are 12 months ahead on this maturity curve are not 12 months ahead on outcomes. They are further ahead than that, and the gap is growing.

The good news is that the transition from L1 to L3 -- the highest-leverage segment of the ladder -- does not require new tooling, new headcount, or a significant capital investment. It requires a decision: to treat agentic engineering as a discipline with standards rather than a practice with enthusiasts. That decision is free. The organizational effort to execute it is not free. But it is bounded, achievable, and within reach of most engineering organizations that have already committed to AI as a capability.