Agentic businesses are usually described by the visible work: an agent writes a message, generates a report, ships a change, answers a customer, scores a lead, or sends some artifact into the world.
That is not the business. That is the output surface.
A durable agentic business is a closed-loop operating system. The agent acts, the system observes the result, the artifact is scored against an explicit standard, the score changes future behavior, and failures become new tests, rules, monitors, or constraints.
The key question is not, "Can an agent do this task?"
The key question is, "Can the system tell whether the agent did the task well, and can it make the next attempt better without waiting for a human to notice?"
That claim is not a single-paper conclusion. It is a synthesis from several research lines:
- Human preference learning showed that useful behavior can be learned from comparative feedback, even when the reward function is hard to write down directly [1].
- RLHF and instruction-following work showed that model behavior improves when systems collect demonstrations, rankings, and preference signals tied to user intent [2].
- Constitutional AI showed that AI systems can critique, revise, and preference-rank outputs against a written rule set, reducing dependence on direct human labels for every case [3].
- Reflexion and Self-Refine showed that agents and LLMs can improve outputs through explicit feedback, revision, and memory at inference time, without necessarily changing model weights [4][5].
- LLM-as-judge research showed that model-based evaluators can approximate human preference in some settings, but also carry position bias, verbosity bias, self-enhancement bias, and limited ability to grade reasoning-heavy answers [7][8].
- Agent benchmarks such as AgentBench, WebArena, SWE-bench, and TheAgentCompany show that real agent performance needs to be measured in interactive, tool-using, long-horizon environments, not only static prompt benchmarks [10][11][12][14].
- Evaluation infrastructure work, including OpenAI Evals and the NIST AI Risk Management Framework, reinforces the need for repeatable evals, evidence, governance, measurement, and management around AI systems [15][16].
The business implication is direct: an agentic company cannot rely on one-shot generation. It needs an artifact-level feedback system.
1. The open-loop failure mode
An open-loop agent receives an instruction, produces output, and moves on. It may appear productive at first. It sends more emails. It produces more reports. It answers more support tickets. It publishes more content. It opens more pull requests.
The problem is that volume hides drift.
An email can be grammatically clean and still be bad. A report can be long and still be useless. A support reply can answer the literal question while missing the customer's real issue. A code change can pass the happy path and still break an edge case. A research memo can cite sources and still fail to change a decision. An agent can complete the assigned action while damaging trust, wasting budget, creating operational risk, or teaching itself the wrong pattern.
Open-loop systems optimize for completion because completion is the easiest thing to measure. The message was sent. The artifact was created. The job was marked done.
Completion is not quality.
The agent research literature points in the same direction. WebArena found that even a GPT-4-based web agent reached only 14.41% end-to-end task success on realistic web tasks, far below human performance at 78.24% [11]. TheAgentCompany, which simulates knowledge work inside a small software company, reported that its most competitive baseline agent completed 24% of tasks autonomously [14]. SWE-bench found that real-world software issues require multi-file reasoning, execution, and long-context coordination that go beyond traditional code generation [12].
The lesson is not that agents are useless. The lesson is that real environments expose failures that simple completion metrics miss.
Agentic companies do not fail only because agents make mistakes. They fail because mistakes do not become structure.
2. What the research establishes
Preference feedback matters when goals are hard to specify
Many business outputs are hard to score with a simple rule. A sales email, support answer, research memo, product decision, or implementation plan can be better or worse in ways that are obvious to an expert but difficult to encode as a deterministic reward function.
Preference-learning research addresses exactly that gap. Christiano et al. showed that reinforcement learning systems can learn complex behaviors from human preferences over trajectory segments, using feedback on less than one percent of agent interactions in their experiments [1]. Ouyang et al.'s InstructGPT work extended the pattern to language models: collect demonstrations, collect rankings of model outputs, train a reward model, then optimize the model against that reward signal [2].
For agentic businesses, the translation is straightforward. If "good" is difficult to write as a formula, the system still needs preference data. That data can come from humans, customers, evaluators, outcome metrics, or AI judges calibrated against known examples.
Self-critique can help, but only when attached to evaluation
Constitutional AI used a written set of principles to generate critiques, revisions, and AI preference labels, then trained models using those feedback signals [3]. Reflexion used verbal reflections stored in episodic memory so agents could improve on subsequent attempts [4]. Self-Refine used iterative feedback and refinement, reporting improvements across tasks such as dialogue response generation, code rewriting, constrained generation, and toxicity removal [5].
These papers support the core mechanism: feedback has to be represented in a form the next attempt can use.
The important caveat is that self-critique is not magic. A model judging its own work can reinforce its own blind spots. LLM-as-judge research finds useful agreement with human preference in some open-ended settings, but also identifies position bias, verbosity bias, self-enhancement bias, and limited reasoning ability [7]. G-Eval similarly shows promise for LLM-based evaluation while flagging bias toward LLM-generated text [8].
The practical answer is role separation and calibration. The maker agent should not be the only judge. The system needs deterministic checks, model reviewers, outcome data, human calibration samples, and audit trails.
Agent evaluation has to include cost, robustness, and real-world fit
AI Agents That Matter argues that agent benchmarks over-focus on accuracy and under-measure cost, reproducibility, holdout quality, and downstream usefulness [13]. That critique matters for businesses. An agent that is 2% better but 10x more expensive may be worse. An agent that wins a benchmark by overfitting is not reliable. An agent that succeeds only in a narrow synthetic environment may fail in production.
Agentic businesses need scorecards that include:
- Task success.
- Cost.
- Latency.
- Risk.
- Reliability.
- Recoverability.
- Human escalation rate.
- Outcome impact.
- Evidence quality.
This is where a business scorecard diverges from a research leaderboard. The goal is not abstract intelligence. The goal is reliable work under business constraints.
Governance has to be built into the loop
NIST's AI Risk Management Framework centers on governance, mapping, measurement, and management of AI risks across design, development, deployment, and use [16]. That maps cleanly onto agentic businesses. The system has to know the intended use, measure performance and risk, manage failures, and maintain accountability.
This is not compliance theater. It is operating infrastructure.
When agents act externally, governance becomes product behavior: budget limits, duplicate locks, source requirements, approval thresholds, permission boundaries, rollback paths, audit logs, and live monitoring.
3. The operating model: act, score, learn, prevent
The minimum viable agentic business has five layers.
Action layer
Agents perform work: write, send, research, build, classify, route, sell, support, deploy, monitor, reconcile, or negotiate.
Artifact layer
Every action produces an inspectable artifact: a message, diff, report, invoice, lead score, API response, customer note, research memo, deployment log, rubric score, or decision record.
Measurement layer
The system evaluates the artifact against explicit standards. Some checks are deterministic. Some are model-graded. Some are human-calibrated. Some come from delayed outcome data.
Learning layer
The score changes the system. Strong outputs become exemplars. Weak outputs become improvement targets. Failures become tests, hooks, playbooks, memories, monitors, or policy changes.
Audit layer
The system keeps enough evidence to explain what happened: what was requested, which agent acted, what inputs were used, what artifact was produced, how it was scored, what shipped, what happened afterward, and what changed in the system.
The loop is:
Intent -> Agent action -> Artifact -> Evaluation -> Score -> Learning update -> Next action
The business becomes agentic when this loop runs by default.
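A minimal sketch of that loop, assuming Python and in-memory stores. The `run_agent`, `evaluate`, and `apply_learning` names are illustrative stand-ins rather than a prescribed design; the point is that every action yields an inspectable artifact, every artifact yields a score, and every score updates exemplars or rules before the next attempt.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    intent: str                      # what was requested
    content: str                     # what the agent produced
    score: float = 0.0               # filled in by the measurement layer
    notes: list[str] = field(default_factory=list)

def run_agent(intent: str, exemplars: list[str]) -> Artifact:
    """Action layer: produce an inspectable artifact (a real system would call a
    model here, steered by the exemplar library)."""
    return Artifact(intent=intent, content=f"[draft addressing: {intent}]")

def evaluate(artifact: Artifact, rules: list[str]) -> Artifact:
    """Measurement layer: score the artifact against explicit standards."""
    violations = [rule for rule in rules if rule in artifact.content.lower()]
    artifact.notes = [f"violated rule: {v}" for v in violations]
    artifact.score = 0.0 if violations else 80.0   # hard gates first, graded checks after
    return artifact

def apply_learning(artifact: Artifact, exemplars: list[str], rules: list[str]) -> None:
    """Learning layer: strong outputs become exemplars, failures become new rules."""
    if artifact.score >= 90:
        exemplars.append(artifact.content)
    elif artifact.score < 60:
        rules.extend(note.removeprefix("violated rule: ") for note in artifact.notes)

# Intent -> agent action -> artifact -> evaluation -> score -> learning update -> next action
exemplars: list[str] = []
rules: list[str] = ["guaranteed results"]          # an example forbidden claim
for intent in ["follow up with the trial account", "summarize last week's incident"]:
    artifact = evaluate(run_agent(intent, exemplars), rules)
    apply_learning(artifact, exemplars, rules)
    print(artifact.intent, artifact.score, artifact.notes)
```

In production the model call, the rule store, and the exemplar library would be real services, but the shape of the loop stays the same.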
4. Defining "good"
There is no universal definition of good agent output. Good depends on the artifact, audience, risk, and business outcome.
The mistake is looking for one master score. The right model is a scorecard tied to the artifact type.
Most agentic work can be measured across seven dimensions.
Goal completion
Did the artifact accomplish the job it was created for?
A support answer should resolve the issue. A code change should work in the target environment. A research memo should change a decision. A lead score should help prioritize a real buyer.
Completion should be measured against the intended outcome, not the agent's internal checklist.
Constraint adherence
Did the agent obey the rules of the business?
This includes budget, brand voice, legal constraints, privacy, security, customer promises, platform policies, and operational playbooks. Constraint failure can outweigh task success. A message that books a meeting by making a false claim is not a good output.
Factuality and evidence
Are the claims true, sourced, and current enough for the use case?
Verifiability research on generative search engines found that fluent, useful-looking answers can still contain unsupported statements and inaccurate citations [9]. Agentic businesses need source checks, citation precision, evidence links, and freshness requirements for factual claims.
Usefulness
Would the recipient or downstream system know what to do next?
Useful work reduces ambiguity. It advances the workflow. It does not merely sound reasonable.
Taste and fit
Does the artifact fit the audience, surface, timing, and business position?
This is where many agent systems underperform. The output is not wrong, but it feels generic, overlong, mistimed, or mispositioned. Taste can be measured through exemplars, rubric grading, recipient behavior, human calibration, and comparison against past high-performing artifacts.
Risk
What could go wrong because this artifact exists?
Risk includes security exposure, privacy leakage, hallucinated promises, reputational harm, customer confusion, duplicated outreach, broken production behavior, regulatory exposure, and irreversible external actions.
Business impact
Did the artifact move the metric it was supposed to move?
Sales messages are ultimately measured by replies, meetings, qualified opportunities, revenue, unsubscribes, complaints, and sender reputation. Support answers are measured by resolution, reopen rate, refund risk, and customer sentiment. Code changes are measured by uptime, conversion, latency, errors, usage, and maintenance cost.
The system should attach delayed outcomes back to the original artifact whenever possible.
5. Four evaluation channels
No single evaluator is enough. Agentic businesses need layered evaluation.
Deterministic checks
These are hard tests with clear pass/fail results:
- Does the JSON parse?
- Did the code compile?
- Did the endpoint return the expected response?
- Did the message include required compliance language?
- Did the agent make a forbidden claim?
- Did the artifact cite evidence for factual assertions?
- Did the deployment pass a live smoke test?
- Did the agent try to spend beyond budget?
- Did the agent send to a duplicate recipient?
Deterministic checks are the quality floor. They catch known failure classes cheaply and consistently.
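A few of these gates, sketched as plain functions. The forbidden phrases, budget cap, and suppression set below are assumed examples, not a canonical list; each check returns a pass/fail result plus a reason so a failure can be logged as evidence rather than silently dropped.

```python
import json

FORBIDDEN_CLAIMS = {"guaranteed roi", "risk-free", "we never fail"}   # assumed examples
BUDGET_LIMIT_USD = 50.0                                               # assumed spend cap
ALREADY_CONTACTED = {"jane@example.com"}                              # duplicate ledger

def check_json_parses(payload: str) -> tuple[bool, str]:
    try:
        json.loads(payload)
        return True, "valid JSON"
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"

def check_no_forbidden_claims(text: str) -> tuple[bool, str]:
    hits = [claim for claim in FORBIDDEN_CLAIMS if claim in text.lower()]
    return (not hits, "clean" if not hits else f"forbidden claims: {hits}")

def check_within_budget(spend_usd: float) -> tuple[bool, str]:
    return (spend_usd <= BUDGET_LIMIT_USD, f"spend {spend_usd} vs limit {BUDGET_LIMIT_USD}")

def check_not_duplicate(recipient: str) -> tuple[bool, str]:
    return (recipient not in ALREADY_CONTACTED, f"recipient {recipient}")

# The artifact ships only if every gate passes; any failed reason is logged as evidence.
gates = [
    check_json_parses('{"subject": "intro", "body": "hello"}'),
    check_no_forbidden_claims("We think this could cut onboarding time in half."),
    check_within_budget(12.0),
    check_not_duplicate("sam@example.org"),
]
print(all(ok for ok, _ in gates), [reason for ok, reason in gates if not ok])
```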
Model-graded checks
Some qualities require judgment: clarity, relevance, tone, specificity, completeness, strategic fit, or persuasiveness. A model reviewer can grade the artifact against a rubric, compare it to exemplars, or act as an adversarial critic.
LLM judges are useful when the rubric is explicit, the output format is constrained, and the judge is calibrated against known examples. They are weak when asked for vague scores without standards.
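One way to keep a judge rubric-bound is to fix the criteria, constrain the output format, and discard scores that fall outside it. A sketch under those assumptions; `call_model` is a stub standing in for any real model API, and the rubric dimensions are examples.

```python
import json

RUBRIC = {
    "clarity": "Is the point stated early and unambiguously?",
    "specificity": "Does it reference verified, recipient-specific facts?",
    "actionability": "Does the reader know exactly what to do next?",
}

def build_judge_prompt(artifact: str) -> str:
    criteria = "\n".join(f"- {name}: {question}" for name, question in RUBRIC.items())
    return (
        "Score the artifact from 1 to 5 on each criterion. "
        'Respond with JSON only, e.g. {"clarity": 3, "specificity": 4, "actionability": 2}.\n\n'
        f"Criteria:\n{criteria}\n\nArtifact:\n{artifact}"
    )

def call_model(prompt: str) -> str:
    """Stub standing in for a real model API call."""
    return '{"clarity": 4, "specificity": 2, "actionability": 4}'

def judge(artifact: str) -> dict[str, int] | None:
    raw = call_model(build_judge_prompt(artifact))
    try:
        scores = json.loads(raw)
    except json.JSONDecodeError:
        return None        # unconstrained output is treated as no score, not a low score
    if set(scores) != set(RUBRIC) or not all(1 <= v <= 5 for v in scores.values()):
        return None        # answers outside the rubric are rejected rather than trusted
    return scores

print(judge("Draft follow-up for the trial account, referencing their usage dip."))
```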
Outcome checks
The market is the strongest evaluator.
Did the recipient reply? Did the user convert? Did the customer stop asking? Did the bug recur? Did the post earn useful attention? Did the lead become revenue? Did the feature get used? Did the refund risk drop?
Outcome checks are slower, but they matter most. The system should connect outcome data back to the artifact that caused it.
Audit checks
Audit checks ask whether the process was trustworthy:
- Did the agent have the right context?
- Did it use the right tools?
- Did it verify before acting?
- Did it leave evidence?
- Did it update the right memory, test, ledger, or rule?
- Did another agent independently review high-risk work?
- Was claimed success backed by observed proof?
Audit checks prevent the business from confusing activity with control.
6. The scoring model
A useful score is not just a number. It is a decision.
Each artifact should end in one of five states:
- Blocked. The artifact violates a hard constraint and cannot ship.
- Needs repair. The artifact is directionally right but misses a required bar.
- Acceptable. The artifact can ship in low-risk contexts, but should not become an exemplar.
- Strong. The artifact is good enough to reuse as a positive pattern.
- Exceptional. The artifact sets a new standard and should raise the bar.
One practical threshold model:
- 90-100: Ship and save as exemplar.
- 75-89: Ship if risk is low; log improvement notes.
- 60-74: Repair before external use.
- 40-59: Fail; create targeted test or rule.
- 0-39: Fail; trigger incident review.
Different artifacts need different thresholds. A draft internal memo can tolerate a lower first pass. A production deploy, payment action, security-sensitive code change, or public customer claim needs a much higher gate.
The system should also separate confidence from quality. An 85 with low confidence is not the same as an 85 with strong evidence. Low confidence should trigger more verification, not automatic trust.
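Expressed as code, the bands above become a decision function rather than a bare number. The risk tiers and the 0.6 confidence cutoff are assumptions for illustration; the important behavior is that low confidence routes to more verification instead of automatic trust.

```python
def shipping_decision(score: float, confidence: float, risk: str) -> str:
    """Map a scored artifact to a decision, following the threshold bands above."""
    high_risk = risk in {"production_deploy", "payment", "public_claim"}   # assumed tiers
    if confidence < 0.6:
        return "verify"                  # low confidence triggers more checks, not trust
    if score >= 90:
        return "ship_and_save_exemplar"
    if score >= 75:
        return "repair" if high_risk else "ship_with_notes"
    if score >= 60:
        return "repair"
    if score >= 40:
        return "fail_create_test_or_rule"
    return "fail_incident_review"

print(shipping_decision(85, confidence=0.9, risk="outbound_email"))   # ship_with_notes
print(shipping_decision(85, confidence=0.4, risk="outbound_email"))   # verify
print(shipping_decision(85, confidence=0.9, risk="payment"))          # repair
```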
7. How the system knows good from bad
The system learns quality from five sources.
Explicit rules
These are non-negotiable constraints: never leak secrets, never invent customer commitments, never spend above threshold, never send duplicate outreach, never deploy without a smoke test, never claim success without evidence.
Rules define the floor.
Gold examples
The business needs a library of known-good artifacts: high-performing emails, strong support replies, clean incident reviews, well-scoped code changes, useful research memos, effective launch posts, and good customer handoffs.
Examples define taste better than abstract instructions.
Known failures
Failures are more valuable than generic guidance because they are specific. A bad output should be converted into a durable artifact: a regression test, rule, hook, evaluator case, prompt change, checklist item, or negative example.
Known failures define the boundary.
Outcome data
Outcome data tells the system which outputs actually worked. Open rates are a weak signal. Replies are stronger. Revenue, retention, production stability, and a low refund rate are stronger still.
Outcomes define reality.
Human correction
Human correction should not be the runtime dependency, but it is a high-quality calibration signal. When a human says, "This is wrong," the system should not just fix that artifact. It should identify the class of failure and create a defense against that class.
Corrections define new structure.
8. Turning average output into higher standards
The most important case is not catastrophic failure. It is middle-of-the-road output.
Average output is dangerous because it often passes. It is not wrong enough to trigger an incident, but it is not good enough to compound. If an agentic business ships enough average work, the brand becomes average, the product becomes noisy, and the system trains itself on mediocrity.
Middle output should trigger a different loop than failure.
Failure asks, "What must never happen again?"
Average asks, "What would have made this meaningfully better?"
The system should compare average artifacts against stronger examples and identify the missing bar:
- Was it too generic?
- Did it lack evidence?
- Did it miss the audience's real problem?
- Did it bury the point?
- Did it optimize for politeness instead of action?
- Did it complete the task but fail to create leverage?
- Did it use the right format for the wrong surface?
- Did it avoid risk so aggressively that it became toothless?
Then the system creates a new improvement target.
For outreach, that might become: every message must include one verified company-specific observation, one business consequence, and one clear ask.
For reports, it might become: every report must end with a decision, an owner, a next action, or a reason no action is needed.
For code agents, it might become: every deploy must include a rollback path and observed smoke-test result.
The bar moves when the system can name the difference between acceptable and strong.
9. Failure should leave a scar
In a mature agentic business, a failure is not closed when the immediate issue is fixed. It is closed when recurrence is harder.
Every meaningful failure should produce at least one durable change:
- A new deterministic test.
- A new evaluator case.
- A new forbidden pattern.
- A new checklist item.
- A new tool guard.
- A new memory rule.
- A new exemplar or anti-example.
- A new monitoring signal.
- A new deployment gate.
- A new human-escalation threshold.
The right question after a miss is not, "Why did the agent get this wrong?"
The right question is, "Why was the system able to ship this without noticing?"
That shifts the problem from blame to infrastructure.
10. Self-audit requires role separation
Agents need to audit their own work, but self-audit cannot mean asking the same agent whether it did a good job. That creates confirmation bias.
The maker agent produces the artifact. The reviewer agent evaluates it against the rubric. The critic agent looks for hidden failure modes. The monitor checks live outcomes. The memory process decides what should persist. The regression harness tests whether the failure can recur.
The audit trail should capture:
- Original request.
- Inputs used.
- Tools called.
- Artifact produced.
- Rubric version.
- Scores by dimension.
- Blocking failures.
- Reviewer notes.
- Shipping decision.
- Outcome metrics.
- Learning update applied.
- Follow-up checks scheduled.
This makes the system inspectable. It also gives future agents the context they need to avoid rediscovering the same lesson.
11. Evaluation by artifact type
Outbound message
Score on:
- Correct recipient and account.
- No duplicate send.
- Verified facts.
- Recipient fit.
- Specificity.
- Clear ask.
- Voice and tone.
- Policy compliance.
- Deliverability risk.
- Outcome: reply, unsubscribe, complaint, conversion.
Bad messages should update suppression lists, duplicate ledgers, voice examples, recipient targeting rules, and claim-verification checks.
Customer support answer
Score on:
- Correct diagnosis.
- Completeness.
- Customer-specific context.
- No invented policy.
- No unsupported refund, legal, or billing claim.
- Clarity.
- Resolution outcome.
- Reopen rate.
- Escalation appropriateness.
Bad answers should create new support macros, policy checks, retrieval improvements, and escalation triggers.
Code change
Score on:
- Tests passing.
- Target behavior verified.
- No regression in related paths.
- Security impact.
- Observability.
- Rollback path.
- Production smoke-test result.
- User-facing impact.
- Maintenance cost.
Bad changes should become regression tests, static checks, deploy gates, or architectural rules.
Research artifact
Score on:
- Clear thesis.
- Evidence quality.
- Source freshness.
- Decision usefulness.
- Originality.
- Counterargument treatment.
- Audience fit.
- Reusable insight.
Weak research should update source requirements, structure templates, and exemplar libraries.
Lead score or business decision
Score on:
- Input completeness.
- Signal quality.
- Calibration against past outcomes.
- Confidence.
- Reasoning trace.
- Cost of false positive.
- Cost of false negative.
- Actual conversion or loss outcome.
Bad scores should update weighting, feature definitions, disqualification rules, and confidence thresholds.
12. Memory as the operational immune system
Agent memory is not a dumping ground for everything that happened. It is the business's operational immune system.
Memory should store reusable lessons:
- Rules that prevent repeated mistakes.
- Project-specific invariants.
- Known failure modes.
- Current operating constraints.
- High-performing examples.
- Evaluation rubrics.
- Tool gotchas.
- Open loops that matter.
Memory becomes harmful when it stores stale facts, unverified claims, or one-off context as permanent truth. A self-improving system needs memory hygiene: source links, last-verified dates, conflict detection, pruning, and expiration.
The rule is simple: memory should make future action better, not merely make past action searchable.
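A sketch of that hygiene as a data problem, with assumed field names and an assumed 90-day default expiry: every entry carries a source, a last-verified date, and a rule for when it stops being trusted.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class MemoryEntry:
    lesson: str
    source: str                       # link or artifact id backing the claim
    last_verified: date
    expires_after_days: int = 90      # assumed default; hard invariants can live longer

    def is_stale(self, today: date) -> bool:
        return today - self.last_verified > timedelta(days=self.expires_after_days)

memory = [
    MemoryEntry("Never send pricing without the current rate card.", "incident-142", date(2025, 1, 10)),
    MemoryEntry("Customer X prefers async updates over calls.", "call-note-88", date(2024, 6, 2)),
]

today = date(2025, 3, 1)
active = [entry.lesson for entry in memory if not entry.is_stale(today)]
needs_reverification = [entry.lesson for entry in memory if entry.is_stale(today)]
print(active)
print(needs_reverification)
```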
13. Tests convert judgment into structure
Tests are how agentic businesses turn subjective lessons into durable infrastructure.
Useful test classes include:
- Unit tests for deterministic code paths.
- Regression tests for every important failure.
- Golden artifact tests comparing output against high-quality examples.
- Rubric tests where evaluators score known-good and known-bad samples.
- Adversarial tests that attempt to trigger unsafe behavior.
- Live smoke tests that verify the deployed path.
- Outcome tests that compare cohorts over time.
The most valuable tests are not always broad. Often the highest-yield test is narrow: one failure, one invariant, one exact thing that should never happen again.
Agentic systems should add small, sharp tests continuously instead of waiting for occasional large test-suite rewrites.
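One such narrow test, in the spirit of the regression class above. The `draft_outreach` stub and the specific invariant are assumptions; the pattern is a single guard against one exact failure that once shipped.

```python
def draft_outreach(account: dict) -> str:
    """Stub for the maker agent; a real system would generate this with a model."""
    return f"Hi {account['contact']}, following up on your trial of {account['product']}."

def test_no_invented_discount():
    """Regression guard from a past failure: the agent once promised a discount that
    did not exist. That exact failure must never ship again."""
    message = draft_outreach({"contact": "Dana", "product": "Acme Analytics"})
    assert "discount" not in message.lower()
    assert "% off" not in message.lower()

test_no_invented_discount()
print("regression guard holds")
```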
14. Feedback timing
Feedback arrives at different speeds.
Immediate feedback catches structural errors: invalid JSON, broken links, missing sources, policy violations, failed compiles, failed smoke tests.
Short-cycle feedback catches quality issues: reviewer score, recipient fit, tone mismatch, missing evidence, weak specificity.
Delayed feedback catches business truth: replies, conversion, churn, revenue, usage, complaints, refunds, production stability.
The system should use all three. Immediate checks prevent obvious damage. Short-cycle checks improve quality before shipping. Delayed checks tune strategy.
The mistake is waiting for delayed outcomes to catch preventable mistakes, or relying on immediate tests to prove business value.
15. Governance without human bottlenecks
Human review cannot be the core runtime model for an agentic business. It is too slow, too expensive, and too inconsistent. But removing humans does not mean removing governance.
Governance moves into systems:
- Explicit approval thresholds.
- Risk-based gating.
- Artifact ledgers.
- Immutable audit logs.
- Tool permissions.
- Budget controls.
- External-action locks.
- Evaluation rubrics.
- Monitoring and alerting.
- Rollback paths.
Humans should intervene for new high-consequence decisions, legal identity, irreversible financial commitments, or unresolved ambiguity the system cannot safely close. Everything else should be measured, constrained, and improved by the operating system itself.
16. Implementation blueprint
Step 1: Define artifact types
List the things agents produce: emails, support replies, reports, code diffs, lead scores, posts, invoices, deploys, customer notes, and strategic decisions.
Step 2: Create a scorecard per artifact
Each scorecard should include hard gates, graded criteria, confidence, required evidence, shipping threshold, and outcome metrics.
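One possible shape for such a scorecard, sketched as an assumed schema rather than a fixed standard; the outbound-email values are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    artifact_type: str
    hard_gates: list[str]            # deterministic pass/fail checks
    graded_criteria: list[str]       # rubric dimensions scored by a reviewer
    required_evidence: list[str]     # what must be attached before shipping
    shipping_threshold: int          # minimum aggregate score to ship
    outcome_metrics: list[str]       # delayed signals linked back to the artifact

outbound_email = Scorecard(
    artifact_type="outbound_email",
    hard_gates=["no_duplicate_recipient", "no_forbidden_claims", "within_budget"],
    graded_criteria=["specificity", "recipient_fit", "clear_ask", "voice"],
    required_evidence=["source_for_each_factual_claim"],
    shipping_threshold=75,
    outcome_metrics=["reply", "unsubscribe", "complaint", "meeting_booked"],
)
print(outbound_email.artifact_type, outbound_email.shipping_threshold)
```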
Step 3: Store every artifact with metadata
At minimum: input, output, agent, tools, timestamp, target, risk class, evaluator result, shipping decision, and outcome links.
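A minimal storage record under those assumptions; the field names mirror the list above, and the outcome links start empty so delayed signals can attach later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ArtifactRecord:
    artifact_id: str
    agent: str
    input_summary: str
    output: str
    tools_used: list[str]
    target: str                          # recipient, repo, account, or endpoint
    risk_class: str
    evaluator_result: dict
    shipping_decision: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    outcome_links: list[str] = field(default_factory=list)   # delayed outcomes attach here

record = ArtifactRecord(
    artifact_id="art-20250301-017",
    agent="outreach-writer",
    input_summary="follow up with trial account Acme",
    output="Hi Dana, following up on your Acme Analytics trial...",
    tools_used=["crm.lookup", "email.draft"],
    target="dana@example.com",
    risk_class="external_message",
    evaluator_result={"score": 82, "confidence": 0.7},
    shipping_decision="ship_with_notes",
)
print(record.artifact_id, record.shipping_decision, record.created_at.isoformat())
```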
Step 4: Add deterministic gates first
Start with cheap checks: formatting, duplicate detection, forbidden claims, required fields, schema validation, source presence, test execution, and smoke tests.
Step 5: Add model reviewers second
Use rubric-bound reviewers for judgment-heavy surfaces. Calibrate them against known examples. Randomize order in pairwise judgments. Penalize verbosity when verbosity is not useful. Avoid letting a model family be the only judge of its own outputs.
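Position bias is the easiest of those to counter mechanically: run every pairwise comparison in both orders and only trust a consistent winner. A sketch, with `pairwise_judge` as a stub for a real model call; it is deliberately written as a worst-case judge that always prefers position A, to show how the both-orders check neutralizes it into a tie.

```python
def pairwise_judge(prompt: str, candidate_a: str, candidate_b: str) -> str:
    """Stub for a model-based judge that answers 'A' or 'B'. This one is the worst
    case: it always prefers whatever is shown in position A."""
    return "A"

def debiased_preference(prompt: str, x: str, y: str) -> str:
    """Run the comparison in both orders; only a consistent winner is trusted."""
    first = pairwise_judge(prompt, x, y)     # x shown in position A
    second = pairwise_judge(prompt, y, x)    # y shown in position A
    if first == "A" and second == "B":
        return "x"
    if first == "B" and second == "A":
        return "y"
    return "tie"                             # order-dependent answers do not count

print(debiased_preference(
    "Which reply better resolves the ticket?",
    "short, precise answer",
    "longer reply that restates the question first",
))
```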
Step 6: Attach outcome data
Connect replies, conversions, bug reports, support reopens, unsubscribes, refund requests, usage, revenue, and errors back to the artifact.
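A sketch of that attachment, assuming both the shipped artifact and the later outcome event carry the same artifact id. The join itself is simple bookkeeping; the hard part is writing the id down at ship time so the delayed signal has something to attach to.

```python
from collections import defaultdict

# Artifacts shipped earlier, keyed by id (assumed shape).
shipped = {
    "art-017": {"type": "outbound_email", "score": 82},
    "art-018": {"type": "outbound_email", "score": 91},
}

# Delayed outcome events arriving days or weeks later (assumed shape).
events = [
    {"artifact_id": "art-017", "event": "reply"},
    {"artifact_id": "art-018", "event": "unsubscribe"},
    {"artifact_id": "art-017", "event": "meeting_booked"},
]

# Join each outcome back to the artifact that caused it.
outcomes = defaultdict(list)
for event in events:
    outcomes[event["artifact_id"]].append(event["event"])

for artifact_id, artifact in shipped.items():
    # Note: the higher-scored artifact earned the worse outcome; that disagreement
    # is exactly the signal the learning layer needs.
    print(artifact_id, artifact["score"], outcomes.get(artifact_id, []))
```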
Step 7: Convert failures into tests
Every recurring miss becomes a guard. Every high-impact miss becomes an incident review. Every incident review ends with a durable system change.
Step 8: Promote exceptional work into exemplars
Do not only remember failures. Preserve the work that actually performed. Agents need positive taste, not only negative constraints.
Step 9: Periodically audit the evaluator
Evaluators drift too. Rubrics can become stale. Model judges can develop bias. Outcome metrics can reward the wrong behavior. Every evaluator should have its own tests and calibration set.
17. What this changes
The competitive advantage in agentic businesses will not come from access to the same frontier models everyone else can call. It will come from the private improvement loop around those models.
The moat is the accumulated structure:
- Artifact history.
- Scoring rubrics.
- Gold examples.
- Failure library.
- Outcome-linked dataset.
- Tests and gates.
- Operational memory.
- Audit trail.
- Ability to raise standards automatically.
Two companies can use the same model and get different results because one has a feedback system and the other has prompts.
Prompts are instructions. Feedback systems are compounding infrastructure.
18. The central standard
An agentic business should be judged by a simple test:
When an agent produces something bad, does the business merely fix that one thing, or does the business become harder to fool next time?
If the system only fixes the artifact, it is not self-improving.
If the system changes its tests, rubrics, examples, rules, memory, monitors, or gates so the failure is less likely to recur, it is learning.
The future agentic business is not a swarm of agents doing tasks. It is a business that can observe its own work, judge it, remember what mattered, and turn every miss into a stronger operating system.
That is the foundation.