Why most AI agents fail in production

Industry surveys put the failure rate for moving agent systems into production near 90 percent. The reasons are consistent and technical. Here is what breaks, and why.

June 10, 2026 · 7 min read

Most AI agent projects do not make it to production. Industry surveys put the failure rate for moving agent systems from prototype to production at roughly 90 percent, and Gartner has forecast that more than 40 percent of agentic AI projects will be canceled by the end of 2027. The failures are not random. They cluster around a small set of technical problems that show up again and again.

This is the lens through which Perch is built. Naming the failure modes precisely is the first step to designing around them.

The failure modes

Hallucination cascades. A single fabricated fact early in an agent's reasoning chain propagates through every step that follows. A chatbot that invents a detail only produces text. An agent acts on it: it calls a tool, edits a file, or sends a message based on the false premise. The error becomes an action before anyone can catch it.
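
One guard against a cascade is to refuse to act on a premise that has no support. The sketch below is illustrative only: the `Claim` type and the helper names are hypothetical, and the substring check confirms only that a quoted passage exists in the cited source, not that it actually entails the claim.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    source_id: Optional[str] = None   # document the claim is attributed to
    quote: Optional[str] = None       # exact passage offered as support


def is_supported(claim: Claim, sources: dict[str, str]) -> bool:
    """A claim counts as supported only if its quote appears in its cited source."""
    if not claim.source_id or not claim.quote:
        return False
    return claim.quote in sources.get(claim.source_id, "")


def render(claim: Claim, sources: dict[str, str]) -> str:
    # Mark unsupported claims in the answer itself rather than hiding them.
    tag = f"[source: {claim.source_id}]" if is_supported(claim, sources) else "[unverified]"
    return f"{claim.text} {tag}"


def may_act_on(claims: list[Claim], sources: dict[str, str]) -> bool:
    # Refuse to trigger a tool call if any premise lacks support.
    return all(is_supported(c, sources) for c in claims)
```

A stronger check would compare meaning rather than strings, but even this weak version stops an action whose premise cites nothing at all.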

Context drift. Agent frameworks accumulate conversation history and tool output into one growing context window. As the window fills, the model loses reliable access to its original instructions and constraints, and it starts pursuing side goals. Quality degrades well before the hard token limit, not at it.
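
A common mitigation is to rebuild the context on every turn: pin the original instructions and keep only as much recent history as fits a budget. The following is a minimal sketch under assumptions. `Message`, `rough_tokens`, and `build_context` are hypothetical names, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


def rough_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def build_context(instructions: Message, history: list[Message], budget: int) -> list[Message]:
    """Pin the instructions, then keep the newest history that fits the budget."""
    remaining = budget - rough_tokens(instructions.content)
    kept: list[Message] = []
    for msg in reversed(history):          # walk newest to oldest
        cost = rough_tokens(msg.content)
        if cost > remaining:
            break                          # this and all older messages are dropped
        kept.append(msg)
        remaining -= cost
    return [instructions] + kept[::-1]     # restore chronological order
```

Setting the budget well below the model's hard limit reflects the point above: quality tends to fall off before the window is full.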

Unverified arithmetic. Language models are unreliable at exact computation. In accounting, audit, and financial work, a total that looks right but is wrong is worse than no answer, because it reads as authoritative and gets trusted.
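
The usual remedy is to have the model extract the numbers and let ordinary code do the arithmetic. A minimal sketch using Python's standard `decimal` module follows. The function name and the trace format are illustrative assumptions, not a prescribed interface.

```python
from decimal import Decimal, ROUND_HALF_UP


def total_with_trace(line_items: list[tuple[str, str]]) -> tuple[Decimal, list[str]]:
    """Sum amounts given as decimal strings and record how the total was built."""
    total = Decimal("0")
    trace: list[str] = []
    for label, amount in line_items:
        value = Decimal(amount)            # exact parse, no binary float rounding
        total += value
        trace.append(f"{label}: {value} -> running total {total}")
    cents = Decimal("0.01")
    return total.quantize(cents, rounding=ROUND_HALF_UP), trace
```

Passing amounts as strings matters: `Decimal("0.10")` is exact, while a float such as `0.1` already carries a binary rounding error before any addition happens.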

Tool failure propagation. When an external tool times out, hits a rate limit, or changes its response shape, most frameworks offer thin error handling. The agent rarely understands why the call failed, so it retries the same failing call in a loop or abandons a valid path entirely.
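
One way to avoid blind retries is to separate failures that may clear on their own from failures that will not. The sketch below assumes two hypothetical exception classes and a hypothetical `call_tool` wrapper. A real integration would map its client library's own errors onto these categories.

```python
import time
from typing import Callable


class TransientToolError(Exception):
    """Timeouts and rate limits: the same call may succeed later."""


class PermanentToolError(Exception):
    """Bad arguments or a changed response shape: retrying will not help."""


def call_tool(tool: Callable[[], dict], required_keys: set[str],
              max_attempts: int = 3, base_delay: float = 0.5,
              sleep: Callable[[float], None] = time.sleep) -> dict:
    """Retry transient failures with backoff; fail fast on a changed response shape."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool()
        except TransientToolError:
            if attempt == max_attempts:
                raise                                  # give up, cause intact
            sleep(base_delay * 2 ** (attempt - 1))     # exponential backoff
            continue
        missing = required_keys - set(result)
        if missing:
            # Surface why the call is unusable instead of retrying blindly.
            raise PermanentToolError(f"response missing keys: {sorted(missing)}")
        return result
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry schedule can be tested without waiting.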

Observability gaps. Agent reasoning is non-deterministic and opaque. When a multi-step run fails, ordinary logs cannot tell you which tool returned bad data or where a claim was invented. A team that cannot see the failure cannot fix it.
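
Closing the gap starts with recording every tool call as structured data rather than free-form log lines. The following is a rough sketch: the `Tracer` class and its field names are assumptions chosen for illustration, not a standard schema.

```python
import json
import time
import uuid
from typing import Any, Callable


class Tracer:
    def __init__(self) -> None:
        self.run_id = uuid.uuid4().hex
        self.events: list[dict[str, Any]] = []

    def traced_call(self, tool_name: str, fn: Callable[..., Any], **params: Any) -> Any:
        """Call fn(**params) and record what was called, with what, and how it ended."""
        event: dict[str, Any] = {"run_id": self.run_id, "step": len(self.events) + 1,
                                 "tool": tool_name, "params": params}
        start = time.perf_counter()
        try:
            result = fn(**params)
            event["status"] = "ok"
            event["result_preview"] = repr(result)[:200]
            return result
        except Exception as exc:
            event["status"] = "error"
            event["error"] = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
            self.events.append(event)

    def dump(self) -> str:
        # One JSON object per line, easy to grep or load into a log store.
        return "\n".join(json.dumps(e, default=str) for e in self.events)
```

With a trace like this, the question "which tool returned bad data" becomes a lookup by step number instead of a guess.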

Runaway cost. Without bounded loops and budget discipline, an agent can search, read, decide it needs more information, and search again until the budget is gone, with nothing useful produced.
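
A bounded loop can be sketched in a few lines. The example below assumes a hypothetical `step` callback that returns an optional answer, the cost of the step, and a short note. The names and the cost unit are illustrative. The key property is that the loop always terminates and always reports why it stopped.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunReport:
    answer: Optional[str] = None
    steps_taken: int = 0
    spent: float = 0.0
    stopped_because: str = ""
    notes: list[str] = field(default_factory=list)


def run_bounded(step: Callable[[int], tuple[Optional[str], float, str]],
                max_steps: int, budget: float) -> RunReport:
    """Run `step` until it returns an answer, or a step or cost limit is hit."""
    report = RunReport()
    for i in range(max_steps):
        if report.spent >= budget:
            report.stopped_because = "budget exhausted"
            return report
        answer, cost, note = step(i)
        report.steps_taken += 1
        report.spent += cost
        report.notes.append(note)      # keep partial progress for the final report
        if answer is not None:
            report.answer = answer
            report.stopped_because = "answered"
            return report
    report.stopped_because = "step limit reached"
    return report
```

Because the check runs before each step, a single expensive step can still overshoot the budget. A stricter version would estimate the cost before making the call.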

Prompt injection. Agents that read external content are exposed to indirect prompt injection, where instructions hidden in a retrieved document hijack the agent's behavior. Few systems sanitize input or enforce permission boundaries at the agent level.
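
A permission boundary at the agent level might look like the sketch below. The tool names, the `Session` type, and the taint rule are all hypothetical. Labeling retrieved text as data does not by itself stop injection; the enforcement lives in `authorize`, which runs outside the model.

```python
from dataclasses import dataclass

READ_ONLY_TOOLS = {"search", "read_file"}            # hypothetical tool names
STATE_CHANGING_TOOLS = {"send_email", "delete_file"}  # hypothetical tool names


@dataclass
class Session:
    tainted: bool = False   # True once untrusted external content has been read


def ingest_external(session: Session, text: str) -> str:
    """Mark the session tainted and label the text as data, not instructions."""
    session.tainted = True
    return f"<untrusted_document>\n{text}\n</untrusted_document>"


def authorize(session: Session, tool: str, user_approved: bool = False) -> bool:
    if tool in READ_ONLY_TOOLS:
        return True
    if tool in STATE_CHANGING_TOOLS:
        # After reading untrusted content, side effects need explicit approval.
        return user_approved or not session.tainted
    return False            # unknown tools are denied by default
```

The design choice is that the model never decides its own permissions: once external content enters the session, anything with side effects requires a human or policy decision.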

These are not hypothetical

The cost of these failures is already on the public record.

A customer-service agent at Klarna was reported to drift on long conversation threads and fabricate refund policies, which contributed to a partial rollback of the deployment.

Air Canada was held liable by a British Columbia tribunal in 2024 after its support chatbot told a customer he could claim a bereavement fare retroactively, something the airline's actual policy did not allow. The customer relied on the fabricated policy, and the airline was found responsible. The decision made one thing clear: an agent's hallucination can carry real legal consequences for the business that deployed it.

A customer-service chatbot operated by the delivery company DPD was pushed off-script by a customer's adversarial prompts and produced profanity in a customer chat, a public demonstration that these vulnerabilities are exploitable today, not in theory.

The pattern is consistent. When an agent acts on something it cannot support, the cost lands on the business that shipped it.

What reliable systems do differently

None of this means agents do not work. It means the reliable ones are built around these failure modes rather than hoping to avoid them.

Tie every claim to a source, and mark what cannot be verified. Verification belongs inside the answer, not in a separate review step that may never be run.
Compute exact numbers; do not generate them. Totals and reconciliations should run through deterministic computation so that every figure can be traced to how it was produced.
Gate state-changing actions and leave receipts. Anything that sends, deletes, or writes should pass an approval check before it runs and be reviewable after the fact, so an automated step cannot quietly do something irreversible.
Bound the loops and tell the truth about partial results. Stopping and reporting what was and was not done is more useful than looping or pretending an action succeeded.
Instrument everything. Structured traces of tool calls, parameters, and reasoning are what make failure debuggable instead of mysterious.
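
As an illustration of gating with receipts, the sketch below wraps a state-changing action in an approval check and appends a record whether the action ran, was denied, or failed. The function name, the receipt fields, and the `approve` callback are assumptions made for this example. In practice the approval might be a policy rule or a human prompt.

```python
import json
from datetime import datetime, timezone
from typing import Any, Callable


def gated_action(name: str, params: dict[str, Any],
                 execute: Callable[[dict[str, Any]], Any],
                 approve: Callable[[str, dict[str, Any]], bool],
                 receipts: list[str]) -> bool:
    """Run a state-changing action only if approved; always append a receipt."""
    receipt: dict[str, Any] = {"action": name, "params": params,
                               "requested_at": datetime.now(timezone.utc).isoformat()}
    if not approve(name, params):
        receipt["outcome"] = "denied"
    else:
        try:
            receipt["result"] = repr(execute(params))
            receipt["outcome"] = "executed"
        except Exception as exc:
            receipt["outcome"] = "failed"
            receipt["error"] = f"{type(exc).__name__}: {exc}"
    receipts.append(json.dumps(receipt, default=str))
    return receipt["outcome"] == "executed"
```

Returning an honest boolean matters as much as the receipt: the caller learns that a denied or failed action did not happen, instead of assuming success.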

Perch is built on these principles. The rest of our research covers how we put them into practice, from how we evaluate models for this kind of work to the training research behind it.
