Research
Building AI you can verify.
Most AI agents fail in production for reasons that are now well understood. Perch is built around those failure modes, not in spite of them. This is how we think about reliable, checkable AI, how we test it, and the research that shapes it.
Writing
From the lab.
Notes on how we build verifiable AI: the failure modes we design around, how we test models, and the training research behind the product.
The problem
Why most AI agents fail in production.
Industry surveys put the failure rate for moving agent systems from prototype to production at roughly 90 percent, and Gartner has forecast that more than 40 percent of agentic AI projects will be canceled by the end of 2027. The reasons are consistent and technical.
- Hallucination cascades. A single fabricated fact early in an agent's reasoning propagates through every later step. Unlike a chatbot that only writes text, an agent acts on the error, calling tools, editing files, or sending messages, so the damage is done before anyone notices.
- Context drift. As history and tool output fill the context window, the model loses track of its original instructions and starts pursuing side goals. Quality degrades well before the hard limit, not at it.
- Unverified arithmetic. Language models are unreliable at exact totals. In finance and accounting work, an eyeballed sum that looks right is worse than no answer, because it reads as authoritative.
- Tool failure loops. When a tool times out or changes its response format, most systems retry the same failed call or quit early. The agent rarely understands why the call failed, so it cannot recover intelligently.
- Observability gaps. When a multi-step run fails, ordinary logs cannot say which tool returned bad data or where a claim was invented. Teams that cannot see the failure cannot fix it.
- Runaway cost. Without bounded loops and budget discipline, an agent can search, read, decide it needs more, and search again until the budget is gone, with nothing to show for it.
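The last few failure modes are easiest to see next to their remedy. The following is a minimal sketch, not Perch's implementation: `ToolError`, `RunReport`, and `run_bounded` are hypothetical names, and the per-step costs and retry limit are assumed values. It shows a loop that charges every attempt against a budget, retries only errors marked as retryable, and returns a report of what did and did not happen.

```python
from dataclasses import dataclass, field


class ToolError(Exception):
    """Hypothetical tool error; `retryable` says whether a retry can help."""

    def __init__(self, message, retryable):
        super().__init__(message)
        self.retryable = retryable


@dataclass
class RunReport:
    completed: list = field(default_factory=list)
    failed: list = field(default_factory=list)
    stopped_reason: str = "finished"
    spent: float = 0.0


def run_bounded(steps, budget, max_attempts=2):
    """Run (name, cost, callable) steps without exceeding `budget`."""
    report = RunReport()
    for name, cost, call in steps:
        attempts = 0
        while True:
            # Stop before spending money we do not have.
            if report.spent + cost > budget:
                report.stopped_reason = f"budget exhausted before '{name}'"
                return report
            report.spent += cost
            attempts += 1
            try:
                call()
                report.completed.append(name)
                break
            except ToolError as err:
                # Retry only when the error says a retry can help,
                # and only up to max_attempts.
                if not err.retryable or attempts >= max_attempts:
                    report.failed.append((name, str(err)))
                    break
    return report


def ok():
    return "ok"


def flaky():
    raise ToolError("timeout", retryable=True)


def bad_schema():
    raise ToolError("response schema changed", retryable=False)


report = run_bounded(
    [
        ("search", 1.0, ok),
        ("fetch", 1.0, flaky),
        ("parse", 1.0, bad_schema),
        ("summarize", 5.0, ok),
    ],
    budget=6.0,
)
print(report)
```

In this run the timeout is retried once, the schema change is not retried at all, and the final step is skipped because it would exceed the budget. The report records all three outcomes instead of hiding them.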
These are not hypothetical. A customer-service agent at Klarna was publicly reported to drift on long threads and fabricate refund policies, prompting a partial rollback. A Canadian tribunal held Air Canada liable after its support chatbot told a customer he could claim a bereavement fare retroactively, which the airline's actual policy did not allow, establishing that an agent's hallucination can carry real legal consequences. A support chatbot run by the delivery company DPD was pushed off-script by a customer's prompts and produced profanity in the chat. The pattern is consistent: when an agent acts on something it cannot support, the cost lands on the business.
Our approach
How Perch is built differently.
Each failure mode above has a design answer. Verifiability is not a feature bolted on at the end. It is the constraint the whole system is built around.
- Every claim carries its source. Perch ties each factual statement to the material it actually read, and marks anything it cannot support rather than smoothing over it. Verification is part of the answer, not a step you run afterward.
- Numbers are computed, not guessed. For totals, reconciliations, and other exact work, Perch prefers deterministic computation over model arithmetic, so a figure can be traced to how it was produced.
- State-changing actions leave receipts. Anything that creates a document, sends a message, or changes a file is recorded as a reviewable action. The work is auditable after the fact, not a black box.
- Bounded loops and honest failure. Perch is built to stop, report what it did and did not do, and surface uncertainty, rather than loop or pretend an action happened. Telling the truth about a partial result is part of the design.
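As an illustration of the first two principles, here is a small sketch under stated assumptions. The `Claim` type, the `label` and `traced_total` helpers, and the invoice references are all hypothetical and are not Perch's API. The idea is that a claim without a source is marked rather than smoothed over, and a total is computed with exact decimal arithmetic alongside a trace of how it was produced.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Claim:
    text: str
    source: Optional[str]  # None means nothing read supports the claim


def label(claim):
    """Render a claim with its source, or flag it as unsupported."""
    if claim.source is None:
        return f"[UNSUPPORTED] {claim.text}"
    return f"{claim.text} [source: {claim.source}]"


def traced_total(line_items):
    """Sum (source_ref, amount_string) pairs exactly and keep a trace."""
    total = Decimal("0")
    trace = []
    for ref, amount in line_items:
        value = Decimal(amount)  # exact decimal, no float rounding
        total += value
        trace.append(f"{ref}: {value} (running total {total})")
    return total, trace


# Hypothetical invoice references and amounts.
items = [("inv-001", "1200.10"), ("inv-002", "349.95"), ("inv-003", "0.05")]
total, trace = traced_total(items)

print(label(Claim(f"Invoices total {total}", "inv-001, inv-002, inv-003")))
print(label(Claim("The vendor offers a 2% early-payment discount", None)))
for step in trace:
    print(step)
```

Parsing amounts as `Decimal` from strings avoids binary floating-point rounding, and the trace lets a reviewer check each step of the sum against its source line.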
How we choose models
Models earn their place on real work.
Perch does not pick a model on vibes or leaderboard scores. Every candidate runs through a standardized harness built from realistic professional tasks: accounts-payable audits, statement and earnings reviews, reconciliations, and know-your-customer checks, each with issues planted in the source files so we know the correct answer in advance.
We score each run on grounded accuracy, whether the right tools were used, the quality of the evidence and citations produced, how the model recovers from missing inputs and tool failures, and cost discipline. Hallucinated tool results, invented figures, and pretending an action happened are penalized hardest, because in professional work a confident wrong answer is the most expensive outcome. A model ships only when it grounds its answers and uses tools honestly under this pressure.
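A scoring scheme of this kind can be sketched as a weighted sum with a flat penalty for dishonest behavior. Everything specific here is an assumption made for illustration: the weights, the penalty size, the issue names, and the function names are hypothetical, and the real harness may score quite differently.

```python
from dataclasses import dataclass

# Hypothetical weights for the five dimensions named above.
WEIGHTS = {
    "grounded_accuracy": 0.35,
    "tool_use": 0.20,
    "evidence_quality": 0.20,
    "recovery": 0.15,
    "cost_discipline": 0.10,
}
# Hypothetical flat deduction per dishonest behavior.
PENALTY_PER_VIOLATION = 0.25


def planted_issue_recall(planted, reported):
    """Share of the planted issues that the model actually reported."""
    planted = set(planted)
    if not planted:
        return 1.0
    return len(planted & set(reported)) / len(planted)


@dataclass
class RunResult:
    scores: dict  # one value in [0, 1] per key in WEIGHTS
    hallucinated_tool_results: int = 0
    invented_figures: int = 0
    pretended_actions: int = 0


def score_run(run):
    """Weighted score minus penalties, floored at zero."""
    base = sum(WEIGHTS[name] * run.scores[name] for name in WEIGHTS)
    violations = (
        run.hallucinated_tool_results
        + run.invented_figures
        + run.pretended_actions
    )
    return max(0.0, base - PENALTY_PER_VIOLATION * violations)


planted = {"duplicate-invoice", "wrong-vendor-total", "missing-po"}

# An honest run that misses one planted issue.
honest = RunResult(scores={
    "grounded_accuracy": planted_issue_recall(
        planted, {"duplicate-invoice", "wrong-vendor-total"}),
    "tool_use": 0.8, "evidence_quality": 0.8,
    "recovery": 0.8, "cost_discipline": 0.8,
})

# A run that looks perfect but invented a figure and faked an action.
confident_wrong = RunResult(
    scores={name: 1.0 for name in WEIGHTS},
    invented_figures=1,
    pretended_actions=1,
)

print(score_run(honest), score_run(confident_wrong))
```

With these assumed numbers, the honest run that misses one planted issue outscores the run that looks perfect on every dimension but invented a figure and faked an action.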
Domain adaptation
Training for the work, not the demo.
Beyond evaluating models, we run our own training research on open base models for specific financial domains. We have trained a LoRA adapter on Llama 3.1 8B for equity and volatility analysis, a parameter-efficient fine-tune on a curated, real-world market dataset, and we have run domain-adaptive pretraining experiments on Qwen. The goal is narrow and practical: better grounding, cleaner structured output, and steadier behavior on the kinds of documents professionals actually work with.
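To give a sense of why LoRA counts as parameter-efficient, the sketch below counts the trainable weights an adapter adds. The rank of 16 and the choice to adapt only the query and value projections are assumptions made for illustration, not the configuration described above. The layer dimensions are those of the published Llama 3.1 8B architecture. For a frozen weight with `d_in` inputs and `d_out` outputs, a rank-`r` adapter trains two small matrices with `r * (d_in + d_out)` parameters in total.

```python
def lora_trainable_params(shapes, rank, num_layers):
    """Count LoRA parameters for the adapted projections in every layer.

    shapes: (in_features, out_features) of each adapted projection
    in a single transformer layer.
    """
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# Llama 3.1 8B dimensions: hidden size 4096, 32 layers,
# 8 key/value heads of dimension 128 (so value projections map 4096 -> 1024).
HIDDEN = 4096
KV_DIM = 8 * 128
LAYERS = 32

# Assumed adapter targets: query and value projections only.
adapted = [(HIDDEN, HIDDEN), (HIDDEN, KV_DIM)]

n_trainable = lora_trainable_params(adapted, rank=16, num_layers=LAYERS)
print(f"{n_trainable:,} trainable parameters")
```

Under these assumptions the adapter trains about 6.8 million parameters, well under 1 percent of the roughly 8 billion in the base model, which stays frozen.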
This research informs how Perch routes and grounds work today, and it is how we keep improving the parts of the pipeline that matter most for verifiable, decision-useful output.