How we evaluate models for verifiable work

A benchmark is only useful if it predicts real behavior. Ours is built from realistic financial tasks with known-correct answers, and it penalizes confident wrong answers hardest.

June 14, 2026 · 6 min read

Public leaderboards measure things that rarely predict how a model behaves inside a real product. A model can top a reasoning benchmark and still fabricate a tool result, miscount a column of invoices, or loop on a failed browser action. For verifiable professional work, those behaviors are disqualifying, so we test for them directly.

Every model we consider runs through a standardized harness before it goes anywhere near production.

Realistic tasks with known answers

The harness is built from the kinds of work professionals actually do: accounts-payable audits, statement and earnings reviews, ledger reconciliations, and know-your-customer checks. Each scenario uses a realistic corpus of files with issues planted in them, so we know the correct answer in advance. A duplicate invoice is in there on purpose. A figure that contradicts another document is in there on purpose. A missing approval is in there on purpose.

This matters because grading against a known answer key is the only way to separate a model that found the real issue from one that produced a confident, plausible, wrong one. Open-ended prompts cannot tell those apart. Planted issues can.
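A minimal sketch of what grading against an answer key can look like. The `Issue` type, the `grade_against_key` function, and the file names are hypothetical illustrations, not the actual harness: planted issues form the key, and anything a model reports that is not in the key is counted as spurious rather than given the benefit of the doubt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """One issue in a scenario corpus (hypothetical shape, for illustration)."""
    kind: str  # e.g. "duplicate_invoice"
    file: str  # document the issue lives in


def grade_against_key(planted: set[Issue], reported: set[Issue]) -> dict[str, object]:
    """Compare what a model reported with what was planted on purpose."""
    found = planted & reported     # real issues the model caught
    missed = planted - reported    # real issues the model overlooked
    spurious = reported - planted  # confident, plausible, wrong
    return {
        "found": len(found),
        "missed": len(missed),
        "spurious": len(spurious),
        "recall": len(found) / len(planted) if planted else 1.0,
    }


# Made-up corpus: three planted issues, two reported, one of them invented.
planted = {
    Issue("duplicate_invoice", "inv_0042.pdf"),
    Issue("contradicting_figure", "q3_summary.xlsx"),
    Issue("missing_approval", "po_0017.pdf"),
}
reported = {
    Issue("duplicate_invoice", "inv_0042.pdf"),
    Issue("duplicate_invoice", "inv_0099.pdf"),
}
result = grade_against_key(planted, reported)
```

With an open-ended prompt, the second reported issue would look as credible as the first. With a key, it lands in the `spurious` count.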

What we score

Each run is scored on five dimensions rather than collapsed into a single number:

Grounded accuracy. Did it find the real issues, with totals that reconcile, citing the evidence it used, without inventing files or results?
Tool and workflow routing. Did it choose the right path for the task and complete the workflow, instead of improvising around the tools that exist?
Evidence quality. Did it produce reports, tables, and source references a finance or legal user could actually trust and check?
Recovery and autonomy. Did it handle missing inputs, tool failures, and partial results without looping or quitting early?
Cost discipline. Did it reach the answer without runaway tool loops, duplicate scans, or needless context churn?
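One way to keep those dimensions separate is a small record per run. This is a sketch under assumptions: the `RunScore` name, the field names, and the 0-to-1 scale are all hypothetical, chosen only to mirror the list above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunScore:
    """Per-run scores, one field per dimension (assumed scale: 0.0 to 1.0)."""
    grounded_accuracy: float
    tool_routing: float
    evidence_quality: float
    recovery_autonomy: float
    cost_discipline: float

    def __post_init__(self) -> None:
        # Reject out-of-range values early so a bad grader cannot hide in an average.
        for name, value in asdict(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def weakest(self) -> tuple[str, float]:
        """Return the lowest-scoring dimension instead of an overall mean."""
        return min(asdict(self).items(), key=lambda item: item[1])


score = RunScore(
    grounded_accuracy=0.9,
    tool_routing=0.8,
    evidence_quality=0.95,
    recovery_autonomy=0.4,
    cost_discipline=0.7,
)
weak_name, weak_value = score.weakest()  # ("recovery_autonomy", 0.4)
```

Reporting the weakest dimension, rather than a mean, keeps a strong accuracy score from masking a model that quits early on tool failures.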

The heaviest penalties

The scoring is deliberately asymmetric. The largest penalties are reserved for the failures that do the most damage in real work:

Hallucinating tool results or pretending an action happened.
Presenting invented figures as if they were computed from the source.
Claiming a file was read or a document was created when it was not.

In professional work, a confident wrong answer is more expensive than a hedged uncertain one, because it gets trusted and acted on. The harness reflects that. A model that says "I could not verify this" scores better than one that fabricates a clean-looking result.
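The asymmetry can be sketched as a points table. The outcome labels and the weights below are hypothetical, picked only to show the ordering described above: an honest "I could not verify this" costs nothing, an ordinary mistake costs a little, and a fabrication costs far more than a correct finding earns.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    ABSTAINED = "abstained"    # "I could not verify this"
    WRONG = "wrong"            # honest mistake, derived from the source
    FABRICATED = "fabricated"  # invented figure, tool result, or file read


# Assumed weights, for illustration only. The real values are not published here.
POINTS = {
    Outcome.CORRECT: 1.0,
    Outcome.ABSTAINED: 0.0,
    Outcome.WRONG: -1.0,
    Outcome.FABRICATED: -5.0,
}


def score_findings(outcomes: list[Outcome]) -> float:
    """Sum the per-finding points for one run."""
    return sum(POINTS[outcome] for outcome in outcomes)


honest = score_findings([Outcome.CORRECT, Outcome.CORRECT, Outcome.ABSTAINED])
fabricator = score_findings([Outcome.CORRECT, Outcome.CORRECT, Outcome.FABRICATED])
print(honest)      # 2.0
print(fabricator)  # -3.0
```

Both runs found the same two real issues. The only difference is what each did with the third, and under these weights that difference decides the ranking.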

The bar for shipping

A model earns a place in Perch only when it grounds its answers and uses tools honestly under this pressure, across the full battery, not on a single lucky run. We also track token and dollar cost per run, because a model that produces good work at a cost that does not scale is its own kind of failure.
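A shipping gate along these lines might look like the sketch below. Everything in it is an assumption made for illustration: the `RunResult` fields, the rule that every run in the battery must pass, and the mean-cost ceiling are not taken from the actual harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one scenario run (hypothetical shape)."""
    scenario: str
    passed: bool       # grounded and complete on this run
    fabrications: int  # count of invented results or false claims of action
    cost_usd: float    # dollar cost of the run


def clears_bar(runs: list[RunResult], max_mean_cost_usd: float) -> bool:
    """Gate on the whole battery, never on the best single run."""
    if not runs:
        return False
    if any(run.fabrications > 0 for run in runs):
        return False  # one fabrication anywhere blocks the model
    if not all(run.passed for run in runs):
        return False
    mean_cost = sum(run.cost_usd for run in runs) / len(runs)
    return mean_cost <= max_mean_cost_usd


battery = [
    RunResult("ap_audit", passed=True, fabrications=0, cost_usd=0.40),
    RunResult("ledger_reconciliation", passed=True, fabrications=0, cost_usd=0.60),
    RunResult("kyc_check", passed=True, fabrications=1, cost_usd=0.30),
]
ships = clears_bar(battery, max_mean_cost_usd=1.00)  # False: one fabrication
```

Checking fabrications before anything else reflects the priority in the text: cost and pass rate only matter once the model has been honest on every run.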

The point of all of this is simple. The promise of Perch is that you can check the work. A model that cannot be trusted to tell the truth about what it did, even when the honest answer is "I am not sure," cannot keep that promise, no matter how well it scores elsewhere.

For the design principles this evaluation supports, see why most AI agents fail in production.
