Can a general AI assistant like Copilot or Cowork do accounts payable work?

A general assistant can summarize an invoice or draft an email about it, but it should not be trusted to audit accounts payable. The core AP checks are exact arithmetic and matching against source documents, and general assistants produce confident, plausible, and sometimes wrong answers with no evidence trail and no guarantee that the same file returns the same result twice. For work that decides what gets paid, that is disqualifying.

What checks does accounts payable actually require?

At a minimum, three-way matching of invoices to purchase orders and payments, duplicate invoice and payment detection, vendor master validation, approval threshold and over-approval checks, payment mismatch detection, and aging and reconciliation against the general ledger. Each one is a precise comparison with a correct answer, not an open-ended question.

How is Perch different from AP automation tools like Tipalti, Stampli, or AppZen?

Workflow tools such as Tipalti, Stampli, and Bill are strong at routing, capturing, and approving invoices. AppZen applies machine learning to expense and AP audit. Perch focuses on running the actual checks deterministically, computing the totals and matches as code rather than as a model guess, and returning each exception with the evidence behind it so a person can verify it.

Is the math in Perch produced by the language model?

No. The language model orchestrates the work and explains the findings, but the matching and arithmetic are computed deterministically. That is the point. A figure that looks right but was generated by a model is exactly the failure mode that makes AI untrustworthy for finance.


AI for accounts payable: assistants, AP tools, and what deterministic checking does differently

Accounts payable looks like a reading-and-arithmetic problem, which is exactly where current AI is least reliable. Here is how the existing tools approach it, why a general chat assistant is the wrong instrument, and what changes when the checks run as code.

June 27, 2026 · 10 min read

Accounts payable is one of the first places companies try to put AI to work, and one of the first places it quietly fails. The work looks simple from the outside. Read an invoice, match it to a purchase order and a payment, check that the numbers agree, and flag what does not. In practice it is a long sequence of exact checks run against a pile of inconsistent documents, and exactness is precisely where today's AI is least reliable. A general assistant that is wrong about a total still sounds completely sure of itself, and in finance a confident wrong answer is worse than no answer, because it reads as authoritative, gets trusted, and gets paid.

This piece lays out what the work actually demands, how the existing tools approach it, why the new wave of general assistants is the wrong instrument for it, and what changes when the checks are computed rather than guessed.

What accounts payable actually requires

Strip away the software and AP audit is a set of specific comparisons, each with a correct answer that exists before you start looking:

Three-way match. Does the invoice agree with the purchase order it cites and the payment that cleared against it, on quantity, price, and total? Flag exactly where the three break.
Duplicate detection. Is this invoice, or this payment, a repeat of one already in the system under a slightly different number, date, or vendor spelling? Duplicates are one of the most common paths to cash leakage.
Vendor master checks. Is this invoice from a vendor that actually exists in the approved vendor master, or from one that was never set up? Invoices from unknown vendors are a classic fraud and leakage signal.
Approval and over-approval. Was the invoice approved by someone with authority, and does it exceed the purchase order or threshold it was approved under, even when the vendor total looks ordinary?
Payment mismatches. Does the payment that went out match the invoice it was posted against, and if not, what is the variance and which direction does it run?
Aging and reconciliation. Which payables are past due and how exposed is the company, and does ledger activity tie out to the underlying documents?

The thing these share is that each one has a known-correct answer. A duplicate is either present or it is not. A total either reconciles or it does not. That is what makes the bar so high. The job is not to produce a reasonable-sounding summary. It is to be right, and to show the evidence that proves it.
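To make "a known-correct answer" concrete, here is a minimal sketch of a three-way match written as plain computation. It is an illustration under stated assumptions, not Perch's implementation: the `Record` shape, the field names, and the `three_way_match` function are hypothetical, each document is reduced to a single line, and amounts are held as `Decimal` so that equality means exact equality.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Record:
    """Hypothetical single-line document: an invoice or a purchase order."""
    doc_id: str
    quantity: Decimal
    unit_price: Decimal
    total: Decimal


def three_way_match(invoice, po, payment_amount, tolerance=Decimal("0.00")):
    """Return every break as (check, invoice_value, other_value, variance)."""
    breaks = []

    # 1. Invoice against the purchase order it cites, field by field.
    for field in ("quantity", "unit_price", "total"):
        inv_val = getattr(invoice, field)
        po_val = getattr(po, field)
        if abs(inv_val - po_val) > tolerance:
            breaks.append((f"invoice_vs_po.{field}", inv_val, po_val, inv_val - po_val))

    # 2. Invoice against its own arithmetic: quantity x price must equal total.
    computed = invoice.quantity * invoice.unit_price
    if abs(invoice.total - computed) > tolerance:
        breaks.append(("invoice.extended_total", invoice.total, computed,
                       invoice.total - computed))

    # 3. Payment against the invoice it was posted to.
    if abs(payment_amount - invoice.total) > tolerance:
        breaks.append(("payment_vs_invoice.total", invoice.total, payment_amount,
                       payment_amount - invoice.total))

    return breaks


# Illustrative data only: the invoice bills 12.50 per unit against a PO at 12.00.
invoice = Record("INV-1", Decimal("10"), Decimal("12.50"), Decimal("125.00"))
po = Record("PO-1", Decimal("10"), Decimal("12.00"), Decimal("120.00"))
breaks = three_way_match(invoice, po, payment_amount=Decimal("125.00"))

for check, inv_val, other_val, variance in breaks:
    print(check, inv_val, other_val, variance)
```

With this data the payment agrees with the invoice, so the only breaks are on unit price and total against the purchase order, each reported with the two values and the signed variance. That is the sense in which the three either agree or they do not.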

The tools already in this space

The market is not empty, and it helps to be precise about what each category is good at before talking about where the gap is.

AP automation and workflow. Tools like Tipalti, Stampli, Bill, and MineralTree are built around capturing invoices, routing them for approval, and getting them paid. They are genuinely good at the operational pipeline: optical character recognition on invoices, approval chains, vendor onboarding, payment execution. Their checking is mostly rules and configured controls. They are strong at moving an invoice from inbox to payment, and they are not designed to act as an adversarial auditor hunting for the issue someone tried to hide.
AI-based spend and AP audit. AppZen is the clearest example of machine learning applied to expense and AP audit at scale. This is the category closest to forensic work, and it is an established product. The tradeoff is that the scoring is largely a black box. You get a risk signal, but the path from document to verdict is not something a controller can fully reconstruct and defend line by line.
Duplicate-payment and AP-control specialists. Xelix, FISCAL Technologies, and Illumis focus on exactly the leakage checks general tools treat as an afterthought. Xelix scores hundreds of data points per invoice to catch duplicates, posting errors, and overpayment risk before the pay run; FISCAL Technologies checks supplier and invoice data continuously to prevent and recover duplicates; Illumis uses fuzzy and phonetic matching on the vendor master to surface inconsistencies that ERP rules miss. They are the most direct answer to finding duplicate invoices, and the main tradeoff is the same black-box scoring limitation: a flag you cannot always trace back to source.
Close and reconciliation suites. BlackLine, Trintech, and FloQast live one step downstream, in the financial close and general ledger reconciliation. They are enterprise systems, priced and deployed accordingly, focused on reconciliation, close management, and controls rather than invoice-level forensic checking.

None of this is a knock on those products. If your problem is routing thousands of invoices a month through approvals, an automation platform solves it. If your problem is closing the books across dozens of entities, a close suite solves it. The point is narrower: none of these is built to read your files, run the actual AP checks as computation, and hand back each exception with the evidence attached. That is a different job from the one any of them was built for.

The general assistant wave is the wrong instrument

The newer pitch is that you do not need any of the above, because a general AI assistant can just do it. Microsoft Copilot, Anthropic's Cowork, and ChatGPT are all now positioned, directly or by implication, as tools that finance teams can point at their work. For drafting, summarizing, and answering questions, they are genuinely useful. For accounts payable audit, they are the wrong instrument, and it is worth being specific about why, because the reasons are technical and they do not go away with a better prompt.

Unverified arithmetic. Language models are unreliable at exact computation. They will produce a total that looks right and is wrong, and they will present it with the same confidence as a correct one. In AP, the total is the whole point. A number you cannot trust is not a smaller version of the answer. It is not an answer.
Hallucinated results. A general assistant will, under pressure, claim it checked a file it did not open, or report a match it did not actually compute. In a chat that only produces text, this is an annoyance. In AP work that decides what gets paid, it is an action taken on a fabricated premise.
No evidence trail. Ask a general assistant why it flagged something and you get a fresh paragraph of reasoning generated after the fact, not a reference to the specific rows and documents that drove the result. A controller cannot defend a finding they cannot trace back to source.
Non-determinism. Run the same invoices through a general assistant twice and you can get two different answers. For a process that has to be repeatable, auditable, and defensible, an instrument that does not return the same result for the same input is unusable, regardless of how good any single run looks.
Context drift. As the conversation and the document pile grow, a general assistant loses reliable hold on its original instructions and starts wandering. Quality degrades well before any hard limit, and it degrades silently.

We have written about these failure modes in general terms in our piece on why most AI agents fail in production. Accounts payable is where they stop being abstract. Every one of them turns directly into a payment that should not have gone out, or an exception that should have been caught and was not.
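The unverified-arithmetic point can be made concrete with a small sketch that treats any stated total, whether it came from a model, a vendor, or a spreadsheet, as a claim to be recomputed rather than an answer. The function name `verify_stated_total` and the line-item shape are assumptions made for illustration. `Decimal` is used because binary floating point cannot represent most decimal fractions exactly (in Python, `0.1 + 0.2 == 0.3` is `False`).

```python
from decimal import Decimal


def verify_stated_total(line_items, stated_total):
    """Recompute a total from (quantity, unit_price) pairs and compare it
    to the figure that was stated. Nothing is trusted; everything is derived."""
    computed = sum((qty * price for qty, price in line_items), Decimal("0"))
    return {
        "stated": stated_total,
        "computed": computed,
        "variance": stated_total - computed,
        "agrees": stated_total == computed,
    }


# Illustrative data only: a plausible-looking total that is off by exactly 1.00.
line_items = [
    (Decimal("3"), Decimal("19.99")),   # 59.97
    (Decimal("2"), Decimal("4.50")),    #  9.00
    (Decimal("1"), Decimal("120.00")),  # 120.00
]
result = verify_stated_total(line_items, stated_total=Decimal("189.97"))
print(result["computed"], result["variance"], result["agrees"])
```

The stated figure here looks entirely reasonable and is wrong. The recomputation catches it and quantifies the variance, which no amount of fluent prose around the number would do.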

So the honest framing is not that Perch is a better Copilot. It is that a general assistant is the wrong category of tool for this, in the same way a brilliant generalist with no calculator and no paper trail is the wrong hire for an audit. The right comparison is not who writes the nicer summary. It is who can be trusted with the number.

What deterministic checking does differently

The alternative is to stop asking the model to be the thing that computes the answer, and to make it the thing that orchestrates and explains while the checks themselves run as code.

In practice that means a few things hold at once:

The checks are computed, not guessed. Three-way match, duplicate detection, vendor validation, over-approval, payment variance, and reconciliation run as deterministic logic over your actual files. The total is computed. The match is computed. The model does not get a vote on the arithmetic.
Every exception carries its evidence. A flag is not a sentence asserting that something is wrong. It is the specific records, side by side, with the variance quantified, so a person can look at the same evidence and agree or overrule it.
The same input returns the same answer. Because the checking is deterministic, the process is repeatable and defensible, which is the baseline requirement for anything that touches money and gets reviewed.
The model does the part it is actually good at. Reading messy documents, routing the work to the right check, and writing up the findings in plain language. It explains the result. It does not invent it.
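As a rough sketch of what "every exception carries its evidence" and "the same input returns the same answer" can look like together, the example below runs a simple duplicate-invoice check and fingerprints the output. Everything in it is an assumption made for illustration, not Perch's actual logic: the record fields, the normalization rule, the exact-match grouping (real duplicate detection would also need fuzzy matching on vendors and dates), and the `fingerprint` helper.

```python
import hashlib
import json
import re
from collections import defaultdict
from decimal import Decimal


def normalize_invoice_number(raw):
    """Uppercase, drop punctuation and spaces, and strip zero padding,
    so 'INV-0042' and 'inv 42' compare equal."""
    cleaned = re.sub(r"[^A-Z0-9]", "", raw.upper())
    return re.sub(r"(?<!\d)0+(?=\d)", "", cleaned)


def find_duplicates(invoices):
    """Group by (vendor, normalized number, amount) and return one exception
    per group with more than one record, with the source rows attached."""
    groups = defaultdict(list)
    for inv in invoices:
        amount = Decimal(inv["amount"]).quantize(Decimal("0.01"))
        key = (inv["vendor_id"], normalize_invoice_number(inv["number"]), amount)
        groups[key].append(inv)

    exceptions = []
    for key in sorted(groups):  # fixed ordering, independent of input order
        records = groups[key]
        if len(records) > 1:
            exceptions.append({
                "check": "duplicate_invoice",
                "vendor_id": key[0],
                "amount_at_risk": str(key[2] * (len(records) - 1)),
                "evidence": sorted(records, key=lambda r: r["row"]),
            })
    return exceptions


def fingerprint(exceptions):
    """Hash the full result so two runs can be compared byte for byte."""
    payload = json.dumps(exceptions, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative data only: rows 1 and 2 are the same invoice keyed two ways.
invoices = [
    {"row": 1, "vendor_id": "V-100", "number": "INV-0042", "amount": "1500.00"},
    {"row": 2, "vendor_id": "V-100", "number": "inv 42", "amount": "1500.00"},
    {"row": 3, "vendor_id": "V-200", "number": "INV-0042", "amount": "1500.00"},
]
exceptions = find_duplicates(invoices)
same_again = fingerprint(exceptions) == fingerprint(find_duplicates(list(reversed(invoices))))
print(exceptions[0]["amount_at_risk"], same_again)
```

The exception is not a sentence saying a duplicate exists. It is the two source rows, side by side, with the amount at risk, and the fingerprint is identical even when the input arrives in a different order. Row 3 is left alone because it belongs to a different vendor.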

This is also why we are careful about which models we let near the work at all. A model that fabricates a tool result or a figure cannot be part of this, no matter how well it scores on a public leaderboard, which is the subject of our piece on how we evaluate models for verifiable work. The checks Perch runs, and the datasets they are benchmarked against, are described on the financial forensic intelligence page. Those benchmarks use realistic corpora with issues planted on purpose, so there is a known-correct answer to grade against, rather than customer outcomes we cannot show you.

Where each one fits

It is worth being plain about this rather than pretending one tool does everything.

If your problem today is operational volume, getting a large stream of invoices captured, approved, and paid, an AP automation platform is built for exactly that, and Perch is not trying to replace your accounts payable workflow. If your problem is the enterprise financial close across many entities, a reconciliation suite is the category you are shopping in.

The gap Perch is built for sits between and underneath those. It is the part where someone still has to actually run the checks, find the duplicate, catch the over-approval, prove the reconciliation, and stand behind the result with evidence. That work today is done by hand in spreadsheets, bought as a black-box risk score, or, increasingly and dangerously, handed to a general assistant that will answer confidently and cannot be checked. Doing it as deterministic computation, with the model explaining rather than inventing, is the difference between an answer that sounds right and one you can defend.

That is the bar we hold ourselves to, and it is the right bar to hold any AI finance tool to before it touches a payment: not how fluent the summary is, but whether you can check the work.

If you want to see the specific checks and the benchmarks behind them, start with the financial forensic intelligence overview, or talk to us about a team rollout.
