
Why legal AI hallucinates citations, and what real verification requires

The generation side of legal AI raced ahead. The verification side did not. Lawyers keep getting sanctioned for citations a model invented. This is why fabricated citations happen, why the tool that wrote the brief is the wrong tool to check it, and what real verification requires.

The most embarrassing failure in modern legal practice is now a recurring headline. A lawyer files a brief, opposing counsel or the judge cannot find one of the cited cases, and it turns out the case never existed. It was generated by an AI tool, in flawless citation form, and signed off on by a human who trusted it. The first widely reported instance, Mata v. Avianca in 2023, led to sanctions. It was not the last. Courts have since issued standing orders requiring disclosure of AI use, and the sanctions list keeps growing.

This is worth understanding precisely, because the lesson most people take from it is the wrong one. The problem is not that AI is useless for legal work. The generation side has become genuinely good. The problem is that generation raced ahead of verification, and verification is a different job that almost nothing on the market actually does well. This piece is about why the failure happens, why the tool that wrote the brief is the wrong tool to check it, and what real verification requires.

Why a model invents a citation

To see why fabricated citations are so persistent, you have to see what the model is actually doing when it produces one.

A language model generates text that is statistically likely given everything before it. A legal citation has a rigid, learnable form: a case name, a reporter volume, a page, a court, a year. That form is easy to imitate. The model can produce a citation such as Henderson v. Calhoun, 412 F.3d 1099 (9th Cir. 2005), which looks perfect in every respect except that no such case exists. The model never looked the case up or confirmed it. It produced something shaped like a citation, because that is what the surrounding text called for.
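The gap between "shaped like a citation" and "exists" can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the simplified pattern covers only one federal appellate citation form, and TOY_INDEX is a placeholder standing in for a lookup against an authoritative database, not real data.

```python
import re

# Simplified, illustrative pattern for one citation form:
# "Name v. Name, <volume> F.3d <page> (<court> Cir. <year>)".
# Real citation grammars are far broader than this.
CITATION_SHAPE = re.compile(
    r"^(?P<name>.+? v\. .+?), "
    r"(?P<volume>\d+) (?P<reporter>F\.(?:2d|3d|4th)?) (?P<page>\d+) "
    r"\((?P<court>.+? Cir\.) (?P<year>\d{4})\)$"
)

# Placeholder index of (volume, reporter, page) keys. In practice this
# would be a query against an authoritative source, not a hard-coded set.
TOY_INDEX: set[tuple[int, str, int]] = {(1, "F.3d", 1)}


def looks_like_citation(text: str) -> bool:
    """True if the text has the right form. Says nothing about existence."""
    return CITATION_SHAPE.match(text) is not None


def exists_in_index(text: str, index: set[tuple[int, str, int]]) -> bool:
    """True only if the parsed volume/reporter/page is present in the index."""
    match = CITATION_SHAPE.match(text)
    if match is None:
        return False
    key = (int(match["volume"]), match["reporter"], int(match["page"]))
    return key in index


invented = "Henderson v. Calhoun, 412 F.3d 1099 (9th Cir. 2005)"
print(looks_like_citation(invented))         # form check
print(exists_in_index(invented, TOY_INDEX))  # existence check
```

The form check and the existence check are separate questions, and only the second one consults anything outside the text itself.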

This is the same root cause behind every hallucination, applied to a domain where the output happens to be checkable in principle and catastrophic when wrong. A few specific factors make legal citations especially dangerous:

  • The format hides the fabrication. A wrong total in a financial model at least looks like a number you might double-check. A fabricated citation looks exactly like a real one. Its correctness is invisible on the page.
  • The quote can be wrong even when the case is real. Often the cited case exists, but the model has paraphrased a holding it does not contain, or attached a real quotation to the wrong proposition. This is harder to catch than a fully invented case, because the citation survives a quick existence check.
  • Good law is a moving target. A case can be real, accurately quoted, and still worthless because it was reversed, vacated, or overruled. A model has no reliable, current sense of subsequent history.
  • Agents multiply the surface area. The more autonomous the tool, the more steps happen between the prompt and the final document, and the more places a bad citation can be introduced and then carried forward as if it were established.

Retrieval helps with the most basic version of the problem, the wholly invented case. If the tool searches a real legal database and pulls actual cases, it is far less likely to invent one whole. This is why serious legal AI platforms are much safer than a raw chatbot. But retrieval narrows the problem rather than closing it. The model still has to read what it retrieved, characterize it correctly, and cite it for the right point, and at each of those steps it can still be confidently wrong.
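A short sketch shows why a real, retrieved case is not enough on its own: the case can exist while the language attributed to it does not. The opinion text below is invented placeholder text, and the function names are hypothetical. Exact matching after normalization is one simple approach among several, and it will not catch a paraphrased holding.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, straighten curly quotes, and collapse whitespace so that
    line breaks and typography do not cause false mismatches."""
    for curly, straight in (("\u201c", '"'), ("\u201d", '"'),
                            ("\u2018", "'"), ("\u2019", "'")):
        text = text.replace(curly, straight)
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_appears(quote: str, opinion_text: str) -> bool:
    """True if the quoted language appears verbatim (after normalization)."""
    return normalize(quote) in normalize(opinion_text)


# Placeholder text standing in for a retrieved opinion. Not a real case.
OPINION_TEXT = (
    "The district court did not abuse its discretion in denying the motion."
)

# The case was retrieved, so an existence check passes. The quotation
# attributed to it is still wrong, and only this comparison catches it.
print(quote_appears("the district court abused its discretion", OPINION_TEXT))
print(quote_appears("did not abuse\n its discretion", OPINION_TEXT))
```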

Why the generator is the wrong checker

Here is the structural issue that no amount of model quality fixes on its own. The tool that drafts the brief is also, in most products, the tool that decides the brief is fine. It is grading its own homework.

Platforms like Harvey and CoCounsel are good precisely because they are tuned to produce fluent, persuasive, well-organized legal work, fast, at the scale of a large firm. That is a real achievement and it is why most of the Am Law 100 are using something in this category. But the qualities that make a tool a great drafter (confidence, fluency, the drive to produce a complete answer) are the opposite of the qualities you want in a verifier. A verifier should be skeptical, should prefer to flag uncertainty over resolving it, and should refuse to call something checked that it did not actually check.

Asking the same system to be both the persuasive advocate and the skeptical auditor is asking it to hold two contradictory objectives at once. In practice the advocate wins, because that is what the product was built and benchmarked to be. The verification, where it exists, tends to be a softer pass by the same machine that produced the text, not an independent check against the record.

This is the same lesson we have written about in financial work, where a confident wrong total is more dangerous than an honest "I could not verify this." The argument is laid out in general terms in "Why most AI agents fail in production," and applied to accounts payable in "AI for accounts payable." Legal citations are the same failure wearing a wig: an authoritative-looking output that gets trusted because checking it is tedious and the tool sounds sure.

What real verification requires

Verifying a citation is not a vibe. It is a short list of specific checks, each with a correct answer that exists independently of anyone's opinion:

  • Existence. Does the cited authority actually exist, in the reporter, court, and year given?
  • Accuracy of the quotation or holding. Does the case actually say what it is cited as saying, in the place the pincite points to?
  • Proposition match. Is the case cited for something it genuinely supports, rather than something adjacent that it does not?
  • Pincite correctness. Does the specific page cited contain the specific language relied on?
  • Current good law. Has the authority been reversed, vacated, overruled, or superseded, in whole or for the proposition it is being used for?
  • Jurisdictional relevance. Is the authority actually controlling or persuasive for the court the document is aimed at?

Each of these is a comparison against an authoritative source, and each has a clean answer. That is exactly the kind of work that should be done as a deterministic check, run independently, with the result shown as evidence rather than asserted as a conclusion. A verifier should be able to say, for every citation in a document, which checks it passed, which it failed, and what the failing evidence is, so a lawyer can look at the same record and decide. The standard is not "the AI thinks this is fine." The standard is "here is the proof, check it yourself."
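As a rough sketch of what "which checks it passed, which it failed, and what the failing evidence is" could look like as data, consider the following. It is an illustration under stated assumptions, not a description of any product's implementation: SOURCE is a hypothetical in-memory stand-in for an authoritative database, the citations and page text are placeholders, and only three of the six checks are covered. Proposition match and jurisdictional relevance need richer inputs and are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    check: str     # which check was run
    passed: bool   # the outcome
    evidence: str  # what the source actually showed


# Hypothetical stand-in for an authoritative source. All entries are
# placeholders, not real authorities.
SOURCE = {
    "1 Ex. 100": {
        "pages": {105: "Placeholder text: the notice was timely filed."},
        "status": "good",
    },
    "2 Ex. 200": {
        "pages": {210: "Placeholder text: the claim is barred."},
        "status": "overruled",
    },
}


def verify(citation: str, pincite: int, quote: str,
           source: dict) -> list[CheckResult]:
    """Run existence, pincite, and good-law checks, keeping the evidence."""
    record = source.get(citation)
    if record is None:
        # Nothing else can be checked if the authority cannot be found.
        return [CheckResult("existence", False,
                            f"{citation!r} not found in source")]
    results = [CheckResult("existence", True, f"{citation!r} found in source")]

    page_text = record["pages"].get(pincite)
    if page_text is None:
        results.append(CheckResult("pincite", False,
                                   f"page {pincite} not in record"))
    else:
        found = quote.lower() in page_text.lower()
        results.append(CheckResult("pincite", found,
                                   f"page {pincite} reads: {page_text!r}"))

    status = record["status"]
    results.append(CheckResult("good_law", status == "good",
                               f"recorded status: {status}"))
    return results


for result in verify("2 Ex. 200", 210, "the claim is allowed", SOURCE):
    print(result.check, "PASS" if result.passed else "FAIL", "-", result.evidence)
```

The design choice worth noting is that every result carries its evidence string, so a reader sees what the source said rather than only a verdict.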

The tools that check citations, compared

Verification has become its own small category, separate from the platforms that draft. It is worth knowing what is actually on the market, because "our AI checks its own work" and "an independent tool checks the AI's work" are very different promises. Broadly, the options fall into three groups: checkers built into the drafting platform, standalone validators, and independent verification with the evidence shown.

| Tool | Approach | What it checks | Where it runs |
|---|---|---|---|
| Clearbrief | Cite-check report inside the drafting tool | Whether citations and assertions are supported, as a report | Microsoft Word add-in |
| CoCounsel (Thomson Reuters) | Citations linked back to their sources | Existence and source, grounded in Westlaw and Practical Law | Thomson Reuters platform |
| Lexis+ with Protégé | Shepard's Verify | Validation, status, and treatment (whether a case is still good law) | LexisNexis platform |
| JurisCheck | Deterministic validation against public databases | Existence and Bluebook formatting against CourtListener, Justia, GovInfo | Standalone |
| CiteCheck AI (LawDroid) | Free extract-and-check | Existence of cited cases in an uploaded brief | Standalone, free |
| BriefCatch | Citation validation engine | Formatting and existence signals for cited authority | Microsoft Word add-in |
| Perch | Independent, claim-by-claim checking with evidence attached | Whether each claim traces to a source that actually supports it; unsupported lines flagged | Web, desktop, and CLI, over your own files |

Two things stand out. First, several of the strongest options — the ones grounded in Westlaw or Shepard's — live inside the platform that also drafts, which brings us back to the grading-your-own-homework problem: better than a raw chatbot, but not an independent check. Second, the standalone validators are moving in the right direction, treating verification as a separate step against authoritative sources rather than a softer pass by the same machine that wrote the text.

Perch belongs with the standalone validators, with two differences that matter for the checks in the list above. It runs verification independently of whatever produced the draft, and it shows the passage behind each result rather than returning a bare verdict, so a lawyer can look at the same record and decide. It does this across the web app, the desktop operator, and the CLI, working over the documents and sources you already have. The deeper legal-specific checks (pincite accuracy, proposition match, current good law) are the direction verification has to keep moving, and they are the standard Perch is built toward: computed and shown, not generated and trusted.

None of this means the generation tools are a mistake. If your firm needs to draft, summarize, and research at scale, that category is delivering real value, and the leaders earned their position. The point is narrower and it is the same point that runs through everything we build: generation and verification are different jobs, and the second one is the one that keeps lawyers out of trouble.

The work of confirming that every cited authority exists, says what it is claimed to say, supports the proposition it is offered for, and is still good law is verification work. It is tedious, it is exactly checkable, and it is precisely the kind of task that should be computed and shown rather than generated and trusted. That is the standard Perch is built around: deterministic checks with the evidence attached. It is demonstrated today in financial forensics on the financial forensic intelligence page, and held to the model bar described in "How we evaluate models for verifiable work."

The test is the same one we apply everywhere, and it is the right test for any AI tool that touches a filing: not how good the draft sounds, but whether you can check the work. If that is the problem you are trying to solve, talk to us.