Show your work, or do not ship

Before you let an AI near anyone's health coverage, test the records it would inherit. Call the providers in an insurer's mental-health directory and count how many can actually see you. Then pull a random week of the insurer's denials and check each one against the coverage rules it claims to apply. Both audits have been run, by the United States Senate and by a federal inspector general, and both came back the same way. The records fail.

The scale of the machine those records feed is worth holding in your head. When 16 state insurance regulators surveyed 93 large health insurers through the NAIC in late 2024, 84 percent reported using AI or machine learning: 56 percent for utilization management and 44 percent for claims adjudication. Medicare Advantage insurers alone processed nearly 53 million prior-authorization determinations in 2024, about a hundred every minute, around the clock, all year. On the receiving end, physicians are about as close to unanimous as medicine gets: 94 percent told the AMA that prior authorization delays the care their patients need, and 29 percent said it has caused a serious adverse event for a patient of theirs.

This paper makes one claim and then defends it in detail. In benefits, an AI that cannot prove why it did what it did is a hazard, because the records it consumes and produces sit in an industry where the existing records are demonstrably, measurably wrong. Provenance cannot be a feature added later. It is the load-bearing wall. I will show you the wrongness first, exhibit by exhibit, because the architecture only makes sense once you have seen what it is built against.

And the fair objection should go on the table now: this argument comes from people who build benefits AI, so you should discount for that. What I can offer against the discount is specificity. The exact failures, with their primary sources. The exact architecture, what it refuses to do, what the rigor costs us, and what would prove us wrong.

A Keel Labs paper, written by Garrett · Figures are point-in-time and directional · Every number traces to a source at the end

80.7% of appealed Medicare Advantage prior-authorization denials were overturned in 2024, and only 11.5 percent of denials were appealed. For most people the first answer, however wrong, was the final one.

Exhibit one

Call the number in the directory

Start with the most basic record an insurer keeps: the list of doctors it says you can see. In May 2023, Senate Finance Committee staff ran a secret-shopper study against that list. They pulled provider directories from 12 Medicare Advantage plans across six states and placed 120 calls to listed mental-health providers, asking for an appointment the way any member would. A third of the listings were inaccurate outright: wrong or disconnected numbers, or calls that no one ever returned. Counting providers who turned out to be out of network or closed to new patients, more than 8 in 10 listings were ghosts. The shoppers secured an appointment 18 percent of the time. In Oregon, across every plan tested, the success rate was zero.

Seven months later the New York Attorney General ran a larger version of the same test and got the same answer. Her office called 396 mental-health providers listed in the directories of 13 health plans operating in New York. Eighty-six percent were ghosts: unreachable, not actually in network, or not taking patients. Fifty-six of the 396 offered an appointment. One plan, EmblemHealth, later paid $2.5 million in a settlement and agreed to rebuild its directory practices. The directory is a record the insurer is required by law to publish and maintain. Regulators measure network adequacy against it. People choose plans by it. And when somebody finally dialed the numbers, it failed seventeen calls out of twenty.

Figure 2 · The directory test

A secret shopper works the insurer's own provider list. Five of 28 calls end in an appointment, the Senate Finance rate of 18 percent. In Oregon the rate was zero. Senate Finance Committee secret-shopper study, May 2023.

A wrong directory moves money and care. RTI International examined claims for more than 22 million privately insured people from 2019 through 2021 and found patients going out of network 8.9 times more often for a psychiatrist and 10.6 times more often for a psychologist than for medical or surgical clinicians; for sub-acute behavioral inpatient care, 19.9 times. Some of that gap is the ghost directory operating exactly as you would predict: the in-network provider you were promised does not exist, so you pay out-of-network rates or stop looking. Notice the design failure underneath. Nothing in the system ever required a directory entry to prove itself. The record was asserted once, never verified, and load-bearing the entire time.

18% of 120 secret-shopper calls to providers listed in Medicare Advantage directories ended in an offered appointment. The directory is the insurer's own published record of its network.

Exhibit two

Audit the denial against its own rulebook

The denial is a record too: a written assertion that a service is not covered under the rules. In 2024, Medicare Advantage insurers denied 4.1 million prior-authorization requests, 7.7 percent of all determinations and up from 6.4 percent the year before. Of those denials, 11.5 percent were appealed. Of the appeals, 80.7 percent were overturned, fully or partially, and that is no fluke of one bad year: in every year KFF has measured since 2019, more than eight in ten appealed denials fell. A decision that fails review four times out of five survives, overall, more than nine times out of ten, because almost nobody asks for the review.
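The arithmetic behind that last sentence can be checked in a few lines. This is a minimal sketch that uses only the two KFF rates quoted above; the function and variable names are illustrative, and a partial overturn is counted as an overturn, as in the source figure.

```python
# Rates quoted above from KFF's 2024 Medicare Advantage figures.
APPEAL_RATE = 0.115     # share of denials that were appealed
OVERTURN_RATE = 0.807   # share of appeals overturned, fully or partially


def denial_funnel(appeal_rate: float, overturn_rate: float) -> dict[str, float]:
    """Split all denials into those reversed on appeal and those left standing."""
    overturned = appeal_rate * overturn_rate  # appealed AND overturned
    standing = 1.0 - overturned               # never appealed, or appealed and upheld
    return {"overturned": overturned, "standing": standing}


funnel = denial_funnel(APPEAL_RATE, OVERTURN_RATE)
print(f"reversed on appeal: {funnel['overturned']:.1%}")  # 9.3%
print(f"left standing:      {funnel['standing']:.1%}")    # 90.7%
```

Roughly 9 percent of all denials end up reversed, so about 91 percent stand. That is not because they were reviewed and upheld. Most were never reviewed at all.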

Appealed denials are a selected sample, so the more damning evidence is the audit with no selection in it. In April 2022, the HHS Office of Inspector General took a stratified random sample of denials issued by the 15 largest Medicare Advantage organizations during a single week of June 2019 and checked each one, case by case, against Medicare's own coverage rules. Thirteen percent of the denied prior-authorization requests met those rules and should have been approved. Among payment denials, 18 percent. Not the strongest cases, not the squeaky wheels. Random ones. The record said "not covered," and for roughly one in eight, the record was simply false by the standard it claimed to apply.

13% of randomly sampled Medicare Advantage prior-authorization denials met Medicare coverage rules and should have been approved, per the HHS inspector general. No appeals, no selection. The record was checked against its own rulebook and failed.

Who writes that record is changing fast, and not toward more explainability. The Senate Permanent Subcommittee on Investigations read more than 280,000 pages of internal documents from the three insurers covering nearly 60 percent of Medicare Advantage enrollees. It found UnitedHealthcare's denial rate for post-acute care climbing from 10.9 percent to 22.7 percent between 2020 and 2022, the same window in which the company rolled out initiatives to automate the process, including a working group exploring machine-learning models to predict which denials would be appealed. Humana denied post-acute care requests in 2022 at more than sixteen times its overall denial rate. These are majority-staff findings, not adjudicated facts, and post-acute care is a small, costly category where ratios run hot. The direction still matters: the most contested decisions in the system are precisely the ones being handed to automation that keeps no legible account of itself.

A denial that cannot explain itself and a directory that cannot pass a phone call are the same defect. A record was asserted, never proven, and people built their lives on it.

Exhibit three

Even the error statistic fails the trace

Here is the exhibit I find most clarifying, because it implicates everyone, this paper's authors included. The most famous number in all of medical billing is that "80 percent of medical bills contain errors." You have seen it. It appears in news features, vendor decks, hospital-bill explainers, and the marketing of more or less every billing-advocacy service in the country. We tried to trace it to a study. There is no study. The trail runs back through years of articles citing other articles and dead-ends at a bill-review advocacy business describing its own self-selected sample, with no published methodology and no denominator. The companion figures that travel with it, a "49 percent" attributed vaguely to the GAO and a $1,300-per-bill claim attributed to Equifax, dead-end the same way.

Now compare the number that survives the same walk. Every year CMS draws a stratified random sample of Medicare fee-for-service claims, pulls the underlying medical records, re-adjudicates each claim by hand against payment rules, and publishes the method. The program is called CERT, and its finding for fiscal 2025 was an improper payment rate of 6.55 percent, or $28.83 billion, down from 7.66 percent the year before. That figure includes underpayments as well as overpayments, it measures Medicare payments rather than consumer bills, and it is the only number in this genre with an audit trail. One number has a methodology and a paper trail. The other has momentum. In benefits, most of the records you will meet are the second kind.

Figure 3 · Trace the citation

Two famous error numbers walk backward toward their sources. The folklore trail dead-ends at an advocacy claim with no study behind it. The audited trail reaches medical records sampled at random. CMS CERT FY2025; citation trace by Keel Labs.

Sit with what CERT costs to produce, because it is the whole argument in miniature. An audited number exists for exactly one corner of the system, and it exists because the government pays an army of reviewers to re-open sampled claims and re-decide each one against the medical records, after the fact. That is provenance bolted on: heroic, expensive, annual, and covering a sample. The engineering question this paper cares about is what it takes to make every decision carry its evidence with it at the moment it is made, so that the audit is a query instead of an expedition. The folklore stat also sets a rule we apply to ourselves: a number that cannot be traced does not get repeated with a hedge in front of it. It gets cut. You will not find the 80 percent figure asserted anywhere in this paper, and every number we do assert is sourced at the end.

The pattern

Deterministic guardrails around a probabilistic model

One more number reframes what kind of problem this is. When HealthCare.gov insurers reported why they denied 85 million in-network claims in 2024, only 5 percent of the reasons involved medical necessity. Administrative reasons took 25 percent, exclusions 13 percent, missing authorizations 9 percent, and the single largest category, at 36 percent, was the one insurers themselves label "other." The machine hurting people is mostly not exercising clinical judgment. It is plumbing. And plumbing, unlike judgment, is exactly the kind of system that can be made to show its work.

Neither blanket trust in a model nor a ban on it survives contact with those numbers. The pattern that works in production is a division of labor. The model is the expert that reads and reasons over plan documents and regulations. Its output is wrapped by a versioned rule engine tied to the specific regulation it enforces, so the same input produces the same, explainable result every time, and a change in behavior can only come from a change in a named version. The model proposes. A deterministic layer and the source document dispose. That is how you get a reproducible decision out of a probabilistic system, and reproducibility is what an auditor or an angry member on the phone actually needs.
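One way to picture that division of labor is the sketch below. It is a minimal illustration under stated assumptions, not a description of any production system: the `Proposal`, `Rule`, and `Decision` types, the `decide` function, and the physical-therapy visit limit are all hypothetical. The model's output arrives as structured facts plus citations, and a pure function tied to a named rule version turns them into an outcome.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Proposal:
    """What the model hands over: structured facts plus the source lines they came from."""
    facts: dict
    citations: tuple[str, ...]


@dataclass(frozen=True)
class Rule:
    rule_id: str                   # hypothetical identifier
    version: str                   # behavior may change only when this changes
    regulation: str                # the regulation or plan clause the rule enforces
    check: Callable[[dict], bool]  # pure function of the facts, no model call


@dataclass(frozen=True)
class Decision:
    outcome: str                   # "approve" or "escalate"; this sketch never denies
    rule_id: str
    rule_version: str
    regulation: str
    citations: tuple[str, ...]


def decide(proposal: Proposal, rule: Rule) -> Decision:
    """Deterministic layer: the same proposal and rule version give the same decision."""
    if not proposal.citations:
        outcome = "escalate"       # nothing to ground the facts in
    elif rule.check(proposal.facts):
        outcome = "approve"
    else:
        outcome = "escalate"       # route to a person instead of denying
    return Decision(outcome, rule.rule_id, rule.version, rule.regulation, proposal.citations)


# Hypothetical rule and proposal, invented for illustration only.
visit_limit_v1 = Rule(
    rule_id="pt-visit-limit",
    version="1.0.0",
    regulation="Example Plan, section 4.2 (hypothetical)",
    check=lambda facts: facts["visits_used"] < facts["visits_allowed"],
)
proposal = Proposal(
    facts={"visits_used": 12, "visits_allowed": 20},
    citations=("Example Plan, section 4.2, line 3",),
)
first = decide(proposal, visit_limit_v1)
second = decide(proposal, visit_limit_v1)
print(first == second, first.outcome, first.rule_version)
```

Because `decide` reads nothing but its arguments, replaying a logged proposal against the logged rule version reproduces the decision exactly. The probabilistic part is confined to producing the `Proposal`.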

Figure 4 · Every action through the gates

No action reaches a person until it passes each compliance gate, and the whole path is written to an immutable log.
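The caption can be read as a small pipeline. The sketch below is one hedged way to write it, with assumptions labeled: the gate names, the payload keys, and the `run_gates` function are hypothetical, the gate bodies are placeholders for real checks, and a plain Python list stands in for the immutable log. A bias-check gate would slot into the same list; it is omitted because nothing here specifies what it would test.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Action:
    kind: str      # e.g. "send_answer" (hypothetical)
    payload: dict  # hypothetical keys, read by the gates below


# A gate returns None to pass, or a human-readable reason to stop.
Gate = Callable[[Action], Optional[str]]


def phi_gate(action: Action) -> Optional[str]:
    if action.payload.get("contains_unredacted_phi"):
        return "phi: unredacted PHI in outbound payload"
    return None


def eligibility_gate(action: Action) -> Optional[str]:
    if not action.payload.get("member_eligible"):
        return "eligibility: member eligibility not confirmed"
    return None


def human_review_gate(action: Action) -> Optional[str]:
    if action.payload.get("adverse") and not action.payload.get("reviewed_by"):
        return "review: adverse outcome needs a licensed reviewer"
    return None


def run_gates(action: Action, gates: list[tuple[str, Gate]], log: list[dict]) -> bool:
    """Run gates in order, log every step, and release only if all of them pass."""
    for name, gate in gates:
        reason = gate(action)
        log.append({"gate": name, "passed": reason is None, "reason": reason})
        if reason is not None:
            return False  # stop at the first failing gate and record why
    log.append({"gate": "release", "passed": True, "reason": None})
    return True


GATES = [("phi", phi_gate), ("eligibility", eligibility_gate), ("human_review", human_review_gate)]
trail: list[dict] = []
released = run_gates(Action("send_answer", {"member_eligible": True, "adverse": True}), GATES, trail)
print(released, trail[-1]["reason"])
```

Here the action is held at the human-review gate, and the last log entry names the reason. The gates run before release rather than as a review afterward, and a stopped action always carries the reason it was stopped.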

Self-checking, grounded

A model can check its own work, but only against something real

There is a genuine result here that is easy to get wrong in both directions. Models do improve when they critique themselves against a written principle or agree across independent reasoning paths; Anthropic's Constitutional AI and the self-consistency line of work both show it. The opposite is also documented: a model marking its own homework in a vacuum can get worse, not better. Huang and colleagues put it bluntly in 2023: intrinsic self-correction, with nothing external to check against, is not reliable. Confidence is not evidence, in models or in people.

So the synthesis we build around is simple to state and strict in practice. Self-checking works when it is grounded in something real, a plan document or a versioned rule, never in the model's own sense of certainty. We never let the model be both the author and the only judge. Every check it runs names the external thing it checked against, and that name goes into the log with everything else.
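A grounded check can be very small. The sketch below is a minimal example, assuming the model is asked to return the passage it relied on; `grounded_check`, `CheckResult`, and the sample plan text are hypothetical. The check passes only if that passage actually appears in a named external source, and the source's name travels with the result so it can be logged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    passed: bool
    checked_against: str  # name of the external source, meant to be written to the log
    detail: str


def normalize(text: str) -> str:
    """Collapse whitespace and case so line wrapping does not defeat the match."""
    return " ".join(text.lower().split())


def grounded_check(quote: str, source_name: str, source_text: str) -> CheckResult:
    """Pass only if the model's quoted support appears in the named source.

    The model's own confidence is deliberately not an input.
    """
    if not quote.strip():
        return CheckResult(False, source_name, "no quote supplied")
    if normalize(quote) in normalize(source_text):
        return CheckResult(True, source_name, "quote found in source")
    return CheckResult(False, source_name, "quote not found in source")


# Hypothetical plan text, invented for illustration.
plan_name = "Example Plan v2024-01 (hypothetical)"
plan_text = """Section 4.2 Physical therapy.
The plan covers up to 20 outpatient visits
per calendar year."""

ok = grounded_check("covers up to 20 outpatient visits per calendar year", plan_name, plan_text)
bad = grounded_check("covers unlimited outpatient visits", plan_name, plan_text)
print(ok.passed, bad.passed, bad.checked_against)
```

Substring matching is the crudest possible grounding and is used here only to keep the example short. What matters is the shape: the judge is the document, not the author, and every result records which document that was.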

In a system that touches someone's coverage, provenance is the architecture. Add it at the end and you have built a different, worse system.

Figure 5 · The audit trail

Each decision appends an entry linked to the one before it, with its source and timestamp. The trail is built as the work happens, not reconstructed later.
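A common way to link each entry to the one before it is a hash chain, sketched below. Treat it as an assumption rather than a description of the system in this paper: the field names and the SHA-256 choice are illustrative. A chain like this makes tampering detectable rather than impossible, so real immutability would still depend on where the log is stored.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(body: dict) -> str:
    """Hash a canonical JSON encoding so the same body always hashes the same."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(log: list[dict], decision: str, source: str) -> dict:
    """Append a new entry that commits to the hash of the entry before it."""
    body = {
        "index": len(log),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
        "source": source,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry = {**body, "hash": entry_hash(body)}
    log.append(entry)
    return entry


def verify(log: list[dict]) -> bool:
    """Recompute every hash and link; an edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or entry_hash(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log: list[dict] = []
append_entry(log, "approve", "Example Plan, section 4.2, line 3 (hypothetical)")
append_entry(log, "escalate", "no source located")

# Simulate an after-the-fact edit on a copy of the log.
tampered = [dict(e) for e in log]
tampered[0]["decision"] = "deny"
print(verify(log), verify(tampered))
```

With entries shaped like this, the question "which source supported decision N" becomes a lookup, and "has anything been changed since" becomes a call to `verify`.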

The frontier

Provenance is becoming law

This stopped being an engineering preference and became statute while the industry was still deploying. California's SB 1120, the Physicians Make Decisions Act, in force since January 2025, says an algorithm cannot make the final medical-necessity call; only a licensed physician or a licensed professional competent in the clinical issue can. The NAIC's model bulletin on insurers' use of AI, which asks for a documented governance program rather than assurances, had been adopted by 24 states as of March 2025 and by more than half the states before the year was out. And in the turn I did not expect, CMS itself began accepting AI-screened prior-authorization requests in traditional Medicare in January 2026, under a six-state model called WISeR, with one standing rule: every recommendation not to pay must be made by a licensed clinician. The agency that publishes the overturn statistics now runs the machinery it catalogued. You can read that as irony. I read it as the strongest available evidence that this technology is staying, and that the whole regulatory argument has collapsed onto a single question: can the system prove why it did what it did? Our bar is the one the regulators are converging on. Not "the model seemed confident." Here is the rule, and here is the line in the source that satisfied it.

The limits

Where rigor costs us, and why we pay it anyway

Determinism has a price and I will not pretend it away. Guardrails that are too rigid will refuse a legitimate case that does not fit the rule's shape, and a system that escalates too eagerly buries a human reviewer in noise. Our answer is to make the boundary itself legible: when the system declines or escalates, it says exactly which rule or missing source stopped it, so a person can resolve the case instead of fighting a black box. A guardrail you can see is one you can fix. A guardrail you cannot see is just another ghost entry in another directory.

Two things would change our mind. If a frontier model someday passes a million-decision audit with no rule engine and no trace, just raw competence, the deterministic wrapper becomes ceremony and we will retire it gladly; nobody has shown that audit. And we are watching WISeR, because CMS will have to publish what its model does. If human-reviewed AI screening in traditional Medicare produces overturn numbers that look like Medicare Advantage's, the lesson will be that a clinician checkpoint bolted onto an opaque system is theater, and the trace requirement will tighten everywhere. Either way, the receipts decide. That is the point.

What Keel Labs is building

An agent that cannot take an action it is unable to prove, and writes down every move it makes.

This is the layer that makes the other papers safe to ship. The relay showed why truth about a plan is scattered. No-price and personalized enrollment showed what becomes possible once a model can see and reason over it. None of that is allowed near a person's coverage unless it can prove itself, because the industry it operates in has already shown what asserted-but-unverified records do to people. Provenance is the floor we build on, not a finishing step.

The model proposes, never disposes alone

Fathom reads and reasons, then hands its proposal to a versioned rule engine tied to the exact regulation in play. The deterministic layer makes the outcome reproducible: same input, same result, every time.

Every claim grounded to a source

No answer leaves the system without a citation to the line it came from. A claim that cannot find its source is withheld, not guessed. The same rule applies to the numbers in our research.

Every action through the gates

PHI handling, eligibility rules, bias checks, and human review are gates an action passes before it reaches a person, not reviews that happen after the fact.

A trail built as the work happens

Each decision appends an immutable, linked log entry with its source and timestamp. CERT proves an after-the-fact audit takes an army and covers a sample. Built in, the audit is a query.

A licensed human holds final authority

The agent advises and routes. It never denies care on its own. When it is unsure, it escalates to a person and says exactly which rule or missing source stopped it.

We have watched what unproven records do in this industry: directories that fail a phone call, denials issued at scale with no way to explain the reason to the person they happened to. We are building the inverse on purpose, and we think it is the only kind of benefits AI that should exist.

An answer you cannot trace is an answer you cannot use. So we built the tracing in first, and let everything else stand on it.

What this paper does not claim

We are not claiming proof eliminates judgment, or that a rule engine can encode every nuance of care; determinism trades flexibility for accountability and we accept that cost deliberately. The 80.7 percent overturn rate covers appealed denials, a selected group, and must not be read as "80 percent of denials are wrong"; the unselected evidence is the OIG's random sample, 13 to 18 percent, drawn from one week of June 2019 cases at the 15 largest Medicare Advantage organizations. The secret-shopper studies are small by design, 120 calls and 396 calls, and tested mental-health listings specifically, where networks are thinnest. The Senate subcommittee findings come from a majority staff report, not an adjudicated record, and post-acute care is a small, costly category where ratios run hot. The NAIC adoption figures come from a survey of 93 large insurers in 16 states, a sample rather than a census, and the 94 percent figure polls physicians, who have every reason to resent prior authorization. CERT measures Medicare fee-for-service payments against documentation and payment rules; it counts underpayments as well as overpayments, says nothing about the bills consumers receive, and cannot be read as "bills are mostly fine." The 5 percent medical-necessity share is from HealthCare.gov plans and may not generalize to employer coverage, where denial data is barely collected at all. The load-bearing claims survive the caveats: the directory audits, the appeal outcomes, the OIG sample, and the CERT rate are all checks of records against their own stated standards, and the records failed.

Sources

KFF, Medicare Advantage Insurers Made Nearly 53 Million Prior Authorization Determinations in 2024 (Jan 2026; CMS data: ~53M determinations, 4.1M denied, 7.7% vs 6.4% in 2023; 11.5% of denials appealed; 80.7% of appeals overturned; >8 in 10 every year since 2019) · HHS Office of Inspector General, OEI-09-18-00260 (Apr 2022; stratified random sample, 15 largest MA organizations, one week of June 2019: 13% of prior-auth denials and 18% of payment denials met Medicare coverage rules) · U.S. Senate Committee on Finance, majority staff secret-shopper study (May 2023; 120 calls, 12 MA plans, 6 states; 33% of listings inaccurate; >80% ghosts; appointments secured 18% of the time, 0% in Oregon) · Office of the New York State Attorney General, "Inaccurate and Inadequate" (Dec 2023; 396 listed mental-health providers called across 13 plans; 86% ghosts; 56 offered appointments) and EmblemHealth settlement ($2.5M) · RTI International, behavioral health out-of-network study (2024; 2019 to 2021 claims, 22M+ lives: out-of-network use 8.9x for psychiatrists, 10.6x for psychologists, 19.9x for sub-acute behavioral inpatient) · CMS, Fiscal Year 2025 Improper Payments Fact Sheet, CERT program (Medicare FFS improper payment rate 6.55%, $28.83B; FY2024: 7.66%, $31.70B) · "80% of medical bills contain errors": traced by Keel Labs to bill-review advocacy marketing; no study located; companion "49%"/GAO and Equifax $1,300 claims equally untraceable; treated here as folklore and not asserted · NAIC Health AI/ML Survey (fielded Nov 2024 to Jan 2025, 93 large insurers, 16 states; 84% using AI/ML, 56% utilization management, 44% claims adjudication) · AMA Prior Authorization Physician Survey, 2024 (n=1,000; 94% report delayed care; 29% report a serious adverse event) · U.S. Senate Permanent Subcommittee on Investigations, majority staff report on Medicare Advantage prior authorization (Oct 2024; 280,000+ pages; UnitedHealthcare post-acute denial rate 10.9% to 22.7%, 2020 to 2022; Humana post-acute denials >16x its overall rate, 2022) · KFF, Claims Denials and Appeals in ACA Marketplace Plans in 2024 (denial reasons: 36% "other," 25% administrative, 13% excluded service, 9% prior auth/referral, 5% medical necessity) · California SB 1120 (2024, effective Jan 2025) · NAIC Model Bulletin on AI Systems, state-adoption tracker (24 states, Mar 2025; majority of states by Dec 2025) · CMS WISeR Model (Jan 2026 to Dec 2031; AZ, NJ, OH, OK, TX, WA; licensed-clinician review of every non-payment recommendation) · Anthropic, Constitutional AI · Wang et al., self-consistency · Huang et al. 2023, "Large Language Models Cannot Self-Correct Reasoning Yet" · NIST AI RMF · ISO/IEC 42001. Figures are point-in-time and directional.

Keep reading: a denial is a decision that profits from going unchecked. Receipts are how you check it.