We pre-registered a study and report what we found, including what did not work.

In March 2026 we pre-registered a validation study on the Open Science Framework, before collecting any data, with independent academic oversight. The study covered twenty publicly traded U.S. companies across four analyst-coverage tiers. For each company, a frontier model generated a detailed profile from training knowledge alone. We then verified those profiles two ways: a single-model self-check by each of three independent frontier models, and the full adversarial protocol.

What self-check missed

Self-check error counts varied by up to 9.6:1 on the same document depending on which model did the checking. The harshest checker passed three companies' profiles as error-free; adversarial verification later identified 16 verifiable errors across those three documents.
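One plausible way to compute a spread like the 9.6:1 figure is to divide the highest checker's error count by the lowest for the same document. The sketch below assumes that definition, which this page does not spell out. The function name and the counts are invented for illustration, chosen only so the ratio comes out at 9.6; they are not the study's data.

```python
def spread_ratio(error_counts):
    """Ratio of the highest to the lowest error count for one document.

    Returns None when any checker reported zero errors, because the
    ratio is undefined in that case.
    """
    highest = max(error_counts.values())
    lowest = min(error_counts.values())
    if lowest == 0:
        return None
    return highest / lowest


# Hypothetical counts for a single document, NOT taken from the study.
hypothetical = {"checker_a": 48, "checker_b": 17, "checker_c": 5}
print(spread_ratio(hypothetical))  # 9.6
```

Under this definition, a document that one checker passes as error-free has no finite ratio at all, so a summary ratio can understate the disagreement on exactly those documents.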

What the Gauntlet caught

Fabricated executives at two companies. Generated profiles named senior executives who do not exist, including a Chief Growth Officer. Caught against corporate filings.
An invented settlement. A profile described a nine-figure settlement between two companies that never occurred. Caught against court records.
A misidentified officer. A profile named a CFO who had already been succeeded; the successor appears in the company's press releases and leadership page. Caught against the corporate record.
Inflated market share. A profile stated market share at roughly double the documented figure. Caught against market analyses.
A four-fold overstatement. Research and development spending asserted at roughly four times the documented figure. Caught against the filings.
Arithmetic that does not compute. Plain arithmetic errors, including a financial walk asserting 1.2 + 6.6 = 12.5. Caught by deterministic math verification.
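The arithmetic catch above is the easiest kind to reproduce. The sketch below is a minimal illustration of a deterministic sum check, not the study's actual verifier: the function name `check_sum` and the tolerance are assumptions. It uses decimal arithmetic so that figures like 1.2 and 6.6 are compared exactly rather than through binary floating point.

```python
from decimal import Decimal


def check_sum(terms, claimed_total, tolerance="0.05"):
    """Check a claimed sum of decimal strings.

    Returns (ok, actual), where ok is True when the claimed total is
    within `tolerance` of the recomputed sum. The tolerance is an
    assumed allowance for rounding in the source document.
    """
    actual = sum((Decimal(t) for t in terms), Decimal("0"))
    ok = abs(actual - Decimal(claimed_total)) <= Decimal(tolerance)
    return ok, actual


# The financial walk quoted above: 1.2 + 6.6 asserted to equal 12.5.
ok, actual = check_sum(["1.2", "6.6"], "12.5")
print(ok, actual)  # False 7.8
```

A check like this needs no model judgment at all, so it gives the same verdict on every run.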

Phase 2 in numbers

In the study's second phase, the improved pipeline examined 1,829 claims across the twenty companies and surfaced 84 raw debunked claims, of which an estimated 65 to 75 are genuine after accounting for documented false-positive categories. Most claims survive verification: 79.7% were confirmed. The system is built to find the ones that should not.
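The rates implied by these Phase 2 counts can be worked out directly. The sketch below assumes that precision means genuine catches divided by raw debunked claims, which is our reading and not a definition given on this page. The function name and dictionary keys are illustrative, and the confirmed count is only an approximation back-calculated from the rounded 79.7%.

```python
def phase2_summary(total, raw_debunked, genuine_range, confirmed_share):
    """Derive headline rates from the reported Phase 2 counts."""
    low, high = genuine_range
    return {
        # Share of all examined claims that were flagged as debunked.
        "raw_debunk_rate": raw_debunked / total,
        # Assumed definition: genuine catches / raw debunked claims.
        "precision_low": low / raw_debunked,
        "precision_high": high / raw_debunked,
        # Approximate only: the source reports a rounded percentage.
        "approx_confirmed": round(confirmed_share * total),
    }


summary = phase2_summary(1829, 84, (65, 75), 0.797)
for name, value in summary.items():
    text = f"{value:.3f}" if isinstance(value, float) else str(value)
    print(f"{name}: {text}")
```

On these inputs the raw debunk rate is about 4.6% of claims, and the estimated precision falls between roughly 77% and 89%.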

What we tell you that vendors usually do not

The precision figure above (65 to 75 genuine out of 84 raw debunked claims) is an engineering estimate, not yet a human-verified number; independent human evaluation is in progress. One of our six pre-registered hypotheses failed on its registered metric, and we report it as registered rather than reframing it. One of our originally reported catches turned out to be a false positive in our own pipeline; we found it because we ran the Gauntlet on our own paper, and we corrected our numbers downward. The full limitations section comes before the results section in the paper, on purpose.

Data availability

The study was pre-registered on the Open Science Framework before data collection. The complete data package (raw transcripts, result files, and all self-verification runs) is available from the authors on request, with public release planned to accompany the final version of the paper.

SELF-VERIFICATION

We run it on our own work.

Before releasing our validation study, we submitted the paper itself to the Gauntlet: five drafts, the assembled manuscript, a run on author commentary, and a post-draft check in June 2026. Every run returned a failing grade. We published the breakdown anyway, because the breakdown is the point.

The system caught four real errors in our own work before the public ever saw them: a citation that did not survive verification, a false positive in our own headline results (we cut our Phase 1 catch count from 27 to 26), an imprecise description of prior research, and an arithmetic inconsistency in our own abstract that six prior runs and multiple human passes had missed.

It also misfired, and we catalogued every category of misfire: entity-resolution failures, knowledge-cutoff false positives, phantom claim extraction, cached verdicts replayed as fresh evidence. Eight documented failure modes, published in the paper, because anyone using a verification system in production deserves to know exactly how it can be wrong.

If a verification vendor has never told you how their system fails, ask why.