GauntletScore

A trust score backed by math, not opinion.

GauntletScore verifies the claims in a document against primary sources, then computes a Bayesian posterior probability that they hold, reported with the evidence behind every point.

SEC 10-K · Sample Filing · Verified 3 minutes ago

Score 94 · Grade A

Certificate

GS-2024-09847

Claim Verdicts

Revenue increased 8.2%: VERIFIED

Acquisition of Acme Corp: DEBUNKED

Gross margin 54.3%: VERIFIED

Named executive matches filing: VERIFIED

Source Badges

SEC EDGAR · eCFR · CourtListener · PubMed

Signature: Ed25519 verified · 7 agents · 4 providers

THE SCORE

The score is a probability, derived in one line of math.

log-odds(posterior) = log-odds(prior) + Σ_i ( w_i · log LR_i )

Every verified claim contributes one term. The prior is the baseline for the document type. Each term is a likelihood ratio weighted by the quality of the source that produced it. The sum, passed through a logistic function, is the score.
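
As a rough illustration of that one line of math (not the production engine), the update can be sketched in Python. The prior, the weights, and the likelihood ratios below are invented for the example, and the function names are hypothetical.

```python
import math

def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))

def posterior(prior: float, evidence: list[tuple[float, float]]) -> float:
    """Combine a prior with (weight, likelihood_ratio) pairs.

    Implements log-odds(posterior) = log-odds(prior) + sum(w_i * log(LR_i)),
    then maps the sum back to a probability with the logistic function.
    """
    log_odds = logit(prior) + sum(w * math.log(lr) for w, lr in evidence)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Illustrative numbers only: three supporting claims and one contradicted claim.
evidence = [(1.0, 4.0), (0.8, 3.0), (1.0, 2.0), (0.9, 0.2)]
score = posterior(prior=0.5, evidence=evidence)
print(f"{score:.3f}")
```

A likelihood ratio of 1 adds nothing to the sum, and an empty evidence list returns the prior unchanged.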

Deterministic.

The same evidence always produces the same score, to the digit. The math has no moods and no temperature setting.

Auditable.

Every point traces to a specific claim, verdict, source, and likelihood ratio. "Why this score" has a real answer.

Bounded.

Cromwell's rule: no finite evidence reaches certainty. The score reports a 95% interval that widens when evidence is thin.

WHY IT CAN BE TRUSTED

The AI is the witness. The math is the judge.

A second model asked to check the first produces an opinion that cannot be audited, with errors correlated to the ones it is checking. GauntletScore removes the model from the verdict entirely.

EVIDENCE LAYER

Seven agents, four model providers. Each claim is checked against primary sources: SEC EDGAR filings, court records, regulatory databases, the medical literature. Agents disagree, escalate, and produce verdicts. This is the sensor array, and it is the part anyone can build.

SCORING LAYER

Deterministic math only. The verdicts become a Bayesian posterior. No language model runs here. An automated integrity test fails the build if anyone wires one in.
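
One way such an integrity test could work, sketched here under our own assumptions: parse the scoring module's source and fail if it imports a model client. The deny-list and the function name are hypothetical, not the actual build check.

```python
import ast

# Hypothetical deny-list of model-client packages; a real build would keep its own.
BANNED_MODULES = frozenset({"openai", "anthropic", "transformers"})

def banned_imports(source: str, banned: frozenset[str] = BANNED_MODULES) -> list[str]:
    """Return the banned top-level modules that the given source code imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # "openai.types" -> "openai"
            if root in banned:
                found.add(root)
    return sorted(found)

clean = "import math\nfrom decimal import Decimal\n"
tainted = "import math\nfrom openai import OpenAI\n"

assert banned_imports(clean) == []
print(banned_imports(tainted))  # ['openai']
```

The source is only parsed, never executed, so the check runs without installing or importing any of the listed packages.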

Surviving a debate is not the same as being verified. Other tools end at the answer that won the argument. GauntletScore continues, to an auditable probability with the evidence behind it.

Grounded tells you where it came from. Verified tells you whether it's true.

Verifies against: SEC EDGAR · CourtListener · eCFR · PubMed · live web search · deterministic math verification

WHY THIS IS NECESSARY

Asking another model is not an independent check. We measured it.

Models trained on overlapping data share blind spots, so a second model confirms the first one's error with equal confidence. A second checker cannot be another model's opinion. It has to be deterministic math, or it inherits the first one's blind spots. We ran the experiment: twenty AI-generated company profiles, three independent frontier checkers, identical instructions.

9.6:1

On the same document, one frontier model flagged 48 errors and another flagged 5. Neither has a claim to being correct. The choice of checker is an unmanaged variable.

0 → 16

The harshest checker returned zero errors on three documents. Adversarial verification against primary sources then found 16 verifiable errors in them. A clean self-check report is not evidence that a document is clean.

Their answers survive an argument. Our claims survive the evidence.

HOW IT WORKS

From document to signed certificate in minutes.

Submit

Upload a document or send it through the API.

Gather evidence

Seven agents from four providers check each claim against primary sources, then defend their findings against adversarial challenge. The agents gather evidence; they do not set the score.

Score

The verdicts are combined by a deterministic Bayesian engine into a trust score, per-claim verdicts, and a cryptographically signed certificate. Typically within 5 to 10 minutes.

THE SECOND PILLAR

It tests whether the reasoning holds, not just whether the facts check out.

A document can be built from individually true facts and still argue something false. GauntletScore runs a dedicated analysis over every cause-and-effect claim, testing temporal order, proportionality, confounders, and logical structure. The result enters the same posterior with a deliberate asymmetry: sound reasoning earns only a small positive weight, because internal consistency is not external proof. Broken reasoning counts heavily against the document. Most AI operates at the level of association, what correlates with what. The causal pass operates at the levels of intervention and counterfactual: what changes what, and what would have happened otherwise.
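
The asymmetry can be pictured as two unequal likelihood ratios feeding the same log-odds sum. This is a sketch only: the two values below are hypothetical and chosen to show the shape, not taken from the scoring engine.

```python
import math

# Hypothetical likelihood ratios, chosen only to show the asymmetry.
SOUND_LR = 1.2   # sound reasoning: a small push toward trust
BROKEN_LR = 0.1  # broken reasoning: a large push away from trust

def causal_term(reasoning_is_sound: bool) -> float:
    """Log-likelihood-ratio term a causal finding adds to the posterior log-odds."""
    return math.log(SOUND_LR if reasoning_is_sound else BROKEN_LR)

print(round(causal_term(True), 3), round(causal_term(False), 3))
```

With these illustrative values, one broken causal claim outweighs more than ten sound ones.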

WHAT IT CATCHES

Errors that survive a single-model check.

Every example below is a documented catch from our pre-registered validation study, verified against primary sources.

A fabricated executive.

A generated profile named a Chief Growth Officer who does not exist. Caught against corporate filings.

An invented corporate event.

A profile described a nine-figure settlement between two companies that never occurred. Caught against court records.

A four-fold overstatement.

Research and development spending asserted at roughly four times the documented figure. Caught against the filings.

A misidentified officer.

A profile named a CFO who had already been succeeded; the actual appointment appears in the company's press releases. Caught against the corporate record.

Arithmetic that does not compute.

A financial walk asserting 1.2 + 6.6 = 12.5. Caught by the deterministic math verifier.
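
A check of this kind needs no model at all. The sketch below uses exact decimal arithmetic; the function name and the rounding tolerance are assumptions for illustration.

```python
from decimal import Decimal

def sum_matches(addends: list[str], claimed_total: str, tolerance: str = "0.05") -> bool:
    """Check whether the addends sum to the claimed total within a rounding tolerance."""
    actual = sum((Decimal(a) for a in addends), Decimal("0"))
    return abs(actual - Decimal(claimed_total)) <= Decimal(tolerance)

print(sum_matches(["1.2", "6.6"], "12.5"))  # False: the actual sum is 7.8
print(sum_matches(["1.2", "6.6"], "7.8"))   # True
```

Decimal strings are used instead of floats so that a figure such as 7.8 is compared exactly as written in the document.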

SOLUTIONS

Built for regulated industries

Purpose-built verification workflows for the industries where trust matters most.

Legal

Verify case citations against CourtListener. Flag fabricated holdings, incorrect regulatory references, and jurisdiction errors before filing.

Financial Compliance

Check executive names, M&A events, and financial figures against SEC EDGAR filings. Verify arithmetic in balance sheets and projections.

Insurance & Fraud

Validate billing codes, provider credentials, and claims arithmetic. Detect upcoding patterns, excluded providers, and geographic impossibility.

More Domains

Scientific research, VA disability claims, biotech regulatory filings, and general document verification. Same engine, domain-specific tools.

COMPARISON

How GauntletScore compares.

Built on independent verification, not single-model confidence.

Capability | General AI | RAG + Tools | GauntletScore
Multi-provider verification | - | - | Yes
Adversarial cross-examination | - | - | Yes
Primary source verification | - | Partial | Yes
Per-claim verdicts | - | - | Yes
Cryptographic proof | - | - | Yes
Air-gapped deployment | - | - | Yes
Audit-ready artifacts | - | - | Yes

VALIDATION DATA

Pre-registered, measured, and reported.

840+

verification runs

9.6:1

ratio between the highest and lowest self-check error counts from frontier models on the same document (48 versus 5)

tool-verified errors caught in Phase 1 that no self-check found

1,829

claims examined across 20 companies in Phase 2

96%

local on-premises score parity with cloud deployment (Phase 1 measurement)

Pre-registered validation study of 20 public companies, conducted with independent academic oversight. Methodology registered on the Open Science Framework, March 2026. Study in progress; final paper forthcoming.

THE CERTIFICATE

Change histories are editable. Signatures are not.

Frontier model cards now document models concealing actions and editing change histories so the changes would not appear in the record. The audit trail you need is one that neither a human nor a model can quietly rewrite. Every Gauntlet Report ships with a cryptographically signed, tamper-evident certificate recording what was checked, against which sources, and with what result. Hand it to a reviewer, a regulator, or opposing counsel; anyone can verify it has not been altered.

SOVEREIGN EDITION

Private Preview

Air-gapped deployment. Zero data egress.

Run the same verification engine on your own hardware for absolute data sovereignty.

Air-gapped or VPC deployment

Runs on your own hardware. No cloud API calls. No data leaves your network.

Zero data egress

Complete control over your infrastructure and verification processes.

Customer-managed keys

You control the cryptographic keys and all signing operations.

96% score parity

In our Phase 1 measurement, the local deployment retained 96% of cloud score quality, running entirely inside the network perimeter.

CMMC 2.0 ready

Architecture maps to CMMC 2.0 controls for government compliance.

Same Score. No Cloud.

Absolute Data Sovereignty

PRICING

Simple pricing. No subscriptions.

Pay per credit. Credits never expire. Every analysis includes the full 7-agent debate, all document types, a cryptographic certificate, and the full transcript.

Free

3 credits included

One-time. No credit card required.

Starter

$29

10 credits

$2.90/credit

Perfect for initial testing.

Pro

$69

25 credits

$2.76/credit

Best for regular use.

Business

$125

50 credits

$2.50/credit

For high-volume verification.

Longer or multi-part documents may require more than one credit.

Enterprise and Sovereign Edition pricing is available on request. Contact sales@genstrata.com.

FAQ

Frequently Asked Questions

Everything you need to know about GauntletScore and trust verification.

What counts as one credit?

One credit = one document submission. Each document is processed by 7 agents from 4 providers, cross-referenced against authoritative sources, and returned with per-claim verdicts and a signed certificate. Longer or multi-part documents may require more than one credit. Free tier includes 3 credits, one-time.

What document types do you support?

File formats: PDF, plain text, and Markdown. Document types: SEC 10-Ks, legal filings, earnings calls, research papers, insurance claims, biotech filings, VA disability claims, and anything else with verifiable claims. Submit via API, batch, or web upload.

Do you store my documents?

No. Documents are encrypted in transit, processed in memory, and deleted immediately after analysis. You receive a signed certificate of verification. Sovereign Edition customers can run air-gapped local deployment with zero data egress.

How do you handle disagreement between agents?

When agents cannot reach consensus on what the evidence shows, the claim is returned as INCONCLUSIVE, with the disagreement shown. Transparent uncertainty is more useful than false confidence.

How long does verification take?

Typical documents complete in 5-10 minutes. Each analysis triggers between 100 and 200 API calls to authoritative databases across 4 rounds of adversarial debate. Longer or multi-part documents take proportionally longer.

What about privacy and compliance?

HIPAA BAA available. GDPR DPA with SCCs available. Architecture maps to SOC 2 and CMMC 2.0 controls. All documents deleted after processing. Ed25519-signed certificates provide cryptographic proof of verification for audit trails. Sovereign Edition runs air-gapped on your own hardware with customer-managed encryption keys and zero data egress.

How do I know the agents aren't just inventing their own errors?

GauntletScore's verdicts are not based on what agents believe or estimate. During the tool verification phase (Round 0), each agent issues direct, structured queries to authoritative databases (CourtListener, SEC EDGAR, eCFR, PubMed, and others) and receives structured responses. A claim is DEBUNKED when a database returns a record that contradicts it, or when the authoritative source returns no matching record for a citation that should be there. A claim is VERIFIED when the authoritative source confirms it. The debate rounds that follow are structured around that external evidence, not around agent opinion. The full source citation for every tool query is preserved in the audit transcript, so you can verify the database result yourself.
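
The rule above can be written down as a small decision function. This is a sketch under stated assumptions: the `LookupResult` shape and its field names are hypothetical, and returning INCONCLUSIVE for a record that does not settle the claim is our reading of how insufficient evidence is handled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Verdict(Enum):
    VERIFIED = "VERIFIED"
    DEBUNKED = "DEBUNKED"
    INCONCLUSIVE = "INCONCLUSIVE"

@dataclass
class LookupResult:
    """Hypothetical shape of a structured database response."""
    record_found: bool
    confirms_claim: Optional[bool] = None  # None: the record does not settle the claim
    record_expected: bool = True           # a citation that should be there

def verdict_for(result: LookupResult) -> Verdict:
    """Map one database response to a verdict, with no model opinion involved."""
    if result.record_found:
        if result.confirms_claim is True:
            return Verdict.VERIFIED
        if result.confirms_claim is False:
            return Verdict.DEBUNKED
        return Verdict.INCONCLUSIVE
    # No matching record: debunk only when the source should contain one.
    return Verdict.DEBUNKED if result.record_expected else Verdict.INCONCLUSIVE

print(verdict_for(LookupResult(record_found=False)).value)  # DEBUNKED
```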

What happens when two agents disagree?

The structured debate is designed to surface and resolve disagreements, but not to manufacture false consensus when the evidence is genuinely ambiguous. If agents hold opposing positions after four rounds of debate and neither side can produce controlling external evidence, the claim is returned as INCONCLUSIVE. This is a distinct verdict in the scoring system, not a fallback. An INCONCLUSIVE verdict tells you that the claim could not be confirmed or refuted against available authoritative sources, which is materially different from both VERIFIED and DEBUNKED, and more useful than a false confidence score that buries the uncertainty.

Can I use GauntletScore on documents that contain confidential information?

The Cloud Edition processes documents in memory during analysis. The original document text is not stored; only its SHA-256 hash is retained for certificate verification. Temporary files are deleted after results are stored, and each analysis runs in an isolated subprocess whose memory is fully reclaimed by the OS on exit. Tenant data is isolated at the database level through PostgreSQL Row-Level Security. For organizations that cannot send documents through any external API, the Sovereign Edition runs the complete GauntletScore pipeline on your own hardware with no external network dependencies.
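
Because only the SHA-256 hash is retained, anyone holding the original file can recompute the hash and compare it with the one on the certificate. A minimal sketch, assuming the certificate exposes the digest as a hexadecimal string (the function names and that format are assumptions):

```python
import hashlib
import hmac

def document_digest(document: bytes) -> str:
    """Return the SHA-256 digest of the document as lowercase hex."""
    return hashlib.sha256(document).hexdigest()

def matches_certificate(document: bytes, certified_digest: str) -> bool:
    """Compare a recomputed digest with the digest recorded at verification time."""
    return hmac.compare_digest(document_digest(document), certified_digest.lower())

original = b"Revenue increased 8.2% year over year."
certified = document_digest(original)  # stands in for the value on a certificate

print(matches_certificate(original, certified))
print(matches_certificate(original + b" (edited)", certified))
```

Changing a single byte of the document changes the digest, so the second comparison fails.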

How is this different from running my document through multiple chatbots?

Chatbots reason about claims. GauntletScore verifies them against authoritative databases. There is no amount of reasoning that produces the same result as a live CourtListener query confirming whether a case citation resolves to a real decision.
Chatbots don't challenge each other with structured evidence. GauntletScore's four-round debate structure forces agents to defend their findings against adversarial challenge. Round 2 is specifically designed so that the adversarial skeptic challenges every significant claim and demands evidence.
The adversarial skeptic's causal_validate pipeline has no equivalent in standard chatbot review. When a document makes a causal claim, the adversarial skeptic evaluates its temporal, proportional, and logical structure, not just whether the stated facts are individually accurate.
Bayesian confidence calibration produces scores that reflect evidentiary weight, not chatbot confidence language. “I'm fairly confident this is accurate” and a calibrated 0.91 confidence score are not the same thing.
The knowledge graph means that every organization's second, tenth, and hundredth run benefits from verified facts accumulated in prior runs. Chatbots have no persistent memory of what they've verified before.
Every GauntletScore analysis returns an Ed25519-signed cryptographic certificate. If you need to demonstrate to a court, regulator, or compliance auditor that a specific document was independently verified, the certificate provides that proof. A chatbot conversation log does not.

What's your error rate?

Our pre-registered validation study of 20 public companies caught 26 tool-verified factual errors in Phase 1: cases where an authoritative database directly contradicted a claim. The count was 27 until we ran the system on our own paper and it caught one of our own examples as a false positive. We corrected our number downward and documented the discovery.
What the system catches well: structurally verifiable claims, citations, arithmetic, regulatory references, executive names and titles.
What the system does not catch: errors requiring deep domain expertise unavailable in any public database, claims that are technically accurate but selectively framed, and arguments that are logically structured but strategically misleading.
We return INCONCLUSIVE rather than forcing a verdict when evidence is insufficient.

Won't better models make verification unnecessary?

The opposite. Orchestration keeps getting absorbed into the models; verification cannot be, because independence requires staying outside the system being verified. Model cards already document frontier models silently degrading their own outputs and concealing actions. A tool cannot be the auditor of its own work. The better the models get, the more the verification layer must remain external.

My AI tool already shows its sources. Isn't that verification?

Source-grounding tells you where a claim came from; it does not tell you whether the claim is true, or whether the source says what the output claims it says. Grounded tells you where it came from. Verified tells you whether it's true. GauntletScore checks each claim against the primary source and reports the verdict with the evidence.

Has the system's precision been independently verified?

Not yet, and we will not claim otherwise. Our current estimate of 65 to 75 genuine catches among 84 raw detections is engineering-derived. Independent human evaluation under academic oversight is in progress, and we will publish the result whichever way it comes out.
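
If that estimate is read as precision (genuine catches divided by raw detections, which is an interpretation rather than a published metric), the implied range works out as follows.

```python
raw_detections = 84
genuine_low, genuine_high = 65, 75

# Precision = genuine catches / raw detections, at each end of the estimate.
precision_low = genuine_low / raw_detections
precision_high = genuine_high / raw_detections

print(f"{precision_low:.1%} to {precision_high:.1%}")  # 77.4% to 89.3%
```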

Run a document you would otherwise have to trust on faith.

Upload a file. Get a score, a 95% interval, every flagged claim with its source, and a signed certificate. Three free credits. No card. No call.