GauntletScore

The Verification Engine

Seven adversarial AI agents. Five model providers. Authoritative source verification. Cryptographic proof. One API call.

Input

  • PDFs
  • Plain text
  • Markdown
  • Word documents
  • Any length

Output

  • Trust score (0-100)
  • Per-claim verdicts
  • Source citations
  • Full transcript
  • Ed25519 signature
  • Timestamp & metadata

Why Seven Agents From Four Providers

A single AI model agreeing with itself is not verification. Our agents debate from different perspectives, using different models, creating consensus through adversarial conflict.

  • Gennie: Orchestrator — frames questions, synthesizes findings
  • Marcus: Attack Surface Analyst — strategic coherence, internal consistency
  • Thomas: Data Flow Analyst — financial accuracy, regulatory math
  • Emmy: Evidence & Rigor Analyst — citation verification, scientific consistency
  • Ada: Pattern Recognition Analyst — anomaly detection, documentation vs. behavior
  • Luca: Operational Impact Analyst — downstream consequences, deal economics
  • Pyrrho: Skeptic — challenges every finding, causal reasoning, structural plausibility

Beyond Adversarial Debate

The seven agents and four debate rounds are visible on the product page. What's less obvious is the infrastructure underneath them. Six architectural decisions separate GauntletScore from any tool that runs a document through multiple chatbots.

01

Causal Validation

Pyrrho is the only agent with access to a dedicated causal_validate pipeline. When a document claims "revenue increased because of the acquisition," Pyrrho evaluates the structural logic: did the cause precede the effect? Is the magnitude proportional? Are there confounding factors? Does the causal claim contain a logical fallacy? A document can cite real numbers and real events and still construct a false causal story.
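
A minimal sketch of the kind of structural checks a causal_validate pipeline might run. The field names, the example claim, and the proportionality threshold are illustrative assumptions, not the production pipeline.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CausalClaim:
    cause: str
    effect: str
    cause_date: date          # when the claimed cause occurred
    effect_date: date         # when the claimed effect was observed
    cause_magnitude: float    # e.g. acquisition size in USD
    effect_magnitude: float   # e.g. revenue increase in USD

def causal_validate(claim: CausalClaim) -> list[str]:
    """Return a list of structural problems; an empty list means no red flags."""
    problems = []
    # Temporal ordering: the claimed cause must precede the claimed effect.
    if claim.cause_date >= claim.effect_date:
        problems.append("cause does not precede effect")
    # Proportionality: a small cause explaining a vastly larger effect is suspect.
    if claim.cause_magnitude > 0 and claim.effect_magnitude / claim.cause_magnitude > 10:
        problems.append("effect magnitude disproportionate to cause")
    return problems

# Hypothetical claim: "revenue increased because of the acquisition",
# where the acquisition actually closed after the revenue was reported.
claim = CausalClaim("acquisition closed", "revenue increased",
                    date(2023, 6, 1), date(2023, 3, 31), 5e8, 2e9)
print(causal_validate(claim))  # ['cause does not precede effect']
```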

02

Bayesian Confidence Calibration

Every claim that receives a VERIFIED verdict carries a calibrated Bayesian confidence score. VERIFIED at 0.95 is materially different from VERIFIED at 0.62 — and the system treats them differently in scoring. Enterprise buyers need to know not just whether a claim cleared the bar, but how firmly it did so. Aggregate uncertainty is reflected in the final composite score rather than hidden behind a round number.
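
One way to read "calibrated Bayesian confidence" is a posterior over claim truth given the observed tool evidence. The prior and likelihood numbers below are made up purely to illustrate why two VERIFIED verdicts can carry very different weight.

```python
def posterior_confidence(prior: float, p_evidence_if_true: float,
                         p_evidence_if_false: float) -> float:
    """Bayes' rule: P(claim is true | evidence observed)."""
    numerator = p_evidence_if_true * prior
    denominator = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / denominator

# A strong source match (likely if the claim is true, unlikely if false)
# yields high confidence; a weak match yields a VERIFIED verdict the
# composite scorer should weight less heavily.
strong = posterior_confidence(prior=0.5, p_evidence_if_true=0.95, p_evidence_if_false=0.05)
weak = posterior_confidence(prior=0.5, p_evidence_if_true=0.70, p_evidence_if_false=0.40)
print(round(strong, 2), round(weak, 2))  # 0.95 0.64
```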

03

Persistent Knowledge Graph

Verified facts don't disappear between runs. The validation study built a 5,799-fact knowledge graph across 20 companies and 360 evaluations. When a subsequent document references a company already in the graph, the system retrieves verified data directly. Verification cost decreases 20–29% over time. Accuracy compounds with each run.
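
A rough sketch of what one persisted verified fact could look like. The schema, file format, and example values are assumptions for illustration, not the actual graph format.

```python
import json
import pathlib

# Hypothetical record shape for one verified fact in the knowledge graph.
fact = {
    "entity": "ExampleCo",                    # company or other named entity
    "attribute": "fy2023_revenue_usd",
    "value": 1_250_000_000,
    "verdict": "VERIFIED",
    "confidence": 0.93,
    "source": "SEC EDGAR filing",
    "verified_at": "2024-06-01T12:00:00Z",
}

# Appending facts to a store that survives between runs is what lets a later
# document mentioning ExampleCo skip the external query entirely.
store = pathlib.Path("knowledge_graph.jsonl")
with store.open("a") as f:
    f.write(json.dumps(fact) + "\n")
```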

04

Self-Improving Optimization

After every analysis, the orchestrator logs tool success rates, agent budget utilization, and verification failures. As of the validation study: 120 runs, 840 agent performance records, and 1,829 accumulated optimization recommendations. The system analyzes its own performance and adjusts. Organizations running GauntletScore at scale are running a system that gets measurably better with each document.
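
A small sketch of the sort of telemetry such a loop could aggregate after each run; the metric names, log shape, and threshold are assumptions.

```python
from collections import defaultdict

# Hypothetical per-run telemetry: (tool_name, succeeded) pairs logged by the orchestrator.
run_log = [
    ("courtlistener", True), ("courtlistener", True),
    ("sec_edgar", True), ("pubmed", False),
]

success = defaultdict(lambda: [0, 0])  # tool -> [successes, attempts]
for tool, ok in run_log:
    success[tool][1] += 1
    success[tool][0] += int(ok)

# A naive recommendation rule: flag tools whose success rate drops below a threshold.
recommendations = [
    f"review routing for {tool}"
    for tool, (wins, tries) in success.items()
    if wins / tries < 0.5
]
print(recommendations)  # ['review routing for pubmed']
```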

05

Tool-Augmented Verification

This is not RAG or web search. It is direct, structured queries to authoritative databases: CourtListener for federal case law, eCFR for regulations, SEC EDGAR for financial filings, PubMed for medical literature, and more — plus a mathematical proof engine for arithmetic verification. Tool-augmented verification produced a +37.1 point improvement over reasoning-only analysis in matched A/B pairs.
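
A minimal routing sketch mapping claim categories to the sources named above. How claims are actually categorized and which endpoints are queried is not documented here, so both the category names and the verifier functions are placeholders.

```python
from typing import Callable

# Placeholder verifiers; real implementations would issue structured queries
# to CourtListener, eCFR, SEC EDGAR, PubMed, and so on.
def verify_case_law(claim: str) -> dict: ...
def verify_regulation(claim: str) -> dict: ...
def verify_financial(claim: str) -> dict: ...
def verify_medical(claim: str) -> dict: ...
def verify_math(claim: str) -> dict: ...

ROUTES: dict[str, Callable[[str], dict]] = {
    "case_citation": verify_case_law,      # CourtListener
    "regulation": verify_regulation,       # eCFR
    "financial_figure": verify_financial,  # SEC EDGAR
    "medical": verify_medical,             # PubMed / ClinicalTrials.gov
    "calculation": verify_math,            # mathematical proof engine
}

def route(claim_type: str, claim_text: str) -> dict:
    """Send a claim to the verifier responsible for its category."""
    return ROUTES[claim_type](claim_text)
```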

06

Structured Four-Round Debate

Each round has a defined purpose. Round 0, the pre-debate research phase: independent research against external databases. Round 1: initial positions. Round 2: direct challenge and response with evidence. Round 3: convergence — genuine agreement documented, genuine disagreement preserved. Round 4: final votes and formal synthesis. Claims that cannot be resolved are returned as INCONCLUSIVE — transparency by design.

How It Actually Works

A complete GauntletScore analysis runs in 5–10 minutes. Here is what happens inside the engine during that time.

01

Document Intake

A document is submitted via API or the web interface. The system accepts up to 2,000,000 characters — approximately 1,000 pages — and processes the complete, unmodified text. No summarization. No truncation. The system detects the document type and activates domain-specific verification tools and agent configurations. Custom verification instructions can be appended at submission time.
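
A hedged example of what a submission might look like over HTTP. The endpoint URL, field names, and authorization header are hypothetical, since the public API surface is not specified here; only the 2,000,000-character limit and the optional custom instructions come from the description above.

```python
import requests

API_URL = "https://api.example.com/v1/analyses"  # hypothetical endpoint
MAX_CHARS = 2_000_000                            # documented intake limit

document_text = open("filing.txt", encoding="utf-8").read()  # placeholder document
if len(document_text) > MAX_CHARS:
    raise ValueError("document exceeds the 2,000,000-character intake limit")

response = requests.post(
    API_URL,
    headers={"Authorization": "Bearer <api-key>"},  # placeholder credential
    json={
        "text": document_text,
        "instructions": "Focus on the damages calculations.",  # optional custom instructions
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```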

02

Claim Extraction

Gemini Flash processes the document in overlapping chunks and extracts every verifiable factual claim: case citations, regulatory references, financial figures, named entities, dates, statistical assertions, and mathematical calculations. A parallel deterministic pass runs regular expressions to catch legal citation patterns. The two passes are merged and deduplicated. A typical 10-page document produces 50–100 claims; dense regulatory documents produce 300–500 or more.
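
A toy illustration of the deterministic second pass: a regular-expression sweep for one common legal citation pattern, merged and deduplicated with claims from the model pass. The single regex and the merge rule are simplified assumptions; a real pass would cover many citation formats.

```python
import re

# U.S. Reports citations like "347 U.S. 483" (one pattern among many a real pass would use).
CASE_CITATION = re.compile(r"\b\d{1,4}\s+U\.S\.\s+\d{1,4}\b")

def regex_pass(text: str) -> set[str]:
    return set(CASE_CITATION.findall(text))

def merge_claims(model_claims: list[str], text: str) -> list[str]:
    # Union the model-extracted claims with the deterministic hits, then deduplicate.
    return sorted(set(model_claims) | regex_pass(text))

sample = "The court relied on Brown v. Board of Education, 347 U.S. 483 (1954)."
print(merge_claims(["Brown v. Board of Education was decided in 1954"], sample))
```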

03

Knowledge Graph Lookup

Every extracted claim is checked against the persistent knowledge graph — a structured database of previously verified facts accumulated across prior runs. Claims that match existing verified entries are resolved immediately without external queries. This reduces tool costs by 20–29% for organizations running at volume.
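
A sketch of the lookup-before-query step. Keying facts by entity and attribute is an assumption about how matching works; the point is that a graph hit short-circuits the external query.

```python
def resolve_claim(claim: dict, graph: dict) -> dict | None:
    """Return a previously verified fact if the claim matches one, else None."""
    key = (claim["entity"], claim["attribute"])
    hit = graph.get(key)
    if hit and hit["verdict"] == "VERIFIED":
        return hit   # resolved with no external query; this is the source of the cost savings
    return None      # fall through to tool verification

graph = {("ExampleCo", "fy2023_revenue_usd"): {"verdict": "VERIFIED", "value": 1_250_000_000}}
claim = {"entity": "ExampleCo", "attribute": "fy2023_revenue_usd", "value": 1_250_000_000}
print(resolve_claim(claim, graph))
```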

04

Tool Verification

Each new claim is routed to the appropriate authoritative database. CourtListener for case law. eCFR for federal regulations. SEC EDGAR for financial data. PubMed and ClinicalTrials.gov for medical claims. A mathematical proof engine evaluates arithmetic: balance sheet totals, damage calculations, percentage figures. Each tool returns a structured result with source citation and confidence rating.
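
A minimal illustration of the arithmetic side of this step: checking a claimed total against its components and emitting a structured result. The result fields, tolerance, and confidence values are assumptions made for the sketch.

```python
def verify_arithmetic(components: list[float], claimed_total: float,
                      tolerance: float = 0.005) -> dict:
    """Check a claimed sum (e.g. a balance sheet total) against its line items."""
    actual = sum(components)
    ok = abs(actual - claimed_total) <= tolerance * max(abs(claimed_total), 1.0)
    return {
        "verdict": "VERIFIED" if ok else "DEBUNKED",
        "source": "math verifier",
        "expected": actual,
        "claimed": claimed_total,
        "confidence": 0.99 if ok else 0.95,  # illustrative confidence values
    }

print(verify_arithmetic([120.0, 80.0, 55.5], 255.5))  # VERIFIED
print(verify_arithmetic([120.0, 80.0, 55.5], 300.0))  # DEBUNKED
```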

05

Round 1 — Initial Analysis

Six specialist agents — Marcus, Thomas, Emmy, Ada, Luca, and Pyrrho — receive tool verification results and the document scope. Each analyzes from its designated lens: strategic coherence, financial accuracy, evidentiary rigor, pattern recognition, operational impact, and skeptical scrutiny. Each produces an independent research brief before seeing any other agent's findings. Gennie frames the key questions and identifies areas of disagreement.

06

Round 2 — Cross-Examination

Pyrrho, the dedicated skeptic, challenges every significant finding from Round 1. Claims marked VERIFIED are subjected to adversarial scrutiny. Causal claims are routed through the causal_validate pipeline — checking temporal ordering, proportionality, and logical structure. Other agents defend or revise their findings with evidence. This is structured adversarial pressure, not collaborative refinement.

07

Round 3 — Rebuttal

Agents respond to challenges with evidence from tool verification and the knowledge graph. Claims that cannot withstand scrutiny are downgraded. Claims that survive are reinforced with additional source citations. Genuine multi-agent consensus is documented. Persistent disagreements — where agents hold opposing evidence-backed positions — are preserved intact.

08

Round 4 — Final Synthesis

Gennie synthesizes all four rounds into per-claim verdicts. Bayesian confidence calibration is applied. Each claim receives one of four verdicts: VERIFIED (confirmed against authoritative sources), DEBUNKED (contradicted by source data), INCONCLUSIVE (insufficient or conflicting evidence), or NOT_FOUND (not locatable in available databases).
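
The four verdicts form a small closed set; an enum like the following captures it. The enum itself is an illustration, not the API's published type.

```python
from enum import Enum

class Verdict(Enum):
    VERIFIED = "confirmed against authoritative sources"
    DEBUNKED = "contradicted by source data"
    INCONCLUSIVE = "insufficient or conflicting evidence"
    NOT_FOUND = "not locatable in available databases"
```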

09

Scoring and Certification

A composite score (0–100) is calculated across six weighted components: citation verification quality (25 pts), skeptic rigor (20 pts), cross-persona consensus (15 pts), evidence quality (15 pts), document grounding (15 pts), and debate completeness (10 pts). An Ed25519 cryptographic certificate is generated containing the document hash, transcript hash, score, and timestamp, giving any third party a non-repudiable record that neither the document nor the verdict has been altered.
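
The component weights sum to 100, so the composite score is a straight weighted sum of per-component ratings. The sketch below shows that arithmetic plus what generating and checking an Ed25519 certificate involves, using the cryptography library's Ed25519 primitives. The rating inputs and the certificate payload format are assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

WEIGHTS = {  # maximum points per component (sums to 100)
    "citation_verification": 25,
    "skeptic_rigor": 20,
    "cross_persona_consensus": 15,
    "evidence_quality": 15,
    "document_grounding": 15,
    "debate_completeness": 10,
}

def composite_score(ratings: dict[str, float]) -> float:
    """ratings holds a 0.0-1.0 rating per component; the result is on a 0-100 scale."""
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)

certificate = {
    "document_sha256": hashlib.sha256(b"<document bytes>").hexdigest(),
    "transcript_sha256": hashlib.sha256(b"<debate transcript>").hexdigest(),
    "score": composite_score({name: 0.9 for name in WEIGHTS}),
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

signing_key = Ed25519PrivateKey.generate()
payload = json.dumps(certificate, sort_keys=True).encode()
signature = signing_key.sign(payload)

# Any third party holding the public key can confirm the certificate was not altered.
signing_key.public_key().verify(signature, payload)  # raises InvalidSignature if tampered
```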

Authoritative Verification Sources

  • CourtListener
  • eCFR
  • SEC EDGAR
  • PubMed
  • ClinicalTrials.gov
  • CrossRef
  • 38 CFR Part 4
  • Math Verifier

What the Score Means

A (90-100): Highly Trustworthy

Strong consensus across agents. Minor discrepancies only.

B (80-89): Generally Trustworthy

Solid consensus. Some contested claims or inconclusive verdicts.

C (70-79): Proceed with Caution

Moderate disagreement. Significant inconclusive verdicts. Requires review.

D (60-69): High Risk

Substantial disagreement. Multiple DEBUNKED claims. Professional judgment essential.

F (Below 60): Not Trustworthy

Severe consensus failure or majority DEBUNKED verdicts. Do not rely on this document.
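
The letter bands map directly onto score ranges, so a small lookup reproduces the table above.

```python
def grade(score: float) -> str:
    bands = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]
    return next((letter for cutoff, letter in bands if score >= cutoff), "F")

assert grade(93) == "A" and grade(74) == "C" and grade(42) == "F"
```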

Ready to verify documents?

GauntletScore provides assistive verification and is not a substitute for professional judgment.