Cross-source confidence scoring: catching wrong extractions before they post

Here's the quiet problem with single-engine invoice capture: it always returns an answer, and it always sounds equally sure. The OCR step reads "$8,415.60" for the invoice total whether that figure is crisp and unambiguous or smudged and half-guessed. The number lands in the field, the invoice flows toward approval, and nobody knows which extractions were solid and which were coin-flips.

The cost of that uncertainty is asymmetric. A wrong supplier name is annoying. A wrong amount that auto-approves and posts to the ERP is a payment error — and the whole point of touchless AP is to post without a human checking. You can't responsibly automate approval if you can't tell a confident extraction from a shaky one.

Why a single source can't score itself

Most extraction engines emit some internal confidence signal, but it's the model grading its own homework. A model that misreads a field can be entirely confident it read it correctly, because the confidence reflects how cleanly the input matched its own pattern, not whether the answer is right.

The way out is independent corroboration. Run two different extraction sources over the same document and compare their answers:

A semantic engine (a document-AI model, for example Amazon Bedrock Data Automation paired with a Claude model) is strong at understanding context: which value is the grand total versus a subtotal, what the supplier name is, how to read non-standard layouts.
A structural engine (a forms/expense analyzer such as Amazon Textract) is strong at the mechanical read: the literal characters in a field, the table grid, the line-item rows.

Run them in parallel, not in sequence — neither waits on the other, so you get two opinions in roughly the time one would take. Then the interesting question becomes: do they agree?
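
As a rough illustration of the parallel run, here is a minimal Python sketch using a thread pool. The two extractor functions, their return shapes, and the sample values are hypothetical stand-ins, not real service calls; in practice each would wrap a request to the corresponding engine.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple

Fields = Dict[str, str]

# Hypothetical stand-ins for the two engines. Real code would call the
# document-AI service and the forms/expense analyzer here.
def extract_semantic(document: bytes) -> Fields:
    return {"supplier_name": "Example Supplier Inc", "total": "$8,415.60"}

def extract_structural(document: bytes) -> Fields:
    return {"total": "8415.60", "tax": "612.04"}

def extract_both(document: bytes) -> Tuple[Fields, Fields]:
    """Run both extractors concurrently and return (semantic, structural)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        semantic_future = pool.submit(extract_semantic, document)
        structural_future = pool.submit(extract_structural, document)
        # Neither call waits on the other; the total wait is roughly
        # the duration of the slower engine.
        return semantic_future.result(), structural_future.result()

semantic, structural = extract_both(b"placeholder invoice bytes")
```

Threads suit this case because both calls would spend their time waiting on network I/O rather than computing.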

Agreement as the confidence signal

The core insight is that agreement between two independent sources is a far better confidence signal than either source's self-reported certainty. When a semantic model and a structural analyzer both land on "$8,415.60" for the total, that's strong evidence. When one says "$8,415.60" and the other says "$8,415.50," that disagreement is exactly the flag you want.

A practical scoring model assigns per-field confidence based on the relationship between the two sources:

Both sources agree → highest confidence. For numeric fields, prefer the structural read; for semantic fields like supplier name or description, prefer the semantic read.
Both sources present but disagree → low confidence. Keep a value, but flag the field for review.
Only one source extracted the field → moderate confidence. Usable, but never treated as certain.
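
The three rules above can be sketched in Python as follows. This is one possible reading, not a definitive implementation: the field names, the confidence labels, the handling of a field that neither source found, and the choice to keep the preferred source's value on disagreement are all assumptions. The comparison here is deliberately simplistic (trimmed, case-insensitive string equality).

```python
from dataclasses import dataclass
from typing import Optional

NUMERIC_FIELDS = {"total", "tax", "amount_due"}  # assumed field names

@dataclass
class FieldScore:
    value: Optional[str]
    confidence: str      # "high", "moderate", "low", or "none"
    needs_review: bool

def score_field(name: str, semantic: Optional[str], structural: Optional[str]) -> FieldScore:
    if semantic is None and structural is None:
        # Assumption: a field neither source found is sent to review.
        return FieldScore(None, "none", True)
    if semantic is None or structural is None:
        # Only one source extracted the field: usable, never certain.
        found = semantic if structural is None else structural
        return FieldScore(found, "moderate", False)
    # Structural read for numeric fields, semantic read for the rest.
    preferred = structural if name in NUMERIC_FIELDS else semantic
    if semantic.strip().casefold() == structural.strip().casefold():
        return FieldScore(preferred, "high", False)
    # Disagreement: keep a value (assumed: the preferred one) but flag it.
    return FieldScore(preferred, "low", True)
```

For example, `score_field("total", "8415.60", "8415.50")` yields low confidence with the review flag set, while a supplier name seen by only one engine comes back as moderate.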

Comparison has to be normalized, not literal. $8,415.60, 8415.60, and 8,415.6 are the same number: strip currency symbols and thousands separators, then compare as floats with a tolerance well below one cent, so that a genuine ten-cent difference still registers as disagreement. Text fields compare case-insensitively after trimming whitespace. Without normalization you'd constantly flag agreement as disagreement and drown the review queue.
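
A minimal Python sketch of that normalization, under a few assumptions: amounts use US-style formatting (comma as thousands separator, period as decimal point), the half-cent tolerance is an illustrative choice, and the function names are hypothetical.

```python
import math
import re
from typing import Optional

def parse_amount(raw: str) -> Optional[float]:
    """Strip currency symbols, commas, and spaces; return None if unparseable."""
    cleaned = re.sub(r"[^\d.\-]", "", raw)
    try:
        return float(cleaned)
    except ValueError:
        return None

def amounts_match(a: str, b: str, tolerance: float = 0.005) -> bool:
    """Compare two amount strings numerically with an absolute tolerance."""
    x, y = parse_amount(a), parse_amount(b)
    if x is None or y is None:
        return False  # an unparseable amount never counts as agreement
    return math.isclose(x, y, rel_tol=0.0, abs_tol=tolerance)

def text_match(a: str, b: str) -> bool:
    """Compare text fields case-insensitively after trimming whitespace."""
    return a.strip().casefold() == b.strip().casefold()
```

With these helpers, "$8,415.60" and "8,415.6" compare as equal, while "$8,415.60" and "$8,415.50" do not. Documents that use a comma as the decimal separator would need locale-aware parsing before this step.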

From per-field scores to a route decision

Per-field confidence rolls up into an overall agreement score: of the fields that both sources extracted and could therefore be compared, what fraction matched. That single number, combined with a couple of hard rules, drives the gating decision:

An invoice is flagged for review if the overall score falls below a threshold, or if any monetary field (total, tax, amount due) scored low — even when everything else agrees. A disagreement on an amount is never acceptable in a touchless path, regardless of how good the rest of the extraction looks.
An invoice is eligible for touchless / auto-approval only when the score clears the threshold, no review flag is set, and the other business rules (it's a PO invoice, the amount is within tolerance) hold.
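
The roll-up and the two gating rules might look like the following Python sketch. The 0.90 threshold, the field names, the confidence labels, and the "standard_approval" route for invoices that are not flagged but fail a business rule are all assumptions made for illustration; the text above does not specify them.

```python
from typing import Dict

MONETARY_FIELDS = {"total", "tax", "amount_due"}  # assumed field names
REVIEW_THRESHOLD = 0.90                           # assumed; tune per deployment

def agreement_score(field_confidence: Dict[str, str]) -> float:
    """Fraction of comparable fields (both sources present) that matched."""
    compared = [c for c in field_confidence.values() if c in ("high", "low")]
    if not compared:
        return 0.0  # assumption: nothing comparable means no evidence of agreement
    return sum(c == "high" for c in compared) / len(compared)

def route(field_confidence: Dict[str, str],
          is_po_invoice: bool,
          amount_within_tolerance: bool) -> str:
    score = agreement_score(field_confidence)
    # Hard rule: a disagreement on any monetary field forces review,
    # no matter how well the other fields agree.
    money_disagrees = any(field_confidence.get(f) == "low" for f in MONETARY_FIELDS)
    if score < REVIEW_THRESHOLD or money_disagrees:
        return "review"
    if is_po_invoice and amount_within_tolerance:
        return "touchless"
    return "standard_approval"  # hypothetical route: not flagged, not touchless
```

Note that an invoice with 19 of 20 comparable fields in agreement clears the assumed threshold, yet still goes to review if the one disagreeing field is the total.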

The result is a queue that sorts itself. High-agreement invoices flow straight through. The genuinely ambiguous ones — and only those — land in front of a human, with the specific disagreeing field highlighted so the reviewer fixes one value instead of re-keying the whole invoice.

The economics of gating

This is the lever that makes touchless AP both fast and safe. Without confidence scoring you face a bad choice: auto-approve everything (and post the occasional wrong amount) or review everything (and you've automated nothing). Cross-source scoring lets you auto-approve the invoices where two independent engines agree, and reserve human attention for the small slice where they don't — which is precisely where human attention is worth paying for.

It also compounds with a self-improving extraction loop: the fields that get flagged and corrected are the same corrections that feed back into better extraction next time, so the review queue shrinks as the system learns your suppliers.

Where this fits

Confidence scoring is one expression of a broader principle behind ERP-native AP automation: real validation means reasoning about your extraction, not blindly trusting a single field. The same discipline shows up when posting into Oracle EBS and Fusion, where a wrong amount isn't a cosmetic glitch — it's a payment that has to be unwound. And it pairs naturally with the full taxonomy of invoice scenarios a healthcare AP team handles on Fusion, where catching the wrong extraction early is the difference between automation and rework.

Where EZ Cloud fits

EZ Cloud runs two independent extraction sources over every invoice in parallel and scores their agreement field by field. Where they agree, the invoice is eligible to flow straight through; where they disagree — especially on a monetary field — it's routed for review with the specific field flagged. You get the speed of touchless processing without betting your payments on a single engine's self-assessment.

If your AP automation can't tell you which extractions to trust, you're either reviewing everything or posting the occasional error you'll only find at reconciliation. That's the gap confidence scoring closes.