VLMBench · A Document Extraction Benchmark

How well do frontier models read documents?

Document-grounded vision-language model evaluation — measuring extraction accuracy across leases, financial statements, congressional records, and scientific diagrams. 550 questions, 30 runs per model, every answer graded.

5 models
550 questions
82,500 graded cells
v1 · 2026

What this measures

VLMBench is not a universal ranking of model intelligence. It measures reliability on a specific class of document-grounded VLM tasks: reading values off a page image and returning them accurately.

Method

Every model received the same image evidence and the same question prompt. Each question was run 30 times per model. Answers are graded against a golden answer, with acceptable variants separated from material failures. The goal is to study reliability and failure modes, not to crown a smartest model.
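The headline counts follow directly from this design. As a minimal sketch, assuming every model is run on every question for every run (the integer IDs are placeholders, not the benchmark's real identifiers), the grid of graded cells can be enumerated like this:

```python
from itertools import product

# Counts taken from the page; the integer IDs below are placeholders only.
N_MODELS = 5
N_QUESTIONS = 550
N_RUNS = 30

def build_cells(n_models, n_questions, n_runs):
    """Enumerate every (model, question, run) triple that receives a grade."""
    return list(product(range(n_models), range(n_questions), range(n_runs)))

cells = build_cells(N_MODELS, N_QUESTIONS, N_RUNS)
cells_per_model = N_QUESTIONS * N_RUNS

print(len(cells))        # 82500
print(cells_per_model)   # 16500
```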

Overall ranking

Acceptable answer rate

The share of answers that were either an exact match or a materially correct variation, out of 16,500 graded cells per model for a full evaluation (550 questions × 30 runs). Gemini 3.1 Pro is shown with partial results (5 of 30 runs) pending full evaluation.
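The rate itself is a simple ratio. The sketch below assumes each graded cell carries one of three verdict labels; the label strings and the function name are illustrative, not the benchmark's actual schema.

```python
from collections import Counter

# Assumed verdict labels; the benchmark's internal names may differ.
ACCEPTABLE = {"exact_match", "materially_correct"}

def acceptable_rate(verdicts):
    """Share of graded cells whose verdict counts as acceptable."""
    if not verdicts:
        raise ValueError("no graded cells to score")
    counts = Counter(verdicts)
    return sum(counts[v] for v in ACCEPTABLE) / len(verdicts)

# Made-up verdicts for four cells: three acceptable, one failure.
sample = ["exact_match", "materially_correct", "materially_wrong", "exact_match"]
print(acceptable_rate(sample))  # 0.75
```

For a partially evaluated model, the denominator is the number of cells actually graded, so its rate rests on fewer observations than a full run.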

How models fail

Failure mode breakdown

Among materially wrong answers, the kind of error each model made. Same overall verdict, very different pathologies.
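A breakdown of this kind can be computed by tallying a failure label per materially wrong answer. The sketch below uses four assumed labels based on the failure kinds the grading rubric names (wrong value, wrong magnitude, miscount, refusal). The benchmark's real taxonomy may differ, and the data is made up.

```python
from collections import Counter

# Assumed labels; the benchmark's actual failure taxonomy is not specified here.
FAILURE_MODES = ("wrong_value", "wrong_magnitude", "miscount", "refusal")

def failure_breakdown(modes):
    """Return each failure mode's share of a model's materially wrong answers."""
    counts = Counter(modes)
    unknown = set(counts) - set(FAILURE_MODES)
    if unknown:
        raise ValueError(f"unrecognized failure modes: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {mode: 0.0 for mode in FAILURE_MODES}
    return {mode: counts[mode] / total for mode in FAILURE_MODES}

# Two hypothetical models with the same number of failures but different profiles.
model_a = ["wrong_value"] * 6 + ["refusal"] * 2
model_b = ["miscount"] * 5 + ["wrong_magnitude"] * 3

print(failure_breakdown(model_a))
print(failure_breakdown(model_b))
```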

Consistency vs. correctness

Confidently right, or confidently wrong?

Each model was run 30× per question. The danger zone is the bottom-right: high agreement and low accuracy, meaning a model that reliably gives the same wrong answer.
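One way to place a question on such a plot is sketched below. It assumes agreement is the share of runs that return the most common answer and accuracy is the share of runs graded acceptable. The page does not state how agreement is computed, so treat this definition as an illustration.

```python
from collections import Counter

def agreement_and_accuracy(answers, acceptable):
    """Score one question across its repeated runs.

    answers:    normalized answer strings, one per run
    acceptable: parallel booleans, True where the run was graded acceptable
    """
    if not answers or len(answers) != len(acceptable):
        raise ValueError("answers and acceptable must be non-empty and equal length")
    _, modal_count = Counter(answers).most_common(1)[0]
    agreement = modal_count / len(answers)       # assumed: share of the modal answer
    accuracy = sum(acceptable) / len(answers)    # share of acceptable runs
    return agreement, accuracy

# Made-up example: 27 of 30 runs repeat the same wrong value.
answers = ["1200"] * 27 + ["1250"] * 3
acceptable = [False] * 27 + [True] * 3
print(agreement_and_accuracy(answers, acceptable))  # (0.9, 0.1)
```

A point at (0.9, 0.1) sits in the danger zone: the model is consistent, but consistently wrong.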

How grading works

Three verdicts

Every answer is judged against a golden answer. Two of the three verdicts, exact match and materially correct, count as acceptable, because exactness and correctness are different bars.

Exact match

The answer is a normalized string match to the golden answer. Deterministic, no judgment needed.
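The normalization rules are not spelled out here, so the following is only a sketch under assumed rules: Unicode NFKC folding, case folding, and whitespace collapsing. The function names are hypothetical.

```python
import unicodedata

def normalize(text):
    """Assumed normalization: NFKC, case-insensitive, whitespace collapsed."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.casefold().split())

def is_exact_match(answer, golden):
    """Deterministic comparison of normalized strings; no judgment involved."""
    return normalize(answer) == normalize(golden)

print(is_exact_match("  Acme  Corp. ", "acme corp."))  # True
print(is_exact_match("$1,200", "1200"))                # False
```

Under these assumed rules, the second pair is not an exact match even though the values agree. That is the kind of case the materially correct verdict exists for.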

Materially correct

Presentationally different but correct — a dropped unit, a spelled-out number, a reworded list. Still acceptable.

Materially wrong

A real failure: a wrong value, a wrong magnitude, a miscount, or a refusal to answer when the content is visible on the page.