Document-grounded vision-language model evaluation — measuring extraction accuracy across leases, financial statements, congressional records, and scientific diagrams. 550 questions, 30 runs per model, every answer graded.
VLMBench is not a universal ranking of model intelligence. It measures reliability on a specific class of document-grounded VLM tasks: reading values off a page image and returning them accurately.
Every model received the same image evidence and the same question prompt. Each question was run 30 times per model. Answers are graded against a golden value, with acceptable variants separated from material failures. The goal is to study reliability and failure modes — not to crown a smartest model.
The share of answers that were either an exact match or a materially correct variation, out of 16,500 graded cells per model (550 questions × 30 runs). Gemini 3.1 Pro is shown as a partial result (5 of 30 runs) pending full evaluation.
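As a rough illustration of how this share could be computed from graded cells, here is a minimal sketch. The verdict labels (`exact`, `variant`, `failure`) and the function name are assumptions made for the example, not the benchmark's own identifiers, and the counts are made-up toy data.

```python
from collections import Counter

# Hypothetical verdict labels; the benchmark's own names may differ.
ACCEPTABLE = {"exact", "variant"}

def acceptable_rate(verdicts):
    """Return the share of graded cells whose verdict is acceptable."""
    if not verdicts:
        raise ValueError("no graded cells")
    counts = Counter(verdicts)
    return sum(counts[v] for v in ACCEPTABLE) / len(verdicts)

# Toy data only: 550 questions x 30 runs = 16,500 cells for one model.
cells = ["exact"] * 12000 + ["variant"] * 3000 + ["failure"] * 1500
assert len(cells) == 550 * 30
print(f"{acceptable_rate(cells):.1%}")  # 90.9%
```

A model evaluated on fewer runs would simply pass a shorter list, so its rate rests on fewer cells.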
Among materially wrong answers, the kind of error each model made. Same overall verdict, very different pathologies.
Each model was run 30 times per question. The danger zone is the bottom-right, where agreement is high and accuracy is low: a model that reliably gives the same wrong answer.
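The text does not define agreement precisely. One plausible reading, sketched below, takes agreement as the share of runs that returned the most common answer, and accuracy as the share of runs graded acceptable. The function names and the danger-zone thresholds are illustrative assumptions, and the example answers are invented.

```python
from collections import Counter

def agreement_and_accuracy(runs):
    """runs: list of (answer, is_acceptable) pairs for one question.

    Agreement is taken here as the share of runs that gave the most
    common answer; accuracy is the share of runs graded acceptable.
    """
    answers = Counter(answer for answer, _ in runs)
    modal_count = answers.most_common(1)[0][1]
    agreement = modal_count / len(runs)
    accuracy = sum(ok for _, ok in runs) / len(runs)
    return agreement, accuracy

def in_danger_zone(agreement, accuracy, min_agreement=0.9, max_accuracy=0.1):
    # Thresholds are illustrative, not taken from the benchmark.
    return agreement >= min_agreement and accuracy <= max_accuracy

# Made-up example: 29 of 30 runs return the same wrong value.
runs = [("$1,200", False)] * 29 + [("$12,000", True)]
agreement, accuracy = agreement_and_accuracy(runs)
print(in_danger_zone(agreement, accuracy))  # True
```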
Every answer is judged against a golden answer. Two of the three verdicts count as acceptable, because exactness and correctness are different bars: an answer can be correct without matching the golden string exactly.
The answer is a normalized string match to the golden answer. Deterministic, no judgment needed.
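The normalization rules are not spelled out here. A minimal sketch of what a deterministic check of this kind might look like, assuming Unicode NFKC normalization, case-folding, and whitespace collapsing (all three are assumptions, as are the function names):

```python
import re
import unicodedata

def normalize(text):
    """Illustrative normalization: Unicode NFKC, case-folding, and
    whitespace collapsing. The benchmark's actual rules are not stated."""
    text = unicodedata.normalize("NFKC", text)
    text = text.casefold()
    return re.sub(r"\s+", " ", text).strip()

def is_exact_match(answer, golden):
    """Return True when both strings are identical after normalization."""
    return normalize(answer) == normalize(golden)

print(is_exact_match("  Net  Income ", "net income"))  # True
print(is_exact_match("1,200", "1200"))                 # False
```

Under these assumed rules, an answer such as `1,200` against a golden `1200` would not count as an exact match and would be left to the judgment-based verdicts.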
Presentationally different but correct — a dropped unit, a spelled-out number, a reworded list. Still acceptable.
A real failure: wrong value, wrong magnitude, a miscount, or refusing to answer visible content.