Are large vision-language models (VLMs) truly effective? In this work, we show that popular VLMs still struggle with questions about natural images that humans can easily answer, which we term natural adversarial samples. Unlike previous VQA benchmarks such as MME that can be addressed by blind QA models, NaturalBench avoids such shortcuts by pairing each question with two images that yield different answers. Using a simple procedure that applies foundation models like CLIP and ChatGPT to natural image-text corpora (e.g., Flickr and DOCCI), we collect NaturalBench, a new vision-centric VQA benchmark with 10,000 human-verified samples for reliably evaluating VLMs. We note several interesting findings:
There are already many challenging visual-question-answering (VQA) benchmarks like MMMU, ScienceQA, and MMBench, so why NaturalBench? It turns out that these popular VQA benchmarks aren't as "visual" as they seem: many questions can be answered using commonsense priors, without relying on the image. For instance, ScienceQA-IMG asks "What is the capital of Massachusetts?", which a blind ChatGPT easily answers as "Boston". Below, we highlight several such examples from six previous benchmarks (MME, MMBench, ScienceQA, MMMU, MMStar, AI2D) that can be solved without looking at the image:
What's more, even carefully constructed benchmarks like MME and MMStar can suffer from imbalanced answers. For example, MME's question "Does this artwork exist in the form of a painting?" is answered "Yes" 97.5% of the time! We show that finetuning a "blind" GPT-3.5 — using only text and no images — on a random half of each benchmark allows it to significantly outperform random chance (see the red dotted line) on the other half. In many cases, blind GPT-3.5 even matches or surpasses LLaVA-1.5 finetuned on the same data but with images!
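To see how little visual grounding such benchmarks require, here is a minimal sketch of the blind test, assuming each benchmark is available as a list of (question, answer) string pairs. The paper finetunes a text-only GPT-3.5; the TF-IDF plus logistic-regression classifier below is a far weaker stand-in, yet it already exploits answer imbalance and question-phrasing priors without ever seeing an image.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def blind_accuracy(samples, seed=0):
    """samples: list of (question, answer) string pairs from one benchmark.

    Trains a text-only classifier on a random half and reports accuracy on the other half.
    """
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    half = len(samples) // 2
    train, test = samples[:half], samples[half:]
    # A "blind" model: it only ever sees the question text, never the image.
    blind = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    blind.fit([q for q, _ in train], [a for _, a in train])
    return blind.score([q for q, _ in test], [a for _, a in test])

# Compare the returned accuracy with the benchmark's chance level (e.g., 50% for
# yes/no questions); a large gap signals blind shortcuts in the benchmark.
```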
These issues drive us to create NaturalBench, a vision-centric VQA benchmark immune to blind solutions.
NaturalBench prevents blind solutions by including two questions and two images per sample, with answers that alternate across the pairings. This ensures that a blind model which gives the same answer regardless of the image or question cannot succeed. We use a semi-automated pipeline to collect NaturalBench:
Given a natural image-text dataset like Flickr30K, (1) we first identify confounding image-text pairs that fail VLMs like CLIP, such as when CLIP mismatches an image with another image’s caption. (2) Next, we prompt ChatGPT to generate questions with different answers for each image, using the provided captions. (3) Finally, human annotators filter out incorrect VQA samples. Unlike previous adversarial benchmarks, NaturalBench does not perturb images or questions but produces natural adversarial samples: questions about natural images that are challenging for VLMs but easy for humans. Below are some examples from NaturalBench, comparing the ground-truth answers with predictions from leading VLMs like GPT-4o, Qwen2-VL, Llama3.2-Vision, and Molmo:
As shown, while humans easily solve these simple questions about natural images (with over 90% accuracy), state-of-the-art models, including the closed-source GPT-4o, still struggle.
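To make step (1) of the collection pipeline concrete, below is a minimal sketch using Hugging Face transformers' CLIP; the checkpoint and the filtering rule are illustrative assumptions rather than the exact setup used to build NaturalBench.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")


def is_confounding_pair(image_a: Image.Image, caption_a: str,
                        image_b: Image.Image, caption_b: str) -> bool:
    """Return True if CLIP prefers the wrong caption for at least one of the two images."""
    inputs = clip_processor(text=[caption_a, caption_b], images=[image_a, image_b],
                            return_tensors="pt", padding=True)
    with torch.no_grad():
        sim = clip(**inputs).logits_per_image  # (2 images, 2 captions) similarity matrix
    # An off-diagonal score beating the diagonal one means CLIP mismatches that image's caption.
    return bool(sim[0, 1] > sim[0, 0] or sim[1, 0] > sim[1, 1])

# Usage: scan (image, caption) pairs from Flickr30K / DOCCI and keep the confounding ones
# as seeds for ChatGPT question generation (step 2), followed by human verification (step 3).
```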
We now show that NaturalBench poses a significant challenge to 53 state-of-the-art VLMs. To better understand model performance beyond binary VQA accuracy (Acc), we introduce three additional metrics that exploit NaturalBench's paired image-question format.
As shown in the table, all models fall well short of human performance. Even the latest and strongest models, such as BLIP-3 (XGen-MM), Cambrian-1, LLaVA-OneVision, Llama3.2-Vision, Molmo, and Qwen2-VL, lag 55% to 70% behind humans. The best closed-source model, GPT-4o, is still 52% behind.
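Because every NaturalBench sample is a 2x2 grid of (image, question) pairings, these paired metrics are straightforward to compute. Below is a minimal sketch in the spirit of the paper's question/image/group accuracies; the function and field names are illustrative, and the input is assumed to be one grid of correctness flags per sample.

```python
def paired_metrics(groups):
    """groups: one 2x2 grid of correctness flags per NaturalBench sample, where
    c[i][j] is True if the model answered question j correctly for image i."""
    n = len(groups)
    acc = q_acc = i_acc = g_acc = 0
    for c in groups:
        acc += sum(c[0]) + sum(c[1])                            # plain accuracy: 4 answers per sample
        q_acc += (c[0][0] and c[1][0]) + (c[0][1] and c[1][1])  # question answered right for BOTH images
        i_acc += (c[0][0] and c[0][1]) + (c[1][0] and c[1][1])  # BOTH questions right for an image
        g_acc += all(c[0]) and all(c[1])                        # all four (image, question) pairs right
    return {"Acc": acc / (4 * n), "Q-Acc": q_acc / (2 * n),
            "I-Acc": i_acc / (2 * n), "G-Acc": g_acc / n}

# A blind model that always answers "Yes" reaches Acc = 50% but scores 0 on the paired
# metrics, because the ground-truth answer flips between the two images of every sample:
print(paired_metrics([[[True, False], [False, True]]]))
```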
Biases: NaturalBench exposes VLMs' biases towards certain answers, such as "Yes" and "B", regardless of the input image and question. To quantify this, we perform a scoring-based evaluation using answer likelihood (VQAScore), comparing the likelihood of the correct (image, question, answer) triple against the incorrect ones. Such debiasing can double or triple performance, even for GPT-4o, suggesting that NaturalBench is a useful benchmark for evaluating methods that reduce biases (or hallucinations) in VLMs.
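As a concrete illustration of scoring-based evaluation, here is a minimal sketch for an open-weight VLM served through Hugging Face transformers. It is a simplified stand-in for VQAScore-style likelihood scoring, not the paper's implementation: rather than trusting the free-form generation, it compares the model's likelihood of each candidate answer, which removes any constant "Yes" bias. The checkpoint, prompt template, and first-token approximation are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"   # any open-weight VLM with a chat template works similarly
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)


def pick_answer(image: Image.Image, question: str, candidates=("Yes", "No")) -> str:
    """Choose the candidate answer the model assigns the higher likelihood to."""
    prompt = f"USER: <image>\n{question} Answer with Yes or No. ASSISTANT:"
    inputs = processor(images=image, text=prompt,
                       return_tensors="pt").to(model.device, torch.float16)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits for the first answer token
    # Score each candidate by its first token's logit and take the argmax; this compares
    # the two answers directly instead of trusting whatever the model would generate.
    ids = [processor.tokenizer.encode(c, add_special_tokens=False)[0] for c in candidates]
    scores = {c: next_token_logits[i].item() for c, i in zip(candidates, ids)}
    return max(scores, key=scores.get)

# Usage: pick_answer(Image.open("photo.jpg"), "Is the person holding an umbrella?")
```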
Since benchmarks often leak into foundation models' training data, it is crucial to refresh them with new data sources. Our curation method can be easily adapted to new image-text datasets. We expand NaturalBench by incorporating two recently proposed datasets: (1) DOCCI, with fine-grained captions over 100 words long, and (2) XM3600, with captions in Chinese and Hindi. We hope our efforts will inspire future work on the dynamic evaluation of VLMs.
Beyond model evaluation, NaturalBench is also useful for model development. For example, we show in the paper that vision finetuning of GPT-4o and LLaVA on one half of NaturalBench can significantly boost their performance on the other half:
@inproceedings{naturalbench,
title={NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples},
author={Li, Baiqi and Lin, Zhiqiu and Peng, Wenxuan and Nyandwi, Jean de Dieu and Jiang, Daniel and Ma, Zixian and Khanuja, Simran and Krishna, Ranjay and Neubig, Graham and Ramanan, Deva},
booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
}