[NeurIPS 2024 D&B]

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

1Carnegie Mellon University, 2University of Washington
*Co-first authors, &Co-senior authors

Abstract

Are large vision-language models (VLMs) truly effective? In this work, we show that popular VLMs still struggle with questions about natural images that humans can easily answer, which we term natural adversarial samples. Unlike previous VQA benchmarks such as MME that can be solved by "blind" QA models, NaturalBench avoids such shortcuts by pairing each question with two images that yield different answers. Using a simple procedure that collects challenging VQA samples from natural image-text corpora (e.g., Flickr and DOCCI) with foundation models like CLIP and ChatGPT, we build NaturalBench, a new vision-centric VQA benchmark of 10,000 human-verified samples for reliably evaluating VLMs. We note several interesting findings:

  1. NaturalBench is hard. We evaluate 53 popular VLMs, including BLIP-3, mPLUG-Owl2, InternLM-XC2, LLaVA-OneVision, Llama3.2, InternVL2, Cambrian-1, Qwen2-VL, and Molmo. Most achieve only 1%-20% above random-chance performance. Even the best (closed-source) model, GPT-4o, lags 50% behind human performance (which is above 90%).
  2. NaturalBench is compositional. NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning like logic and counting. Unlike previous work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for more fine-grained evaluation.
  3. NaturalBench exposes significant biases in VLMs. Most VLMs choose the same answer regardless of the input image (or question). We show that debiasing can be crucial for better performance.

Why NaturalBench?

There are already many challenging visual-question-answering (VQA) benchmarks like MMMU, ScienceQA, and MMBench, so why NaturalBench? It turns out that these popular VQA benchmarks aren't as "visual" as they seem: many questions can be answered using commonsense priors, without relying on the image. For instance, ScienceQA-IMG asks "What is the capital of Massachusetts?", which a blind ChatGPT easily answers with "Boston". Below, we highlight several such examples from six previous benchmarks (MME, MMBench, ScienceQA, MMMU, MMStar, AI2D) that can be solved without looking at the image:

Image illustrating blind priors

What's more, even carefully constructed benchmarks like MME and MMStar can suffer from imbalanced answers. For example, MME's question "Does this artwork exist in the form of a painting?" is answered "Yes" 97.5% of the time! We show that finetuning a "blind" GPT-3.5 — using only text and no images — on a random half of each benchmark allows it to significantly outperform random chance (see the red dotted line) on the other half. In many cases, blind GPT-3.5 even matches or surpasses LLaVA-1.5 finetuned on the same data but with images!

Image illustrating finetuning results
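To see how much answer imbalance alone can buy, here is a minimal sketch of an image-blind majority-answer baseline. This is not the paper's GPT-3.5 finetuning setup, just a simpler illustration of the same failure mode, and the field names are hypothetical: it memorizes the most common answer per question on one half of a yes/no benchmark and predicts on the other half.

from collections import Counter, defaultdict
import random

def blind_majority_baseline(samples, seed=0):
    """Accuracy of an image-blind predictor that memorizes the majority answer
    per question string on a random half and predicts on the held-out half."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    half = len(samples) // 2
    train, test = samples[:half], samples[half:]

    per_question = defaultdict(Counter)
    global_counts = Counter()
    for s in train:
        per_question[s["question"]][s["answer"]] += 1
        global_counts[s["answer"]] += 1
    fallback = global_counts.most_common(1)[0][0]  # global majority answer

    correct = 0
    for s in test:
        counts = per_question.get(s["question"])
        pred = counts.most_common(1)[0][0] if counts else fallback
        correct += (pred == s["answer"])
    return correct / len(test)

# A 97.5%-"Yes" benchmark (like the MME example above) lets this blind
# baseline reach roughly 97% accuracy, far above the 50% chance level of a
# balanced yes/no task.
toy = [{"question": "Does this artwork exist in the form of a painting?",
        "answer": "No" if i % 40 == 0 else "Yes"} for i in range(1000)]
print(blind_majority_baseline(toy))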

These issues drive us to create NaturalBench, a vision-centric VQA benchmark immune to blind solutions.

Collecting a Vision-Centric VQA Benchmark

NaturalBench prevents blind solutions by pairing two questions and two images per sample, with answers that alternate across the pair. As a result, a model that ignores the image (or the question) and returns the same answer cannot get all four (image, question) pairs right. We use a semi-automated pipeline to collect NaturalBench:

Image illustrating NaturalBench collection

Given a natural image-text dataset like Flickr30K: (1) we first identify confounding image-text pairs that fail VLMs like CLIP, e.g., when CLIP matches an image to another image's caption; (2) we then prompt ChatGPT to generate, from the provided captions, questions that have different answers for the two images; (3) finally, human annotators filter out incorrect VQA samples. Unlike previous adversarial benchmarks, NaturalBench does not perturb images or questions but produces natural adversarial samples: questions about natural images that are challenging for VLMs yet easy for humans. Below are some examples from NaturalBench, comparing the ground-truth answers with predictions from leading VLMs like GPT-4o, Qwen2-VL, Llama3.2-Vision, and Molmo:

Image illustrating NaturalBench

As shown, while humans easily solve these simple questions about natural images (with over 90% accuracy), state-of-the-art models, including the closed-source GPT-4o, still struggle.
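To make step (1) of the collection pipeline concrete, below is a minimal sketch of flagging confounding pairs with an off-the-shelf HuggingFace CLIP checkpoint; the exact model, data, and matching criterion used for NaturalBench may differ.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def find_confounding_pairs(image_paths, captions):
    """Return index pairs (i, j) that CLIP cross-matches, i.e., each image
    scores higher with the other image's caption than with its own."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # [num_images, num_texts]

    pairs = []
    for i in range(len(image_paths)):
        for j in range(i + 1, len(image_paths)):
            # Confounding pair: each image prefers the other's caption.
            if logits[i, j] > logits[i, i] and logits[j, i] > logits[j, j]:
                pairs.append((i, j))
    return pairs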

NaturalBench challenges leading VLMs

We now show that NaturalBench poses a significant challenge to 53 state-of-the-art VLMs. To better understand model performance, beyond binary VQA accuracy (Acc), we introduce three metrics based on NaturalBench's paired image-question format (a minimal computation sketch follows the list).

  • Question Accuracy (Q-Acc): Awards a point only if a model correctly answers a question for both images.
  • Image Accuracy (I-Acc): Awards a point when a model correctly answers both questions for an image.
  • Group Accuracy (G-Acc): Awards a point only when a model correctly answers all four (image, question) pairs in a test sample.
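The sketch below (with hypothetical field names) computes these metrics from per-sample predictions; the only structure it assumes is the 2x2 grid of two questions by two images described above.

def naturalbench_metrics(samples):
    """Compute Acc, Q-Acc, I-Acc, and G-Acc for paired VQA samples.

    Each sample stores a 2x2 grid of booleans: correct[q][i] is True iff the
    model answered question q correctly on image i (field name hypothetical).
    """
    acc = q_acc = i_acc = g_acc = 0.0
    for s in samples:
        c = s["correct"]  # e.g., [[True, False], [False, True]]
        acc += sum(c[0]) + sum(c[1])  # 4 answers per sample
        q_acc += sum(all(c[q][i] for i in range(2)) for q in range(2))
        i_acc += sum(all(c[q][i] for q in range(2)) for i in range(2))
        g_acc += all(c[q][i] for q in range(2) for i in range(2))
    n = len(samples)
    return {"Acc": acc / (4 * n), "Q-Acc": q_acc / (2 * n),
            "I-Acc": i_acc / (2 * n), "G-Acc": g_acc / n}

# A model that gives the same answer regardless of the image gets each question
# right on only one of its two images, so its Q-Acc and G-Acc are both 0:
print(naturalbench_metrics([{"correct": [[True, False], [False, True]]}]))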
Below, we present the performance of VLMs across these metrics. We also highlight the performance gap (in terms of G-Acc) compared to humans in red:

Image illustrating Performance

As shown in the table, all models fall well short of human performance. Even the latest and strongest models, such as BLIP-3 (XGen-MM), Cambrian-1, LLaVA-OneVision, Llama3.2-Vision, Molmo, and Qwen2-VL, lag 55% to 70% behind humans. The best closed-source model, GPT-4o, is still 52% behind.

Why is NaturalBench challenging?

Compositionality: Solving a NaturalBench sample always requires a combination of visio-linguistic skills, including object recognition, attribute binding, relation understanding, and advanced reasoning such as logic, comparison, differentiation (instance discrimination), counting, and world knowledge. We tag each (image, question) pair with all associated skills for fine-grained analysis. In the paper, we show that even the best VLMs like GPT-4o still struggle with skills such as spatial orientation.

Image illustrating Tags
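Because every (image, question) pair carries 1 to 8 skill tags, fine-grained per-skill results can be aggregated directly from per-pair correctness; a minimal sketch (field names hypothetical):

from collections import defaultdict

def accuracy_by_skill(pairs):
    """Average accuracy per skill tag; a pair with multiple tags counts
    towards every one of its skills.

    Each pair is a dict with hypothetical keys:
      "skills"  - list of tags, e.g. ["attribute binding", "counting"]
      "correct" - whether the model answered this (image, question) pair correctly
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for p in pairs:
        for skill in p["skills"]:
            totals[skill] += 1
            hits[skill] += int(p["correct"])
    return {skill: hits[skill] / totals[skill] for skill in totals}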

Biases: NaturalBench exposes VLMs' biases towards certain answers, such as "Yes" and "B", regardless of the input image and question. Using answer likelihoods (VQAScore), we perform a scoring-based evaluation that compares the likelihood of the correct (image, question, answer) triple against the incorrect one, and show that proper debiasing can double or even triple performance, including for GPT-4o. This suggests that NaturalBench is a useful benchmark for evaluating methods that reduce biases (or hallucinations) in VLMs.

Image illustrating debiasing results
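As an illustration of scoring-based evaluation with a simple debiasing step, the sketch below compares answer likelihoods and subtracts each answer's mean score across the benchmark before taking the argmax; this calibration-style correction is one plausible way to remove a constant preference for answers like "Yes", not necessarily the exact procedure used in the paper.

import numpy as np

def debiased_predictions(scores, candidates=("Yes", "No")):
    """Scoring-based prediction with a per-answer bias correction.

    scores[k][a] is the model's (log-)likelihood of answer a for the k-th
    (image, question) pair (structure hypothetical). A raw argmax inherits the
    model's constant preference for certain answers; subtracting each answer's
    mean score across the benchmark removes that constant bias.
    """
    raw = np.array([[s[a] for a in candidates] for s in scores])  # [N, 2]
    debiased = raw - raw.mean(axis=0, keepdims=True)              # per-answer mean
    raw_pred = [candidates[i] for i in raw.argmax(axis=1)]
    debiased_pred = [candidates[i] for i in debiased.argmax(axis=1)]
    return raw_pred, debiased_pred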

Towards Dynamic Evaluation

Since benchmarks often leak into foundation models' training data, it is crucial to update benchmarks using new data sources. Our benchmark curation method can be easily adapted to new image-text datasets. We expand NaturalBench by incorporating two recently proposed datasets: (1) DOCCI with fine-grained captions over 100 words, and (2) XM3600 with captions in Chinese and Hindi. We hope our efforts will inspire future work on dynamic evaluation of VLMs.

Future Work

Beyond model evaluation, NaturalBench is also useful for model development. For example, we show in the paper that vision finetuning of GPT-4o and LLaVA on half of NaturalBench can significantly boost their performance on the other half:

Image illustrating Performance Boost of Vision Finetuning

Given the promising results, we plan to scale up NaturalBench for model post-training in future work.

BibTeX

@inproceedings{naturalbench,
  title={NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples},
  author={Li, Baiqi and Lin, Zhiqiu and Peng, Wenxuan and Nyandwi, Jean de Dieu and Jiang, Daniel and Ma, Zixian and Khanuja, Simran and Krishna, Ranjay and Neubig, Graham and Ramanan, Deva},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
}