Revisiting the Role of Language Priors in Vision-Language Models

CMU     Meta

Abstract

Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study generative VLMs that are trained for next-word generation given an image. We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks. Our first observation is that they can be repurposed for discriminative tasks (such as image-text retrieval) by simply computing the match score of generating a particular text string given an image. We call this probabilistic score the Visual Generative Pre-Training Score (VisualGPTScore). While the VisualGPTScore produces near-perfect accuracy on some retrieval benchmarks, it yields poor accuracy on others. We analyze this behavior through a probabilistic lens, pointing out that some benchmarks inadvertently capture unnatural language distributions by creating adversarial but unlikely text captions. In fact, we demonstrate that even a "blind" language model that ignores any image evidence can sometimes outperform all prior art, reminiscent of similar challenges faced by the visual-question answering (VQA) community many years ago. We derive a probabilistic post-processing scheme that controls for the amount of linguistic bias in generative VLMs at test time without having to retrain or fine-tune the model. We show that the VisualGPTScore, when appropriately debiased, is a strong zero-shot baseline for vision-language understanding, oftentimes producing state-of-the-art accuracy.

VisualGPTScore

Generative Models for Discriminative Tasks

While the performance of vision-language models (VLMs) is impressive, many open challenges remain. Recent analyses point out that VLMs often degrade to "bag-of-words" models that confuse captions such as "the horse is eating the grass" and "the grass is eating the horse". This makes it difficult to use VLMs to capture compositions of objects, attributes, and their relations. Somewhat interestingly, large language models (LLMs) trained for autoregressive next-token prediction, such as GPT, are able to discern such distinctions. A related but under-appreciated difficulty is that of benchmarking visio-linguistic reasoning. Perhaps the most well-known example in the community is that of the influential VQA benchmarks, which could be largely solved by exploiting linguistic biases in the dataset -- concretely, questions about images could often be answered by "blind" language-only models that never looked at the image. We examine this issue by revisiting the role of language priors (P(text)) in VLMs. Ironically, we find that blind language-only models still excel on many image-text retrieval tasks that assess compositional reasoning. To address these challenges, we propose a probabilistic treatment that repurposes generative VLMs for discriminative tasks (like retrieval). In particular, we set the match score for an image-text pair to be the probability that the VLM would generate that text from the given image, i.e., P(text|image). We call this probability score the Visual Generative Pre-Training Score, or VisualGPTScore. We observe that the VisualGPTScore performs surprisingly well on many benchmarks, e.g., producing near-perfect accuracy on ARO. However, it still struggles on other benchmarks such as Winoground. We analyze this performance discrepancy through a probabilistic lens by deriving the language prior P(text) from VLMs via Monte-Carlo sampling.
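To make the scoring concrete, below is a minimal sketch (not the authors' released code) of how one might compute the VisualGPTScore with an off-the-shelf image-conditioned language model. It assumes the HuggingFace BlipForConditionalGeneration interface, where passing the caption tokens as labels returns the mean per-token negative log-likelihood of the caption given the image; exact length-normalization details may differ from the paper.

import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

# Sketch: score a caption as (length-normalized) log P(text | image) under a
# generative VLM. The BLIP captioning checkpoint below is an assumption.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").eval()

@torch.no_grad()
def visual_gpt_score(image, caption):
    inputs = processor(images=image, text=caption, return_tensors="pt")
    out = model(pixel_values=inputs["pixel_values"],
                input_ids=inputs["input_ids"],
                labels=inputs["input_ids"])
    # out.loss is the mean per-token negative log-likelihood of the caption
    # conditioned on the image; its negation serves as the match score.
    return -out.loss.item()

Image-to-text retrieval then amounts to ranking the candidate captions of an image by visual_gpt_score(image, caption).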

VisualGPTScore Method

The Role of Language Priors in VLMs

Our key insight is that many benchmark biases can be formalized as mismatching distributions over text between train and test data: Ptrain(text) versus Ptest(text). We use a first-principles analysis to account for this distribution shift by simply reweighting the VisualGPTScore with the Bayes factor Ptest(text)/Ptrain(text), a process we call debiasing. To compute the Bayes reweighting factor, we need access to both the train and test language priors. We compute Ptrain(text) from an off-the-shelf (OTS) VLM by drawing Monte-Carlo samples of Ptrain(text|image) over trainset images or Gaussian noise images. Because computing Ptest(text) may require access to the test set, we explore simplifying assumptions that it is (a) identical to Ptrain(text) (Scenario 1), (b) uninformative/uniform (Scenario 2), or (c) tunable from a held-out val set. Our analysis helps explain the strong performance of the VisualGPTScore on certain benchmarks and its poor performance on others. Moreover, it offers simple strategies to improve performance through debiasing.
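As a concrete illustration, the sketch below (a companion to visual_gpt_score above, not the authors' implementation) estimates log Ptrain(text) by Monte-Carlo averaging Ptrain(text|image) over Gaussian noise images and then applies α-debiasing: α=0 recovers the raw VisualGPTScore, while α=1 fully divides out the training prior. The noise-image resolution and sample count are illustrative assumptions.

import math
import numpy as np
from PIL import Image

def gaussian_noise_image(size=384):
    # A "content-free" image: i.i.d. Gaussian pixel noise clipped to [0, 255].
    pixels = np.clip(np.random.randn(size, size, 3) * 64 + 128, 0, 255)
    return Image.fromarray(pixels.astype("uint8"))

def log_prior(caption, num_samples=8):
    # log Ptrain(text) ~= log (1/N) sum_k Ptrain(text | noise image k),
    # computed with a numerically stable log-sum-exp over the noise samples.
    logps = [visual_gpt_score(gaussian_noise_image(), caption)
             for _ in range(num_samples)]
    m = max(logps)
    return m + math.log(sum(math.exp(lp - m) for lp in logps) / num_samples)

def debiased_score(image, caption, alpha=1.0):
    # log [ Ptrain(text | image) / Ptrain(text)^alpha ]:
    # alpha = 0 is the raw VisualGPTScore; alpha = 1 removes the language prior.
    return visual_gpt_score(image, caption) - alpha * log_prior(caption)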

Image illustrating the VisualGPTScore method: Scenario 1 (left) vs. Scenario 2 (right)

Scenario 1 (left) constructs negative captions by shuffling words in the true caption (as in ARO-Flickr), but this produces implausible text such as "white a duck spreads its wings in while the water". Here, exploiting the language bias of the training set will help, since it will downweight the match score for such implausible negative captions. In fact, we show that a blind language-only model can easily identify the correct caption. Scenario 2 (right) constructs negative captions that are curated to be plausible (as in SugarCrepe). Here, the language bias of the training set may hurt, since it will prefer to match common captions that score well under the language prior; i.e., the incorrect caption "people are cooking in a kitchen" is more likely than the true caption "people are posing in a kitchen" under the language prior, and so removing the language bias improves performance.
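To illustrate the Scenario 1 effect, here is a small sketch of a blind, language-only baseline that scores candidate captions with a text-only LM and never consults the image. GPT-2 is used purely for illustration, and the "plausible" caption below is an invented stand-in for a real positive.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def blind_score(caption):
    # Mean per-token log-likelihood of the caption under a text-only LM;
    # the image is never used.
    ids = tok(caption, return_tensors="pt").input_ids
    return -lm(input_ids=ids, labels=ids).loss.item()

shuffled = "white a duck spreads its wings in while the water"  # ARO-style negative
plausible = "a white duck spreads its wings in the water"       # illustrative positive
print(blind_score(plausible) > blind_score(shuffled))           # expected: True

Under Scenario 2, both candidates are fluent English, so a blind score like this no longer separates them; removing the language bias is what helps there.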

SOTA performance across ARO/Crepe/SugarCrepe/VL-CheckList

We implement the OTS VisualGPTScore using the open-source image-conditioned language model BLIP and achieve SOTA performance on all recent image-to-text (I-to-T) retrieval tasks, oftentimes surpassing prior art by a large margin. Notably, these prior approaches typically require costly fine-tuning of CLIP with much more curated data; e.g., DAC uses multiple foundation models, including ChatGPT and SAM, to perform data augmentation.

We begin by evaluating blind language models (in red). Surprisingly, this already produces SOTA accuracy on certain benchmarks such as ARO-Flickr compared to the best discriminative approaches (in gray). We also find that blind inference from generative VLMs, i.e., estimating Ptrain(t) by sampling Gaussian noise images (in blue), often performs even better, achieving above-chance performance even on the most recent SugarCrepe. Next, we show that simply repurposing a generative VLM's language generation head to compute image-text scores (VisualGPTScore, in yellow), which corresponds to α = 0, consistently produces SOTA accuracy across all benchmarks. Finally, debiasing this score by tuning α on a held-out val set (in green) further improves performance, establishing a new SOTA.
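For completeness, here is a hypothetical sketch of how the tuned α could be selected by grid search, assuming a held-out val set given as (image, positive caption, negative captions) triples and the debiased_score helper sketched earlier; the grid and data format are assumptions, not the authors' exact protocol.

import numpy as np

def select_alpha(val_set, alphas=np.linspace(0.0, 1.0, 11)):
    # val_set: list of (image, positive_caption, negative_captions) triples.
    # Return the alpha that maximizes image-to-text retrieval accuracy.
    def accuracy(alpha):
        correct = 0
        for image, pos, negs in val_set:
            candidates = [pos] + list(negs)
            scores = [debiased_score(image, c, alpha) for c in candidates]
            correct += candidates[int(np.argmax(scores))] == pos
        return correct / len(val_set)
    return max(alphas, key=accuracy)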

Additional Experimental Results

We show that our debiasing solution based on Gaussian noise images consistently improves the VisualGPTScore on both balanced compositionality benchmarks (Winoground/EqBen) and large-scale retrieval benchmarks (COCO/Flickr30K/ImageNet).

Image showing Winoground/COCO/ImageNet results

While OTS generative scores do not work well on these benchmarks, debiasing with a larger α (close to 1) consistently and often significantly improves I-to-T performance. To highlight the improvement, we mark results without debiasing (α=0, in yellow), debiasing with a fixed α=1 (in pink), and debiasing with α cross-validated on held-out val sets (α=α*val, in green).

Furthermore, OTS VisualGPTScore achieves robust text-to-image (T-to-I) retrieval performance across the board, competitive with the ITMScore.

Image showing T2I results

Limitations and Future Work

Our analysis rests on simplifying assumptions. For instance, the model might not accurately represent Ptrain(text|image), a phenomenon we examine in our paper. Estimating Ptrain(text) by sampling Gaussian noise images is potentially imprecise, and we encourage future VLMs to directly model Ptrain(text) or to explore better debiasing techniques.

BibTeX

@article{lin2023revisiting,
  title={Revisiting the Role of Language Priors in Vision-Language Models},
  author={Lin, Zhiqiu and Chen, Xinyue and Pathak, Deepak and Zhang, Pengchuan and Ramanan, Deva},
  journal={arXiv preprint arXiv:2306.01879},
  year={2023}
}