Despite significant progress in generative AI, scientific evaluation remains challenging due to the lack of effective metrics and standardized benchmarks. For instance, the widely used CLIPScore measures the alignment between a (generated) image and a text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations.
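For context, CLIPScore reduces to a scaled cosine similarity between CLIP's image and text embeddings (the standard formulation uses a scaling factor of 2.5 and clips negative similarities to zero). The sketch below illustrates this with stand-in embedding vectors; in practice the embeddings come from CLIP's image and text encoders.

```python
import numpy as np

def clipscore(image_emb: np.ndarray, text_emb: np.ndarray, w: float = 2.5) -> float:
    """CLIPScore: w * max(cos(image_emb, text_emb), 0).

    The embeddings are assumed to come from CLIP's encoders; here we
    pass toy vectors to show the computation itself.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(w * max(np.dot(image_emb, text_emb), 0.0))

# Toy example: identical directions give the maximum score of 2.5.
img = np.array([0.6, 0.8, 0.0])
txt = np.array([0.6, 0.8, 0.0])
score = clipscore(img, txt)
```

Because the score depends only on a single pooled embedding per modality, it behaves like a bag-of-words match and cannot distinguish, e.g., "a dog chasing a cat" from "a cat chasing a dog".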
Compared to the bag-of-words CLIPScore, VQAScore based on our CLIP-FlanT5 model correlates better with human judgments on images generated from compositional text prompts that involve attribute bindings, spatial/action/part relations, and higher-order reasoning such as negation and comparison.
VQAScore is remarkably simple yet effective. It can be computed end-to-end using an off-the-shelf VQA model as the probability of 'Yes' conditioned on the image and a simple question, such as 'Does this figure show "{text}"? Please answer yes or no.'
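The computation above can be sketched as follows. For illustration we restrict the softmax to the two answer tokens "Yes" and "No" (the actual metric uses the VQA model's generative likelihood of "Yes" over the full vocabulary); the logits dictionary stands in for a real VQA model's output, which is assumed rather than implemented here.

```python
import math

def format_question(text: str) -> str:
    """The question template used by VQAScore."""
    return f'Does this figure show "{text}"? Please answer yes or no.'

def vqascore(answer_logits: dict[str, float]) -> float:
    """P('Yes' | image, question) via a softmax over answer-token logits.

    Simplified: normalizes over only the provided tokens; a real VQA
    model would produce a distribution over its whole vocabulary.
    """
    m = max(answer_logits.values())            # subtract max for stability
    exp = {tok: math.exp(v - m) for tok, v in answer_logits.items()}
    return exp["Yes"] / sum(exp.values())

# Hypothetical logits a VQA model might return for a well-aligned image.
question = format_question("a red book on a blue table")
logits = {"Yes": 3.2, "No": 0.5}
score = vqascore(logits)  # close to 1.0 when the model is confident in 'Yes'
```

The appeal of this formulation is that it needs no answer parsing or extra training: any VQA model that exposes answer probabilities can serve as the scorer.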
We find it beneficial to use a bidirectional image-question encoder that allows visual embeddings to be influenced by the question being asked (and vice versa). We operationalize this by finetuning a CLIP-FlanT5 model on public VQA datasets. This model sets a new state of the art in text-to-image/video/3D evaluation, without using costly human feedback.
We introduce GenAI-Bench, a comprehensive benchmark for compositional text-to-visual generation that challenges even leading models such as DALL-E 3 and Gen2. GenAI-Bench provides fine-grained tags for both basic (attribute/scene/relation) and advanced (counting/differentiation/comparison/logic) compositional reasoning skills. Used alongside VQAScore, GenAI-Bench enables reproducible evaluation of generative models. To verify VQAScore's agreement with human judgments, we also collect extensive human ratings on ten image and video generative models, and we plan to release these ratings to evaluate future automated metrics.
@article{lin2024evaluating,
  title={Evaluating Text-to-Visual Generation with Image-to-Text Generation},
  author={Lin, Zhiqiu and Pathak, Deepak and Li, Baiqi and Li, Jiayao and Xia, Xide and Neubig, Graham and Zhang, Pengchuan and Ramanan, Deva},
  journal={arXiv preprint arXiv:2404.01291},
  year={2024}
}