Evaluating Text-to-Visual Generation with Image-to-Text Generation

1Carnegie Mellon University, 2Meta


Despite significant progress in generative AI, scientific evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely used CLIPScore measures the alignment between a (generated) image and a text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations.

  1. VQAScore. We propose a simple metric that outperforms prior art without making use of expensive human feedback or proprietary models such as ChatGPT and GPT4-Vision.
  2. CLIP-FlanT5. Our in-house VQA model yields a state-of-the-art VQAScore for text-to-image/video/3D evaluation, offering a strong alternative to CLIPScore.
  3. GenAI-Bench. We introduce a text-to-visual benchmark with real-world compositional prompts to evaluate generative models and automated metrics, surpassing the difficulty of existing benchmarks.

VQAScore for Text-to-Visual Evaluation

Image illustrating VQAScore

Compared to the bag-of-words CLIPScore (in red), VQAScore (in green) based on our CLIP-FlanT5 model correlates better with human judgments on images generated from compositional text prompts that involve attribute bindings, spatial/action/part relations, and higher-order reasoning such as negation and comparison.

Computing VQAScore via CLIP-FlanT5

Image showing VQAScore

VQAScore is remarkably simple yet effective. It can be computed end-to-end using an off-the-shelf VQA model as the probability of 'Yes' conditioned on the image and a simple question, such as 'Does this figure show "{text}"? Please answer yes or no.'
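
The definition above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `question_for` and `yes_probability` are hypothetical helper names, and the answer logits are assumed to come from some VQA model's scoring of candidate answer tokens given the image and question (here passed in as a plain dictionary, whereas a real model would normalize over its full vocabulary).

```python
import math
from typing import Dict

# Question template quoted in the text above.
QUESTION_TEMPLATE = 'Does this figure show "{text}"? Please answer yes or no.'


def question_for(text: str) -> str:
    """Fill a text prompt into the yes/no question template."""
    return QUESTION_TEMPLATE.format(text=text)


def yes_probability(answer_logits: Dict[str, float], yes_token: str = "Yes") -> float:
    """Softmax over candidate answer logits; return the probability of `yes_token`.

    `answer_logits` is a hypothetical stand-in for the scores a VQA model
    assigns to candidate answers, conditioned on (image, question).
    """
    max_logit = max(answer_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in answer_logits.items()}
    return exps[yes_token] / sum(exps.values())
```

In this sketch, the score for an image-text pair would be `yes_probability(...)` evaluated on the logits the model produces for `question_for(text)`; a higher value indicates that the model considers the image a better match for the text.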

Image showing CLIP-FlanT5

We find it beneficial to use a bidirectional image-question encoder, which allows visual embeddings to be influenced by the question being asked (and vice versa). We operationalize this by fine-tuning a CLIP-FlanT5 model on public VQA datasets. This model sets a new state of the art in text-to-image/video/3D evaluation without using costly human feedback.
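
To make the idea of bidirectional encoding concrete, the sketch below builds a toy attention mask and checks whether image tokens can attend to question tokens. It is an illustration only: the token layout (image tokens first, then question tokens), the contrast with a causal left-to-right mask, and the function names are all assumptions made for this example, not details taken from the CLIP-FlanT5 architecture.

```python
from typing import List


def attention_mask(num_image: int, num_question: int, bidirectional: bool) -> List[List[int]]:
    """Return mask[i][j] = 1 if token i may attend to token j.

    Tokens are assumed to be ordered as image tokens followed by question tokens.
    """
    n = num_image + num_question
    if bidirectional:
        # Every token may attend to every other token.
        return [[1] * n for _ in range(n)]
    # Causal: token i may attend only to positions j <= i.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def image_sees_question(mask: List[List[int]], num_image: int) -> bool:
    """True if any image token can attend to any question token."""
    return any(
        mask[i][j]
        for i in range(num_image)
        for j in range(num_image, len(mask))
    )
```

With this layout, `image_sees_question(attention_mask(2, 3, bidirectional=True), 2)` is `True`, while the causal variant gives `False`: only under the bidirectional mask can the visual embeddings depend on the question.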


Visualization of GenAI-Bench

We introduce a comprehensive benchmark for compositional text-to-visual generation, challenging even leading models like DALL-E 3 and Gen2. GenAI-Bench provides fine-grained tags for both basic (attribute/scene/relation) and advanced (counting/differentiation/comparison/logic) compositional reasoning skills. Used alongside VQAScore, GenAI-Bench enables reproducible evaluation of generative models. To verify VQAScore's agreement with human judgments, we also collect extensive human ratings on ten image and video generative models. We plan to release these ratings to evaluate future automated metrics.
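
One way the skill tags might be used alongside a metric such as VQAScore is to average scores per tag, giving a per-skill breakdown for a generative model. The sketch below assumes a hypothetical record format of `(tags, score)` pairs; it does not reflect the actual GenAI-Bench file layout, and the numbers are placeholders chosen for readability, not benchmark results.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mean_score_by_tag(records: Iterable[Tuple[List[str], float]]) -> Dict[str, float]:
    """Average a per-prompt score over every skill tag attached to the prompt."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for tags, score in records:
        for tag in tags:
            sums[tag] += score
            counts[tag] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}


# Placeholder records (tags, score); the values are illustrative, not real results.
toy_records = [
    (["attribute", "counting"], 0.5),
    (["attribute"], 1.0),
    (["counting", "logic"], 0.0),
]

per_skill = mean_score_by_tag(toy_records)
```

A prompt carrying several tags contributes to each of them, so the per-tag averages are not independent; that is a design choice of this sketch rather than a documented protocol.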


        title={Evaluating Text-to-Visual Generation with Image-to-Text Generation},
        author={Lin, Zhiqiu and Pathak, Deepak and Li, Baiqi and Li, Jiayao and Xia, Xide and Neubig, Graham and Zhang, Pengchuan and Ramanan, Deva},
        journal={arXiv preprint arXiv:2404.01291},