While text-to-visual models now produce photo-realistic images and videos, they still struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. We introduce GenAI-Bench to evaluate state-of-the-art generative models across these aspects of compositional text-to-visual generation. Our contributions are as follows:
GenAI-Bench reflects how users seek precise control in text-to-visual generation using compositional prompts (shown in green). For example, users might add details by specifying basic compositions of objects, scenes, attributes, and relationships (spatial/action/part), shown in gray. Additionally, user prompts may involve advanced reasoning, including counting, comparison, differentiation, and logic (negation/universality), shown in blue. We carefully define and label these skills in the paper.
We collect prompts from professional designers to ensure GenAI-Bench reflects real-world needs. Designers write prompts on general topics (e.g., food, animals, household objects) without copyrighted characters or celebrities. We carefully tag each prompt with its evaluated skills and hire human annotators to rate images and videos generated by state-of-the-art models.
Compared to existing benchmarks, GenAI-Bench covers more crucial aspects of compositional text-to-visual generation and poses a greater challenge to leading generative models:
We evaluate 6 text-to-image models: Stable Diffusion (SD v2.1, SD-XL, SD-XL Turbo), DeepFloyd-IF, Midjourney v6, and DALL-E 3; along with 4 text-to-video models: ModelScope, Floor33, Pika v1, and Gen2. For human evaluation, we hire three annotators to collect 1-to-5 Likert-scale ratings of image-text and video-text alignment, following the annotation protocol recommended by Otani et al. (CVPR 2023).
These human ratings also allow us to benchmark automated evaluation metrics. We find that VQAScore correlates better with human ratings than all prior art, including CLIPScore, PickScore, ImageReward, HPSv2, TIFA, VQ2, and Davidsonian Scene Graph.
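To make this comparison concrete, the sketch below shows one common way to measure agreement between an automated metric and human Likert ratings using Kendall's tau; the numbers are placeholders rather than real GenAI-Bench data, and the exact protocol in our paper (e.g., pairwise accuracy following Otani et al.) may differ.

# Minimal sketch: correlate an automated metric with human Likert ratings.
# All numbers are placeholders, not real GenAI-Bench data.
import numpy as np
from scipy.stats import kendalltau

metric_scores = np.array([0.91, 0.42, 0.77, 0.15, 0.66])   # one score per generated image
human_ratings = np.array([[5, 4, 5],                        # three 1-to-5 ratings per image
                          [2, 3, 2],
                          [4, 4, 3],
                          [1, 1, 2],
                          [3, 4, 3]])

mean_ratings = human_ratings.mean(axis=1)                   # average the annotators
tau, p_value = kendalltau(metric_scores, mean_ratings)      # rank agreement with humans
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")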
VQAScore is remarkably simple yet effective. It can be computed end-to-end using an off-the-shelf VQA model as the probability of 'Yes' conditioned on the image and a simple question, as shown below. We refer the reader to our VQAScore page for more details.
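As a minimal usage sketch (assuming the interface of our t2v_metrics package; the model name, image path, and prompt below are only illustrative):

# Minimal sketch of VQAScore using the t2v_metrics package (pip install t2v-metrics).
# The model name, image path, and prompt are illustrative.
import t2v_metrics

# VQAScore is the probability of answering "Yes" to a question such as
# 'Does this figure show "{text}"? Please answer yes or no.'
vqa_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

score = vqa_score(images=['images/candidate_0.png'],
                  texts=['the dog chases the ball but there is no frisbee'])
print(score)  # higher means better image-text alignment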
VQAScore can improve text-to-image generation by selecting the highest-VQAScore images from a few (3 to 9) generated candidates. Below we show how we use VQAScore to improve the state-of-the-art DALL-E 3 model using its black-box API:
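A rough sketch of this best-of-N selection is given below, assuming the OpenAI Images API for DALL-E 3 and the t2v_metrics interface above; the candidate count, file names, and prompt are illustrative rather than our exact pipeline.

# Sketch: best-of-N reranking of DALL-E 3 outputs with VQAScore.
# Assumes OPENAI_API_KEY is set; filenames and the prompt are illustrative.
import urllib.request
from openai import OpenAI
import t2v_metrics

client = OpenAI()
vqa_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

prompt = "three cats and exactly one dog sitting on a red couch"
num_candidates = 3  # sample a few (3 to 9) candidates

# 1) Generate candidates via the black-box API (DALL-E 3 returns one image per request).
paths = []
for i in range(num_candidates):
    resp = client.images.generate(model="dall-e-3", prompt=prompt, n=1, size="1024x1024")
    path = f"candidate_{i}.png"
    urllib.request.urlretrieve(resp.data[0].url, path)
    paths.append(path)

# 2) Score every candidate against the prompt and keep the highest-VQAScore image.
scores = [float(vqa_score(images=[p], texts=[prompt])) for p in paths]
best = scores.index(max(scores))
print(f"Selected {paths[best]} with VQAScore {scores[best]:.3f}")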
We perform extensive human studies to demonstrate that images with the highest VQAScore are more aligned with text prompts according to human judgments. Notably, VQAScore outperforms other metrics such as CLIPScore, PickScore, and Davidsonian Scene Graph. For benchmarking purposes, we will release the GenAI-Rank benchmark with all human ratings: 800 prompts, 9 images per prompt, for 2 models (DALL-E 3 and SD-XL).
@article{li2024genaibench,
title={GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation},
author={Li, Baiqi and Lin, Zhiqiu and Pathak, Deepak and Li, Jiayao and Fei, Yixin and Wu, Kewen and Ling, Tiffany and Xia, Xide and Zhang, Pengchuan and Neubig, Graham and Ramanan, Deva},
journal={arXiv preprint arXiv:2406.13743},
year={2024}
}