[Best Short Paper @ CVPR'24 SynData]

GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

1Carnegie Mellon University, 2Meta
*Co-first authors, †Co-senior authors

Abstract

While text-to-visual models now produce photo-realistic images and videos, they still struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. We introduce GenAI-Bench to evaluate state-of-the-art generative models across these aspects of compositional text-to-visual generation. Our contributions are as follows:

  1. GenAI-Bench. We collect 1,600 text prompts from graphic designers who use text-to-visual tools like Midjourney in their profession. Compared to previous benchmarks like PartiPrompts and T2I-CompBench, GenAI-Bench covers a wider range of compositional reasoning skills and poses a greater challenge to leading generative models.
  2. Human Studies. We hire human annotators to rate 10 leading open-source and closed-source generative models, such as DALL-E 3, Stable Diffusion, Pika, and Gen2. We will release 38,400 human ratings for benchmarking automated evaluation metrics. In addition, our studies highlight that VQAScore correlates better with human judgments on compositional text prompts than existing metrics like CLIPScore, PickScore, HPSv2, and Davidsonian Scene Graph.
  3. Improving Generation. We show that VQAScore can also improve image generation by simply selecting the highest-VQAScore image from a few generated candidates. This method can even enhance state-of-the-art API-based (black-box) models like DALL-E 3! We will release the GenAI-Rank benchmark with 43,200 human ratings on DALL-E 3 and SD-XL images generated from the same prompts.

GenAI-Bench for Text-to-Visual Evaluation

Image illustrating GenAI-Bench

GenAI-Bench reflects how users seek precise control in text-to-visual generation via compositional prompts (shown in green). For example, users might add details by specifying basic compositions of objects, scenes, attributes, and relationships (spatial/action/part, shown in gray). Additionally, user prompts may involve advanced reasoning, including counting, comparison, differentiation, and logic (negation/universality, shown in blue). We carefully define and label these skills in the paper.

Image showing GenAI-Bench collection

We collect prompts from professional designers to ensure GenAI-Bench reflects real-world needs. Designers write prompts on general topics (e.g., food, animals, household objects) without copyrighted characters or celebrities. We carefully tag each prompt with its evaluated skills and hire human annotators to rate images and videos generated by state-of-the-art models.

Image comparing GenAI-Bench to other benchmarks

Compared to existing benchmarks, GenAI-Bench covers more crucial aspects of compositional text-to-visual generation and poses a greater challenge to leading generative models:

Image comparing GenAI-Bench to other benchmarks

Human Studies

We evaluate 6 text-to-image models: Stable Diffusion (SD v2.1, SD-XL, SD-XL Turbo), DeepFloyd-IF, Midjourney v6, and DALL-E 3, along with 4 text-to-video models: ModelScope, Floor33, Pika v1, and Gen2. For human evaluation, we hire three annotators to collect 1-to-5 Likert-scale ratings of image-text or video-text alignment, following the annotation protocol recommended by Otani et al. (CVPR 2023).

Image illustrating human evaluation

These human ratings also allow us to benchmark automated evaluation metrics. We find that VQAScore correlates better with human ratings than all prior art, including CLIPScore, PickScore, ImageReward, HPSv2, TIFA, VQ2, and Davidsonian Scene Graph.
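As a concrete illustration, the agreement between an automated metric and human ratings can be measured along these lines. This is a minimal sketch, not the paper's exact evaluation protocol; the score and rating arrays below are hypothetical placeholders.

# Minimal sketch: correlating an automated metric with 1-to-5 human ratings.
# The arrays are hypothetical placeholders; the paper's exact correlation
# protocol may differ from this simple Kendall's tau computation.
from scipy.stats import kendalltau

# One entry per (prompt, image) pair, aligned by index.
metric_scores = [0.91, 0.42, 0.77, 0.13, 0.66]   # e.g., VQAScore outputs
human_ratings = [5, 2, 4, 1, 3]                  # 1-to-5 Likert ratings

tau, p_value = kendalltau(metric_scores, human_ratings)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3g})")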

Image explaining VQAScore

VQAScore is remarkably simple yet effective. It can be computed end-to-end with an off-the-shelf VQA model as the probability of answering 'Yes' to a simple question about the image, as shown below. We refer the reader to our VQAScore page for more details.

Image explaining VQAScore
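In code, scoring an image against a prompt takes a few lines. The following is a minimal sketch assuming the documented interface of the accompanying t2v_metrics package (pip install t2v-metrics); the image path and prompt are placeholders.

# Minimal VQAScore sketch using the t2v_metrics package.
# The image path and prompt below are placeholders.
import t2v_metrics

vqa_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# Returns the probability of answering 'Yes' to a simple question
# about the image (e.g., "Does this figure show '{text}'?").
score = vqa_score(images=['images/example.png'],
                  texts=['the brown dog chases the black dog around a tree'])
print(score)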

Improving Image Generation using VQAScore

VQAScore can improve text-to-image generation by selecting the highest-VQAScore images from a few (3 to 9) generated candidates. Below we show how we use VQAScore to improve the state-of-the-art DALL-E 3 model using its black-box API:

Image illustrating VQAScore
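The selection step itself is a simple best-of-n loop: sample a few candidates, score each with VQAScore, and keep the argmax. Below is a minimal sketch assuming the OpenAI Python client and the t2v_metrics package; the prompt is a placeholder, and since the DALL-E 3 API returns one image per call, we call it n times.

# Minimal best-of-n sketch (assumes the `openai` and `t2v_metrics` packages
# are installed and OPENAI_API_KEY is set). The prompt is a placeholder.
import requests
import t2v_metrics
from openai import OpenAI

client = OpenAI()
vqa_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

prompt = "three dogs and one cat sitting on a red couch"
n = 3  # the paper samples a few (3 to 9) candidates

paths = []
for i in range(n):
    # The DALL-E 3 API generates one image per request.
    resp = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    path = f"candidate_{i}.png"
    with open(path, "wb") as f:
        f.write(requests.get(resp.data[0].url).content)
    paths.append(path)

# Score every candidate against the prompt and keep the best one.
scores = vqa_score(images=paths, texts=[prompt])  # one score per (image, text) pair
best = max(range(n), key=lambda i: float(scores[i][0]))
print("best candidate:", paths[best])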

We perform extensive human studies to demonstrate that the highest-VQAScore images align better with the text prompts according to human judgments. Notably, VQAScore outperforms other metrics, such as CLIPScore, PickScore, and Davidsonian Scene Graph. For benchmarking purposes, we will release the GenAI-Rank benchmark with all human ratings: 800 prompts, 9 images per prompt, for 2 models (DALL-E 3 and SD-XL).

Image explaining DALLE3 ranking results

BibTeX

@article{li2024genaibench,
  title={GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation},
  author={Li, Baiqi and Lin, Zhiqiu and Pathak, Deepak and Li, Jiayao and Fei, Yixin and Wu, Kewen and Ling, Tiffany and Xia, Xide and Zhang, Pengchuan and Neubig, Graham and Ramanan, Deva},
  journal={arXiv preprint arXiv:2406.13743},
  year={2024}
}