Hi! I am a final-year PhD student at the Robotics Institute of Carnegie Mellon University, advised by Prof. Deva Ramanan. I did my undergrad in Computer Science and Mathematics at Cornell University, where I served as a College Symbol Bearer (top 5 students of the college). My current research focuses on computer vision and language, especially evaluating and improving multimodal generative models.

🔥 News

📝 Publications

NeurIPS 2024 (Datasets and Benchmarks)

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Zhiqiu Lin*, Baiqi Li*, Wenxuan Peng*, Jean de Nyandwi*, Zixian Ma, Simran Khanuja, Ranjay Krishna*, Graham Neubig*, Deva Ramanan*

Website | arXiv

  • We present NaturalBench, a vision-centric VQA benchmark with simple questions about natural images that humans answer easily but that significantly challenge state-of-the-art models like GPT-4o, Molmo, Llama3.2, LLaVA-OneVision, and Qwen2-VL.
  • We show that debiasing can nearly double model performance, even for GPT-4o!
Best Short Paper at SynData@CVPR 2024

GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

Zhiqiu Lin*, Baiqi Li*, Deepak Pathak, Emily Li, Yixin Fei, Kewen Wu, Xide Xia*, Pengchuan Zhang*, Graham Neubig*, Deva Ramanan*

Website | arXiv

  • We propose GenAI-Bench, a comprehensive benchmark for compositional text-to-visual generation, with prompts collected from professional designers.
  • We release over 80,000 human ratings to support future evaluation of automated metrics.
  • We show that VQAScore can be used to improve black-box generative models such as DALL-E 3 (see the reranking sketch below)!
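The improvement recipe is best-of-n reranking: sample several candidates from the black-box model and keep the one VQAScore rates highest. A hypothetical one-liner, assuming a `vqascore(image, text)` scorer like the one sketched under the ECCV 2024 entry below:

```python
# Best-of-n reranking with VQAScore; `candidates` are images sampled from a
# black-box T2I model (e.g., DALL-E 3) for `prompt`, and vqascore() is assumed.
best_image = max(candidates, key=lambda image: vqascore(image, prompt))
```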
ECCV 2024

Evaluating Text-to-Visual Generation with Image-to-Text Generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Emily Li, Xide Xia, Graham Neubig, Pengchuan Zhang*, Deva Ramanan*

Website | arXiv

  • We propose VQAScore, the state-of-the-art alignment metric for text-to-image/video/3D models (see the sketch below).
  • VQAScore based on our new CLIP-FlanT5 model outperforms previous metrics based on GPT-4Vision or costly human feedback.
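In essence, VQAScore asks a VQA model a yes-or-no question about the candidate text and reads off the probability of the answer "Yes". Below is a minimal sketch of that idea, assuming a generic HuggingFace-style encoder-decoder VQA model; `model` and `processor` are placeholders, not our released implementation.

```python
# Minimal sketch of the VQAScore idea; `model` and `processor` are placeholders
# for any HuggingFace-style image-conditioned encoder-decoder VQA model.
import torch

def vqascore(model, processor, image, text: str) -> float:
    """Return P("Yes" | image, question) as the image-text alignment score."""
    question = f'Does this figure show "{text}"? Please answer yes or no.'
    inputs = processor(images=image, text=question, return_tensors="pt")
    # Feed only the decoder start token and read the probability that the
    # first generated answer token is "Yes".
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits
    yes_id = processor.tokenizer("Yes", add_special_tokens=False).input_ids[0]
    return logits[0, -1].softmax(dim=-1)[yes_id].item()
```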
ICML 2024

Revisiting the Role of Language Priors in Vision-Language Models (VisualGPTScore)

Zhiqiu Lin*, Xinyue Chen*, Deepak Pathak, Pengchuan Zhang, Deva Ramanan

Website | arXiv

  • We use generative VLMs to implement the Visual Generative Pre-Training Score (VisualGPTScore), i.e., the probability of generating a text given an image (see the sketch below).
  • Such a generative score achieves top-tier image-text retrieval performance on multiple compositionality benchmarks, surpassing all discriminative approaches by a large margin.
  • We further investigate the role of the language prior P(text) through a probabilistic lens, and introduce a debiasing solution that consistently improves VisualGPTScore under train-test distribution shifts over text.
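Concretely, the score is the (length-normalized) likelihood a generative VLM assigns to the text conditioned on the image. A minimal sketch, assuming a HuggingFace-style VLM that accepts `labels` for teacher-forced scoring; `model` and `processor` are placeholders:

```python
# Hedged sketch of VisualGPTScore: length-normalized P(text | image) under a
# generative VLM. `model` and `processor` are placeholders for any
# HuggingFace-style image-conditioned language model that accepts `labels`.
import torch

def visual_gpt_score(model, processor, image, text: str) -> float:
    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        # Teacher-forcing the caption makes `loss` the mean per-token
        # negative log-likelihood of the text given the image.
        out = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(-out.loss).item()  # geometric mean of per-token P(text | image)
```

The debiased variant discounts the language prior, e.g., by also scoring the same text against uninformative (blank or noise) images to estimate P(text).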
CVPR 2024

Language Models as Black-Box Optimizers for Vision-Language Models

Zhiqiu Lin*, Shihong Liu*, Samuel Yu*, Ryan Lee, Tiffany Ling, Deepak Pathak, Deva Ramanan

Website | arXiv

  • We use ChatGPT to effectively optimize vision-language models without white-box access to model weights or gradients (see the sketch below).
  • We show successful applications in visual classification, text-to-image generation, and personalization.
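The optimization loop itself is simple: show the LLM the best-scoring prompt templates found so far and ask it to propose a better one. Here is a minimal sketch under assumptions; the model name, instructions, and the `evaluate` stub (a random placeholder standing in for, e.g., few-shot CLIP accuracy on a validation set) are not the paper's exact setup.

```python
# Hedged sketch of LLM-as-black-box-optimizer; the prompts, model name, and
# evaluate() scorer below are placeholders, not the paper's exact setup.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def evaluate(template: str) -> float:
    # Placeholder scorer: in practice, e.g., few-shot classification accuracy
    # of a VLM that uses `template` as its prompt.
    return random.random()

history = [("a photo of a {}.", evaluate("a photo of a {}."))]
for _ in range(10):
    top = sorted(history, key=lambda x: -x[1])[:5]
    feedback = "\n".join(f"{score:.3f}: {template}" for template, score in top)
    request = (
        "These CLIP prompt templates were scored on a validation set "
        "(higher is better):\n" + feedback +
        "\nWrite one new template likely to score higher. Use {} as the "
        "class-name placeholder. Reply with the template only."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": request}],
    ).choices[0].message.content.strip()
    history.append((reply, evaluate(reply)))

best_template, best_score = max(history, key=lambda x: x[1])
```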
CVPR 2024

The Neglected Tails of Vision-Language Models

Zhiqiu Lin*, Shubham Parashar*, Tian Liu*, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, Shu Kong

Website | arXiv

  • Popular vision-language models (CLIP, MetaCLIP, OpenCLIP) are all long-tailed learners trained on drastically imbalanced web data, causing biases in downstream applications such as visual chatbots (GPT-4Vision) and T2I generation (Stable Diffusion, DALL-E 3).
  • We fix these biases through our SOTA prompting and retrieval-augmented strategies.
CVPR 2023

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Zhiqiu Lin*, Samuel Yu*, Zhiyi Kuang, Deepak Pathak, Deva Ramanan

Website | arXiv

  • We propose a simple cross-modal adaptation method for multimodal models that repurposes information from other modalities (e.g., class names and audio clips) as additional training samples (see the sketch below).
  • For CLIP, it achieves SOTA few-shot adaptation performance even with a simple linear probe, and consistently improves prior art such as prompting, adapters, and weight ensembling.
  • Audiovisual experiments with AudioCLIP suggest that one can learn a better visual dog classifier by listening to dogs bark.
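The method needs no architectural changes: because CLIP embeds images and text in a shared space, embedded class names can simply be appended to the few-shot image embeddings as extra one-per-class training samples. A minimal sketch, assuming the open_clip package and random tensors standing in for real few-shot data:

```python
# Minimal sketch of cross-modal adaptation with CLIP via the open_clip package.
# Random tensors stand in for real few-shot data; the key step is that embedded
# class names join the image embeddings as one extra training sample per class.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

classnames = ["dog", "cat"]                    # placeholder classes
images = torch.randn(8, 3, 224, 224)           # placeholder preprocessed images
image_labels = torch.randint(0, len(classnames), (8,))

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(tokenizer([f"a photo of a {c}." for c in classnames]))

# Cross-modal training set: image samples plus one text sample per class,
# all living in CLIP's shared embedding space.
feats = F.normalize(torch.cat([img_feats, txt_feats]), dim=-1)
labels = torch.cat([image_labels, torch.arange(len(classnames))])

probe = torch.nn.Linear(feats.shape[1], len(classnames))
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3, weight_decay=1e-4)
for _ in range(200):  # standard linear-probe training
    opt.zero_grad()
    F.cross_entropy(probe(feats), labels).backward()
    opt.step()
```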
NeurIPS 2022

LECO: Continual Learning with Evolving Class Ontologies

Zhiqiu Lin, Deepak Pathak, Yu-Xiong Wang, Deva Ramanan*, Shu Kong*

Website | arXiv | NeurIPS’22 Talk

  • A practical lifelong vision benchmark motivated by real-world dataset versioning issues, e.g., Mapillary 1.2 to 2.0.
  • We present simple but effective solutions, such as joint training, semi-supervised learning, and learning with partial labels, to address inconsistent annotations (both coarse-grained and fine-grained).
NeurIPS 2021 (Datasets and Benchmarks)

The CLEAR Benchmark: Continual LEArning on Real-World Imagery

Zhiqiu Lin, Jia Shi, Deepak Pathak*, Deva Ramanan*

CLEAR Wiki | NeurIPS Paper Site | arXiv | CVPR’22 Talk

  • The first continual benchmark for visual recognition with natural distribution shifts over a decade!
  • CLEAR has 10-class and 100-class versions (download links), similar to the famous CIFAR-10 and CIFAR-100 benchmarks.
  • The 1st CLEAR challenge was hosted on June 19th, 2022, with 79 participants from 21 countries and regions signing up!
CVPR 2020 (Best Paper Nomination)

Visual Chirality

Zhiqiu Lin, Jin Sun, Abe Davis, Noah Snavely

Website | arXiv | Video

  • How does reflection change what we learn from images? Despite the widespread use of reflection in data augmentation, this question had not been closely studied before our work.

🎖 Honors and Awards

  • 2020.06 Best Paper Nomination at CVPR’20 for Visual Chirality!
  • 2020.05 Graduated Summa Cum Laude in Computer Science and Mathematics from Cornell University, and served as a College Symbol Bearer (top 5 students of the college).

📖 Education

  • 2020.09 - present, PhD student, Carnegie Mellon University.
  • 2016.09 - 2020.06, Undergraduate, Cornell University.

💬 Invited Talks

💻 Services

  • Organizer: CVPR’22 VPLOW Workshop (Challenge Track)
  • Reviewer: ECCV, CVPR (Outstanding reviewer), ICCV, NeurIPS, ICML.
  • Teaching (CMU): Learning-based Image Synthesis and Advanced Computer Vision
  • Teaching (Cornell): Advanced Machine Learning, Cornell Tech Pre-Master Program, Functional Programming, Algorithm Analysis, Data Structures, Computer Vision