CVPR 2026 Highlight
Top 3%

Before AI can generate professional videos, it needs to see like a professional.

Building a Precise Video Language
with Human–AI Oversight

A year with 100+ content creators teaching AI to describe video like a filmmaker — with a structured specification, scalable human–AI oversight, and post-training that lets an 8B model surpass GPT-5 and Gemini-3.1.

Zhiqiu Lin1, Chancharik Mitra1, Siyuan Cen1, Isaac Li1, Yuhan Huang1, Yu Tong Tiffany Ling1, Hewei Wang3, Irene Pi1, Shihang Zhu1, Ryan Rao1, George Liu1, Jiaxi Li1, Ruojin Li1, Yili Han1, Yilun Du2, Deva Ramanan1

1Carnegie Mellon University    2Harvard University    3Apple

A recipe for precise video language

VLMs already write fluent captions — but they hallucinate what they see, miss how the camera moves, and confuse left from right. Humans, on the other hand, make typos, describe events out of order, and lack the vocabulary for cinematography. CHAI combines their strengths: models write, humans critique.

1

Precise Specification

A structured video language spanning subjects, scenes, motion, spatial framing, and camera dynamics — grounded by 200+ visual primitives co-designed with professional cinematographers.

2

Scalable Oversight

LLMs write more fluently than most humans, but hallucinate what they see. CHAI lets AI draft captions and humans critique — shifting effort from generation to verification.

3

Post-Training

Our Qwen3-VL-8B surpasses Gemini-3.1 and GPT-5. Human critiques turn AI drafts into accurate captions, yielding signals for SFT, DPO, reward modeling, and inference-time scaling.

4

Better Generation

We fine-tune Wan to follow our detailed cinematic prompts of up to 400 words — precise control over dolly zoom, rack focus, speed ramps, Dutch angles, POVs, and more.

CHAI Recipe Overview

Figure 1. Our recipe for precise video language. Red (top): prior work lacks specification and oversight, leading to imprecise terminology, hallucinations, and poor writing. Blue (bottom): CHAI combines structured specification, critique-based oversight, and post-training — which in turn unlock professional-quality video generation.

Teaching AI the shared language of cinema

Without a clear specification, video captions use imprecise terms (confusing dolly-in with zoom-in), miss key details (camera shake, focus shifts), and inject subjective language ("inspiring atmosphere") instead of describing what's visible. Filmmakers don't have this problem: they use precise terms like rack focus, Dutch angle, and medium full shot to coordinate on set. We formalized this shared vocabulary.

Specification comparison

Figure 2. Prior datasets (red) suffer from imprecise terminology, missing information, and subjective descriptions. Our specification (blue) is co-designed with 100+ professional creators over a year-long collaboration.

A structured specification covering 5 aspects:
🧑 Subject · 🏞️ Scene · 🏃 Motion · 📐 Spatial · 🎥 Camera
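
To make the five-aspect structure concrete, here is a minimal sketch of what a structured caption record could look like in code. The field names and example values are illustrative assumptions, not the released CHAI schema.

```python
from dataclasses import dataclass

# Hypothetical record for a caption under the five-aspect specification.
# Field names are illustrative, not the released CHAI schema.
@dataclass
class StructuredCaption:
    subject: str   # who or what is in frame
    scene: str     # setting, time of day, lighting
    motion: str    # subject and object movement over time
    spatial: str   # framing and composition, e.g. "medium full shot"
    camera: str    # camera work, e.g. "slow dolly-in with a rack focus"

    def to_prompt(self) -> str:
        """Join the five aspects into a single flat caption string."""
        return " ".join([self.subject, self.scene, self.motion,
                         self.spatial, self.camera])

# Example instance (invented for illustration):
cap = StructuredCaption(
    subject="a chef in a white apron",
    scene="in a dim commercial kitchen at night",
    motion="chops vegetables in quick, even strokes",
    spatial="medium full shot, subject left of center",
    camera="slow dolly-in with a shallow rack focus",
)
```

Each aspect is further decomposed into sub-aspects grounded by the 200+ visual primitives (see the taxonomy below); this sketch only shows the top level.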

CHAI taxonomy

Figure 3. The full taxonomy — each of the 5 aspects is decomposed into sub-aspects, grounded by 200+ visual and motion primitives.

AI writes. Humans critique.

Specification tells you what to describe. But who does the writing? Humans write captions with typos, grammar errors, and events out of order. Models write fluently but hallucinate objects and motion that don't exist. Both confuse left vs. right. CHAI's insight: LLMs are already better writers than most humans, but humans are better at spotting visual errors — so let each do what they're best at.

CHAI Oversight Framework

Figure 4. The CHAI framework. 🤖 AI writes comprehensive pre-captions. 👤 Human experts critique what's wrong and how to fix it, guiding AI to produce accurate post-captions. ✅ Peer-review bonuses reward annotators for precision and reviewers for catching errors.
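
The draft-critique-revise loop in Figure 4 can be sketched as follows. This is a schematic only: `vlm_caption`, `human_critique`, and `vlm_revise` are placeholders for the model and annotator interfaces, which are not specified here.

```python
# Schematic of the CHAI oversight loop: the model drafts, humans critique,
# the model revises. All three callables are hypothetical placeholders.
def chai_round(video, vlm_caption, human_critique, vlm_revise, max_rounds=3):
    caption = vlm_caption(video)                        # AI drafts a pre-caption
    for _ in range(max_rounds):
        critique = human_critique(video, caption)       # human lists errors + fixes
        if not critique:                                # no remaining errors
            break
        caption = vlm_revise(video, caption, critique)  # AI applies the fixes
    return caption                                      # the post-caption
```

The key design choice is that human effort goes into verification (spotting visual errors) rather than generation (writing fluent prose), which is what the peer-review bonuses reward.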

An 8B model beats GPT-5 and Gemini-3.1

CHAI produces (pre-caption, critique, post-caption) triplets — unlocking three capabilities at once: captioning, reward modeling, and critique generation. The catch? Critique quality is the bottleneck.
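
One triplet can seed all three training signals at once. The sketch below shows one plausible way to derive them; the field names are our own illustration, not the paper's data format.

```python
# Hypothetical mapping from one CHAI triplet to three kinds of training
# examples. Keys and structure are illustrative, not the released format.
def triplet_to_examples(video, pre, critique, post):
    return {
        # SFT: teach the model to write the corrected caption directly.
        "caption_sft": {"input": video, "target": post},
        # DPO: the post-caption is preferred over the flawed pre-caption.
        "dpo_pair": {"prompt": video, "chosen": post, "rejected": pre},
        # Critique generation: teach the model to spot and explain errors.
        "critique_sft": {"input": (video, pre), "target": critique},
    }
```

The DPO pairs double as reward-model training data, since each pair labels which of two captions for the same video is more accurate.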

17.8
Caption BLEU-4
vs. Gemini-3.1 (5.1), GPT-5 (5.5)
86.8
Reward Accuracy
vs. Gemini-2.5 (62.0), GPT-5 (59.5)
27.5
Critique BLEU-4
vs. Gemini-3.1 (3.3), GPT-5 (2.8)
8B
Qwen3-VL parameters
SFT + DPO on CHAI data

Critique quality matters

Figure 5. A good critique is accurate, complete, and constructive. CHAI enforces all three properties by requiring critiques to directly guide model revision.

| Critique Type | Prec. | Rec. | Constr. | Caption (BLEU-4) | Reward (Acc.) |
|---|---|---|---|---|---|
| Blind Gemini-2.5 | – | – | – | 10.2 | 43.0 |
| Gemini-2.5 (w/ video) | – | – | – | 11.9 | 59.9 |
| Inaccurate | ✗ | ✓ | ✓ | 11.3 | 45.5 |
| Incomplete | ✓ | ✗ | ✓ | 11.7 | 54.7 |
| Non-constructive | ✓ | ✓ | ✗ | 12.5 | 65.0 |
| Ours (w/o QC) | ✓ | ✓ | ✓ | 13.8 | 70.7 |
| Ours (w/ QC) | ✓ | ✓ | ✓ | 17.0 | 86.8 |

Weakening any single quality dimension degrades post-training. Prior work (OpenAI GDC, MM-RLHF) collects 50%+ non-constructive critiques — just "this is wrong" without explaining the fix.

Key Findings

  • Current VLMs handle subject & scene well but struggle with motion & camera aspects
  • Explicit preference + critique supervision improves SFT and RL — Qwen3-VL-8B outperforms Gemini-3.1-Pro
  • Critique quality (precision, recall, constructiveness) is the bottleneck for post-training success
  • Inference-time scaling via best-of-N with reward models yields further gains — no extra human labels
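
The last point (best-of-N with a reward model) reduces to a few lines. `generate` and `reward` below stand in for the post-trained captioner and reward model; both are placeholders, not released APIs.

```python
# Sketch of inference-time scaling via best-of-N: sample N candidate
# captions and keep the one the reward model scores highest.
# `generate` and `reward` are hypothetical model interfaces.
def best_of_n(video, generate, reward, n=8):
    candidates = [generate(video) for _ in range(n)]
    return max(candidates, key=lambda c: reward(video, c))
```

Because the reward model was trained on existing (pre, post) preference pairs, this selection step needs no additional human labels.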
Better understanding → better generation

We re-caption large-scale professional videos (films, commercials, music videos, games) with our post-trained model and fine-tune Wan2.2 to follow detailed prompts of up to 400 words — achieving precise control over techniques current generators struggle with.

Video generation example 1
Video generation example 2

Figure 6. After fine-tuning on re-captioned data, Wan2.2 follows detailed prompts more faithfully, with finer control over camera motion, cinematography, and visual composition.

Part of a broader effort

CHAI is one pillar of our research program on precise video language for professional video understanding and generation.

🚀 We are actively advancing CHAI with larger-scale datasets and stronger video understanding models.

We welcome collaboration and funding inquiries from researchers and practitioners in video understanding, captioning, and multimodal agents for professional-level video content.

If you're interested in accessing improved data or models, please reach out at zhiqiulin98@gmail.com or cmitra@andrew.cmu.edu, or open a GitHub Issue.

Where prior captioning falls short

Before working with trained professionals, we tried crowdsourced workers with film knowledge; they still lacked the vocabulary for basic cinematography. We also evaluated 8 video–text datasets (2016–2025) and found recurring issues stemming from a lack of specification and oversight.

Crowdsourced Annotators: Seeing ≠ knowing how to describe

Crowdworkers still confuse dolly-in with zoom-in, call full shots "close-ups," and describe fisheye distortion as "circular buildings." These errors motivated our collaboration with professional creators.

Crowdsourced captioning errors

Figure 7. Crowdsourced annotators lack the visual vocabulary for common cinematic and motion effects.

Prior Datasets (2016–2025): Recurring issues from missing specification & oversight

Even recent benchmarks suffer from imprecise terminology, missing information, subjective descriptions (lack of specification), as well as poor writing, visual hallucinations, and inaccurate details (lack of oversight).

Errors from lack of specification

Figure 8. Errors caused by lack of specification: imprecise terminology, missing information, subjective descriptions.

Errors from lack of oversight

Figure 9. Errors caused by lack of oversight: poor writing, visual hallucinations, inaccurate details.

BibTeX
@inproceedings{lin2026chai,
  title={Building a Precise Video Language with Human-AI Oversight},
  author={Zhiqiu Lin and Chancharik Mitra and Siyuan Cen and Isaac Li
          and Yuhan Huang and Yu Tong Tiffany Ling and Hewei Wang
          and Irene Pi and Shihang Zhu and Ryan Rao and George Liu
          and Jiaxi Li and Ruojin Li and Yili Han and Yilun Du
          and Deva Ramanan},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}