Before AI can generate professional videos, it needs to see like a professional.
A year with 100+ content creators teaching AI to describe video like a filmmaker — with a structured specification, scalable human–AI oversight, and post-training that lets an 8B model surpass GPT-5 and Gemini-3.1.
¹Carnegie Mellon University · ²Harvard University · ³Apple
VLMs already write fluent captions — but they hallucinate what they see, miss how the camera moves, and confuse left from right. Humans, on the other hand, make typos, describe events out of order, and lack the vocabulary for cinematography. CHAI combines their strengths: models write, humans critique.
A structured video language spanning subjects, scenes, motion, spatial framing, and camera dynamics — grounded by 200+ visual primitives co-designed with professional cinematographers.
LLMs write more fluently than most humans, but hallucinate what they see. CHAI lets AI draft captions and humans critique — shifting effort from generation to verification.
Our Qwen3-VL-8B surpasses Gemini-3.1 and GPT-5. Human critiques turn AI drafts into accurate captions, yielding signals for SFT, DPO, reward modeling, and inference-time scaling.
We fine-tune Wan to follow our detailed cinematic prompts of up to 400 words — precise control over dolly zoom, rack focus, speed ramps, Dutch angles, POVs, and more.
Figure 1. Our recipe for precise video language. Red (top): prior work lacks specification and oversight, leading to imprecise terminology, hallucinations, and poor writing. Blue (bottom): CHAI combines structured specification, critique-based oversight, and post-training — which in turn unlock professional-quality video generation.
Without clear specification, video captions use imprecise terms (confusing dolly-in with zoom-in), miss key details (camera shake, focus shifts), and inject subjective language ("inspiring atmosphere") instead of describing what's visible. Filmmakers don't have this problem. They use precise terms like rack focus, Dutch angle, and medium full shot to coordinate on set. We formalized this shared vocabulary.
Figure 2. Prior datasets (red) suffer from imprecise terminology, missing information, and subjective descriptions. Our specification (blue) is co-designed with 100+ professional creators over a year-long collaboration.
A structured specification covering 5 aspects:
🧑 Subject · 🏞️ Scene · 🏃 Motion · 📐 Spatial · 🎥 Camera
Figure 3. The full taxonomy — each of the 5 aspects is decomposed into sub-aspects, grounded by 200+ visual and motion primitives.
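To make the taxonomy concrete, here is a minimal Python sketch of the specification as a data structure. The five aspect names come from Figure 3; the sub-aspects and primitives shown are a small illustrative subset of the 200+ terms, and every identifier below is our own placeholder rather than a released schema.

```python
from dataclasses import dataclass

@dataclass
class Aspect:
    name: str               # one of the 5 top-level aspects
    sub_aspects: list[str]  # decomposition from Figure 3 (subset)
    primitives: list[str]   # controlled-vocabulary terms (subset of 200+)

SPEC = [
    Aspect("subject", ["identity", "appearance"], ["single subject", "crowd"]),
    Aspect("scene", ["location", "lighting"], ["low-key lighting", "golden hour"]),
    Aspect("motion", ["subject motion", "speed"], ["walking", "speed ramp"]),
    Aspect("spatial", ["shot size", "angle"], ["medium full shot", "Dutch angle"]),
    Aspect("camera", ["movement", "focus"], ["dolly-in", "zoom-in", "rack focus"]),
]

def vocabulary() -> set[str]:
    """Flatten the taxonomy into its controlled vocabulary."""
    return {term for aspect in SPEC for term in aspect.primitives}
```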
Specification tells you what to describe. But who does the writing? Humans write captions with typos, grammar errors, and events out of order. Models write fluently but hallucinate objects and motion that don't exist. Both confuse left vs. right. CHAI's insight: LLMs are already better writers than most humans, but humans are better at spotting visual errors — so let each do what they're best at.
Figure 4. The CHAI framework. 🤖 AI writes comprehensive pre-captions. 👤 Human experts critique what's wrong and how to fix it, guiding AI to produce accurate post-captions. ✅ Peer-review bonuses reward annotators for precision and reviewers for catching errors.
CHAI produces (pre-caption, critique, post-caption) triplets — unlocking three capabilities at once: captioning, reward modeling, and critique generation. The catch? Critique quality is the bottleneck.
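As a hedged sketch of how one triplet feeds multiple training signals, the schema and helper names below are hypothetical (not the released data format), but the mapping follows the text directly: the verified post-caption is an SFT target, the (post, pre) pair is a DPO preference, the same pair gives positive/negative labels for reward modeling, and the critique itself supervises critique generation.

```python
from dataclasses import dataclass

@dataclass
class ChaiTriplet:
    video_id: str
    pre_caption: str   # AI draft
    critique: str      # human expert: what's wrong and how to fix it
    post_caption: str  # AI revision guided by the critique

def to_sft(t: ChaiTriplet) -> dict:
    # Supervised fine-tuning target: the human-verified revision.
    return {"video": t.video_id, "target": t.post_caption}

def to_dpo(t: ChaiTriplet) -> dict:
    # Preference pair: the revised caption is preferred over the draft.
    return {"video": t.video_id, "chosen": t.post_caption, "rejected": t.pre_caption}

def to_reward(t: ChaiTriplet) -> list[dict]:
    # Reward-model labels: score revisions above unverified drafts.
    return [{"video": t.video_id, "caption": t.post_caption, "label": 1},
            {"video": t.video_id, "caption": t.pre_caption, "label": 0}]

def to_critic_sft(t: ChaiTriplet) -> dict:
    # Critique generation: given a draft, produce the expert critique.
    return {"video": t.video_id, "draft": t.pre_caption, "target": t.critique}
```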
Figure 5. A good critique is accurate, complete, and constructive. CHAI enforces all three properties by requiring critiques to directly guide model revision.
| Critique type | Accurate | Complete | Constructive | Caption ↑ | Reward ↑ |
|---|---|---|---|---|---|
| Blind Gemini-2.5 | — | — | — | 10.2 | 43.0 |
| Gemini-2.5 (w/ video) | — | — | — | 11.9 | 59.9 |
| Inaccurate | ✗ | ✓ | ✓ | 11.3 | 45.5 |
| Incomplete | ✓ | ✗ | ✓ | 11.7 | 54.7 |
| Non-constructive | ✓ | ✓ | ✗ | 12.5 | 65.0 |
| Ours (w/o QC) | — | — | — | 13.8 | 70.7 |
| Ours (w/ QC) | ✓ | ✓ | ✓ | 17.0 | 86.8 |
Weakening any single quality dimension degrades post-training. Prior work (OpenAI GDC, MM-RLHF) collects 50%+ non-constructive critiques — just "this is wrong" without explaining the fix.
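The same reward signal also supports the inference-time scaling mentioned above; one simple form is best-of-N reranking. A minimal sketch, assuming hypothetical `captioner` and `reward_model` callables rather than a released API:

```python
def best_of_n(video, captioner, reward_model, n: int = 8) -> str:
    """Sample n candidate captions and keep the one the critique-trained
    reward model scores highest. `captioner` and `reward_model` are
    hypothetical callables, not a released API."""
    candidates = [captioner(video) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(video, c))
```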
We re-caption large-scale professional videos (films, commercials, music videos, games) with our post-trained model and fine-tune Wan2.2 to follow detailed prompts of up to 400 words — achieving precise control over techniques current generators struggle with.
Figure 6. After fine-tuning on re-captioned data, Wan2.2 follows detailed prompts more faithfully, with finer control over camera motion, cinematography, and visual composition.
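For a sense of what these detailed cinematic prompts look like, here is a hypothetical example assembled section by section from the five aspects of the specification; the wording is ours for illustration, not a released training prompt.

```python
# Hypothetical structured prompt following the 5-aspect specification.
prompt = " ".join([
    "Subject: a rain-soaked detective in a gray trench coat.",
    "Scene: a neon-lit alley at night, low-key lighting, reflections on wet asphalt.",
    "Motion: he walks toward the camera and stops; a speed ramp slows his final step.",
    "Spatial: medium full shot, subject on the left third, slight Dutch angle.",
    "Camera: a slow dolly-in with a simultaneous zoom-out (dolly zoom), "
    "then a rack focus from the subject to a flickering sign behind him.",
])
assert len(prompt.split()) <= 400  # training prompts run up to ~400 words
```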
CHAI is one pillar of our research program on precise video language for professional video understanding and generation.
🚀 We are actively advancing CHAI with larger-scale datasets and stronger video understanding models.
We welcome collaborations and funding opportunities with researchers and practitioners in video understanding, captioning, and multimodal agents for professional-level video content.
If you're interested in accessing improved data or models, please reach out at zhiqiulin98@gmail.com or cmitra@andrew.cmu.edu, or open a GitHub Issue.
Before recruiting and training professionals, we tried crowdsourced workers with film knowledge, but even they lacked the vocabulary for basic cinematography. We also evaluated 8 video–text datasets (2016–2025) and found recurring issues stemming from a lack of specification and oversight.
Crowdworkers confused dolly-in with zoom-in, called full shots "close-ups," and described fisheye distortion as "circular buildings." These errors motivated our collaboration with professional creators.
Figure 7. Crowdsourced annotators lack the visual vocabulary for common cinematic and motion effects.
Even recent benchmarks suffer from imprecise terminology, missing information, subjective descriptions (lack of specification), as well as poor writing, visual hallucinations, and inaccurate details (lack of oversight).
Figure 8. Errors caused by lack of specification: imprecise terminology, missing information, subjective descriptions.
Figure 9. Errors caused by lack of oversight: poor writing, visual hallucinations, inaccurate details.
@inproceedings{lin2026chai,
title={Building a Precise Video Language with Human-AI Oversight},
author={Zhiqiu Lin and Chancharik Mitra and Siyuan Cen and Isaac Li
and Yuhan Huang and Yu Tong Tiffany Ling and Hewei Wang
and Irene Pi and Shihang Zhu and Ryan Rao and George Liu
and Jiaxi Li and Ruojin Li and Yili Han and Yilun Du
and Deva Ramanan},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}