CVPR 2026 Highlight
Top 3%

Before AI can generate professional videos, it needs to see like a professional.

Building a Precise Video Language
with Human–AI Oversight

A year with 100+ content creators teaching AI to describe video like a filmmaker — with a structured specification, scalable human–AI oversight, and post-training that lets an 8B model surpass GPT-5 and Gemini-3.1.

Zhiqiu Lin1, Chancharik Mitra1, Siyuan Cen1, Isaac Li1, Yuhan Huang1, Yu Tong Tiffany Ling1, Hewei Wang3, Irene Pi1, Shihang Zhu1, Ryan Rao1, George Liu1, Jiaxi Li1, Ruojin Li1, Yili Han1, Yilun Du2, Deva Ramanan1

1Carnegie Mellon University    2Harvard University    3Apple

A recipe for precise video language

VLMs already write fluent captions — but they hallucinate what they see, miss how the camera moves, and confuse left from right. Humans, on the other hand, make typos, describe events out of order, and lack the vocabulary for cinematography. CHAI combines their strengths: models write, humans critique.

1

Precise Specification

A structured video language spanning subjects, scenes, motion, spatial framing, and camera dynamics — grounded by 200+ visual primitives co-designed with professional cinematographers.

2

Scalable Oversight

LLMs write more fluently than most humans, but hallucinate what they see. CHAI lets AI draft captions and humans critique — shifting effort from generation to verification.

3

Post-Training

Our Qwen3-VL-8B surpasses Gemini-3.1 and GPT-5. Human critiques turn AI drafts into accurate captions, yielding signals for SFT, DPO, reward modeling, and inference-time scaling.

4

Better Generation

We fine-tune Wan to follow our detailed cinematic prompts of up to 400 words — precise control over dolly zoom, rack focus, speed ramps, Dutch angles, POVs, and more.

CHAI Recipe Overview

Figure 1. Our recipe for precise video language. Red (top): prior work lacks specification and oversight, leading to imprecise terminology, hallucinations, and poor writing. Blue (bottom): CHAI combines structured specification, critique-based oversight, and post-training — which in turn unlock professional-quality video generation.

Teaching AI the shared language of cinema

Without a clear specification, video captions use imprecise terms (confusing dolly-in with zoom-in), miss key details (camera shake, focus shifts), and inject subjective language ("inspiring atmosphere") instead of describing what's visible. Filmmakers don't have this problem: they use precise terms like rack focus, Dutch angle, and medium full shot to coordinate on set. We formalized this shared vocabulary.

Specification comparison

Figure 2. Prior datasets (red) suffer from imprecise terminology, missing information, and subjective descriptions. Our specification (blue) is co-designed with 100+ professional creators over a year-long collaboration.

A structured specification covering 5 aspects:
🧑 Subject · 🏞️ Scene · 🏃 Motion · 📐 Spatial · 🎥 Camera
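
To make the five-aspect structure concrete, here is a minimal sketch of what a structured caption record could look like in code. The field names and example values are illustrative assumptions, not the released CHAI schema.

```python
from dataclasses import dataclass

# Hypothetical record for a caption under the five-aspect specification.
# Field names are illustrative, not the released CHAI schema.
@dataclass
class StructuredCaption:
    subject: str   # who or what is in frame
    scene: str     # setting, time of day, lighting
    motion: str    # subject and object movement over time
    spatial: str   # framing and composition, e.g. "medium full shot"
    camera: str    # camera work, e.g. "slow dolly-in with a rack focus"

    def to_prompt(self) -> str:
        """Join the five aspects into a single flat caption string."""
        return " ".join([self.subject, self.scene, self.motion,
                         self.spatial, self.camera])

# Example instance (invented for illustration):
cap = StructuredCaption(
    subject="a chef in a white apron",
    scene="in a dim commercial kitchen at night",
    motion="chops vegetables in quick, even strokes",
    spatial="medium full shot, subject left of center",
    camera="slow dolly-in with a shallow rack focus",
)
```

Each aspect is further decomposed into sub-aspects grounded by the 200+ visual primitives (see the taxonomy below); this sketch only shows the top level.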

CHAI taxonomy

Figure 3. The full taxonomy — each of the 5 aspects is decomposed into sub-aspects, grounded by 200+ visual and motion primitives.

AI writes. Humans critique.

Specification tells you what to describe. But who does the writing? Humans write captions with typos, grammar errors, and events out of order. Models write fluently but hallucinate objects and motion that don't exist. Both confuse left vs. right. CHAI's insight: LLMs are already better writers than most humans, but humans are better at spotting visual errors — so let each do what they're best at.

CHAI Oversight Framework

Figure 4. The CHAI framework. 🤖 AI writes comprehensive pre-captions. 👤 Human experts critique what's wrong and how to fix it, guiding AI to produce accurate post-captions. ✅ Peer-review bonuses reward annotators for precision and reviewers for catching errors.
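
The draft-critique-revise loop in Figure 4 can be sketched as follows. This is a schematic only: `vlm_caption`, `human_critique`, and `vlm_revise` are placeholders for the model and annotator interfaces, which are not specified here.

```python
# Schematic of the CHAI oversight loop: the model drafts, humans critique,
# the model revises. All three callables are hypothetical placeholders.
def chai_round(video, vlm_caption, human_critique, vlm_revise, max_rounds=3):
    caption = vlm_caption(video)                        # AI drafts a pre-caption
    for _ in range(max_rounds):
        critique = human_critique(video, caption)       # human lists errors + fixes
        if not critique:                                # no remaining errors
            break
        caption = vlm_revise(video, caption, critique)  # AI applies the fixes
    return caption                                      # the post-caption
```

The key design choice is that human effort goes into verification (spotting visual errors) rather than generation (writing fluent prose), which is what the peer-review bonuses reward.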

An 8B model beats GPT-5 and Gemini-3.1

CHAI produces (pre-caption, critique, post-caption) triplets — unlocking three capabilities at once: captioning, reward modeling, and critique generation. The catch? Critique quality is the bottleneck.
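
One triplet can seed all three training signals at once. The sketch below shows one plausible way to derive them; the field names are our own illustration, not the paper's data format.

```python
# Hypothetical mapping from one CHAI triplet to three kinds of training
# examples. Keys and structure are illustrative, not the released format.
def triplet_to_examples(video, pre, critique, post):
    return {
        # SFT: teach the model to write the corrected caption directly.
        "caption_sft": {"input": video, "target": post},
        # DPO: the post-caption is preferred over the flawed pre-caption.
        "dpo_pair": {"prompt": video, "chosen": post, "rejected": pre},
        # Critique generation: teach the model to spot and explain errors.
        "critique_sft": {"input": (video, pre), "target": critique},
    }
```

The DPO pairs double as reward-model training data, since each pair labels which of two captions for the same video is more accurate.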

17.8
Caption BLEU-4
vs. Gemini-3.1 (5.1), GPT-5 (5.5)
86.8
Reward Accuracy
vs. Gemini-2.5 (62.0), GPT-5 (59.5)
27.5
Critique BLEU-4
vs. Gemini-3.1 (3.3), GPT-5 (2.8)
8B
Qwen3-VL parameters
SFT + DPO on CHAI data

Critique quality matters

Figure 5. A good critique is accurate, complete, and constructive. CHAI enforces all three properties by requiring critiques to directly guide model revision.

| Critique Type | Prec. | Rec. | Constr. | Caption (BLEU-4) | Reward (Acc.) |
|---|---|---|---|---|---|
| Blind Gemini-2.5 | – | – | – | 10.2 | 43.0 |
| Gemini-2.5 (w/ video) | – | – | – | 11.9 | 59.9 |
| Inaccurate | ✗ | ✓ | ✓ | 11.3 | 45.5 |
| Incomplete | ✓ | ✗ | ✓ | 11.7 | 54.7 |
| Non-constructive | ✓ | ✓ | ✗ | 12.5 | 65.0 |
| Ours (w/o QC) | ✓ | ✓ | ✓ | 13.8 | 70.7 |
| Ours (w/ QC) | ✓ | ✓ | ✓ | 17.0 | 86.8 |

Weakening any single quality dimension degrades post-training. Prior work (OpenAI GDC, MM-RLHF) collects 50%+ non-constructive critiques — just "this is wrong" without explaining the fix.

Key Findings

  • Current VLMs handle subject & scene well but struggle with motion & camera aspects
  • Explicit preference + critique supervision improves SFT and RL — Qwen3-VL-8B outperforms Gemini-3.1-Pro
  • Critique quality (precision, recall, constructiveness) is the bottleneck for post-training success
  • Inference-time scaling via best-of-N with reward models yields further gains — no extra human labels
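
The last point (best-of-N with a reward model) reduces to a few lines. `generate` and `reward` below stand in for the post-trained captioner and reward model; both are placeholders, not released APIs.

```python
# Sketch of inference-time scaling via best-of-N: sample N candidate
# captions and keep the one the reward model scores highest.
# `generate` and `reward` are hypothetical model interfaces.
def best_of_n(video, generate, reward, n=8):
    candidates = [generate(video) for _ in range(n)]
    return max(candidates, key=lambda c: reward(video, c))
```

Because the reward model was trained on existing (pre, post) preference pairs, this selection step needs no additional human labels.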
Better understanding → better generation

We re-caption large-scale professional videos (films, commercials, music videos, games) with our post-trained model and fine-tune Wan2.2 to follow detailed prompts of up to 400 words — achieving precise control over techniques current generators struggle with.

Video generation example 1
Video generation example 2

Figure 6. After fine-tuning on re-captioned data, Wan2.2 follows detailed prompts more faithfully, with finer control over camera motion, cinematography, and visual composition.

Part of a broader effort

CHAI is one pillar of our research program on precise video language for professional video understanding and generation.

🚀 We are actively advancing CHAI with larger-scale datasets and stronger video understanding models.

We welcome collaboration and funding inquiries from researchers and practitioners in video understanding, captioning, and multimodal agents for professional-level video content.

If you're interested in accessing improved data or models, please reach out at zhiqiulin98@gmail.com or cmitra@andrew.cmu.edu, or open a GitHub Issue.

Where prior captioning falls short

Before working with trained professionals, we tried crowdsourced workers with film knowledge; they still lacked the vocabulary for basic cinematography. We also evaluated 8 video–text datasets (2016–2025) and found recurring issues stemming from a lack of specification and oversight.

Crowdsourced Annotators: Seeing ≠ knowing how to describe

Crowdworkers still confuse dolly-in with zoom-in, call full shots "close-ups," and describe fisheye distortion as "circular buildings." These errors motivated our collaboration with professional creators.

Crowdsourced captioning errors

Figure 7. Crowdsourced annotators lack the visual vocabulary for common cinematic and motion effects.

Prior Datasets (2016–2025): Recurring issues from missing specification & oversight

Even recent benchmarks suffer from imprecise terminology, missing information, subjective descriptions (lack of specification), as well as poor writing, visual hallucinations, and inaccurate details (lack of oversight).

Errors from lack of specification

Figure 8. Errors caused by lack of specification: imprecise terminology, missing information, subjective descriptions.

Errors from lack of oversight

Figure 9. Errors caused by lack of oversight: poor writing, visual hallucinations, inaccurate details.

BibTeX
@inproceedings{lin2026chai,
  title={Building a Precise Video Language with Human-AI Oversight},
  author={Zhiqiu Lin and Chancharik Mitra and Siyuan Cen and Isaac Li
          and Yuhan Huang and Yu Tong Tiffany Ling and Hewei Wang
          and Irene Pi and Shihang Zhu and Ryan Rao and George Liu
          and Jiaxi Li and Ruojin Li and Yili Han and Yilun Du
          and Deva Ramanan},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}