Preprint
Co-Pilot for Video Production
When people ask me if I went to film school, I tell them, ‘no, I went to films.’

Moodio: Making Anyone a
Professional Video Studio

A reference-first AI film studio. Discover, learn from, and reuse professional cinematic clips — all in one place. At its core: a cinematic video search engine that outperforms Gemini Embedding 2 and Qwen3-VL Embedding.

Ryan Rao1,*, Ruihuang Yang1,*, George Liu1,*, Chancharik Mitra1, Siyuan Cen1, Yuhan Huang1, Yu Tong Tiffany Ling1, Shihang Zhu1, Jiaxi Li1, Ruojin Li1, Hewei Wang1, Yili Han1, Yilun Du2, Graham Neubig1, Deva Ramanan1, Zhiqiu Lin1

1Carnegie Mellon University  ·  2Harvard University

*Equal contribution as co-first authors

We find that Moodio's retrieve-then-generate workflow yields more compelling final videos with fewer prompt iterations than text-only generation. Want to see for yourself?

A reference-first workflow for cinematic video creation

In filmmaking, visual references are everything. They spark inspiration before you create, they teach techniques by example, and over time they're how you build visual taste — a craft no amount of prompt engineering can substitute for.

But the tools that support this — Shotdeck, Vimeo, and the rest — are stuck on metadata, titles, and tags. Moodio rebuilds the cinematic search engine with modern vision-language models, then plugs it directly into the AI video production workflow through three agent-guided stages: Inspire → Learn → Generate.

Does it actually help?
In a controlled A/B study with 30 active creators, Moodio's retrieve-then-generate workflow produces more compelling final videos, incorporates more cinematic techniques, and cuts prompt iterations by 55%, which roughly halves total wall-clock time per finished video.

1 · Chat-to-Inspire

Natural-language search over fine-grained cinematic content — not just titles and tags, but camera movement, shot composition, focus shifts, and temporal dynamics.

2 · Chat-to-Learn

Chat with the agent to understand why a shot works. Inspect predicted cinematic tags and retrieve similar examples for further study.

3 · Chat-to-Generate

Transfer a reference's structure across all five aspects (subject, scene, motion, spatial layout, and camera) into a generation prompt for your own content.

Figure 1. The Moodio user journey. A creator without formal film training (Leo, an apparel designer) progresses through three reference-first stages: Chat-to-Inspire (browse cinematic references in natural language), Chat-to-Learn (understand why a reference works via predicted cinematic tags), and Chat-to-Generate (transfer its structure into the user's own generation).

Great videos come from great references — not great prompts

Before shooting, every filmmaker studies existing work: browsing Shotdeck, watching movies, building mood boards. This reference-first process is how visual taste develops, yet it's completely absent from AI video tools. Moodio brings it back. Search for "rack focus from foreground treat to background dog with speed ramp" and actually find it.

Figure 2. Left: the Chat-to-Inspire interface. Users issue natural-language queries using cinematic terminology ("camera starts focused on a person in the foreground, then rack focuses to someone in the background") or temporal dynamics ("shot starts indoor and transitions to outdoor"); Moodio surfaces grounded reference clips. Right: the 5-aspect caption schema powering retrieval, grounded by the CameraBench-Pro skill taxonomy.

Moodio's structured retrieval speaks the language of cinema across 5 aspects:
🧑 Subject · 🏞️ Scene · 🏃 Motion · 📐 Spatial · 🎥 Camera

🧑 Subject: type, appearance, pose, relationships
🏞️ Scene: POV, overlays, setting, time of day
🏃 Motion: actions, interactions, group dynamics
📐 Spatial: shot size, position, depth, movement
🎥 Camera: height, angle, lens, focus, movement
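
The schema above maps naturally onto a simple per-video record. A minimal sketch in Python (the class name, field comments, and `aspects()` helper are illustrative assumptions, not Moodio's released format):

```python
from dataclasses import dataclass

@dataclass
class AspectCaption:
    """One video's 5-aspect caption (hypothetical record layout)."""
    subject: str   # type, appearance, pose, relationships
    scene: str     # POV, overlays, setting, time of day
    motion: str    # actions, interactions, group dynamics
    spatial: str   # shot size, position, depth, movement
    camera: str    # height, angle, lens, focus, movement

    def aspects(self):
        """Return the five caption strings in a fixed order for embedding."""
        return [self.subject, self.scene, self.motion, self.spatial, self.camera]
```

Keeping the five captions as separate fields, rather than one concatenated paragraph, is what lets each aspect be embedded and matched independently at retrieval time.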
Understand why a shot works

Found an inspiring reference? Now learn why it works. Moodio displays predicted cinematic tags for each video and lets you chat with the agent to understand how the shot is constructed — what camera movement, angle, focus, and framing choices make it effective. You can also ask "show me more videos with the same camera movement" and the agent retrieves similar examples for further inspiration.

Figure 3. Left (Watch-to-Learn): selecting a reference surfaces structured metadata across all 5 aspects — camera movement, steadiness, height, angle, focus, shot type, shot size, playback speed, POV — up to 225 cinematic tags per video. Right (Chat-to-Learn): a conversational agent explains why the shot works (here, Star-Lord's dolly zoom creating vertigo) and retrieves more shots with matching camera movement on demand.

Turn references into your own generations

Instead of writing cinematic instructions from scratch, select a reference video whose structure you want to reuse. Moodio transfers its subject, scene, motion, spatial layout, and camera behavior into a generation prompt for your own content — across twelve state-of-the-art generators including Seedance, Kling, Wan, Veo, and Sora.

Figure 4. Three reference-to-generation transfers. In each scenario the user selects a reference clip (Selected Video) and their own asset (User's Asset), then describes the edit conversationally. Moodio preserves the reference's cinematic structure — rolling top-down motion with rack focus (pet treat), fly-through light-ring transition (VR headset), water-surface tracking shot (phone ad) — while swapping in the user's subject.

What powers Moodio

Three contributions: a structured retrieval method that matches generative reranker quality at scale, the largest cinematic taxonomy to date (CameraBench-Pro), and a 50K-query benchmark covering single keywords through 400-word cinematic scripts in English and Chinese.

Method · Structured Retrieval Five aspects, encoded separately, max-pooled at query time

Each video is captioned along five aspects — subject, scene, motion, spatial, camera — and each caption is encoded into its own Qwen3-Embedding-8B vector. At query time we embed the user query once and score each video by the max cosine similarity across its five vectors, so the query aligns to whichever aspect it most directly targets. All vectors are pre-indexed offline.
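
The max-pooled scoring can be sketched in a few lines, assuming embeddings are already unit-normalized so dot products equal cosine similarities (the function names are ours, not the released code):

```python
import numpy as np

def max_pooled_scores(query_vec, video_vecs):
    """Score each video by the max cosine similarity over its 5 aspect vectors.

    query_vec:  (d,) unit-normalized query embedding
    video_vecs: (n_videos, 5, d) pre-indexed, unit-normalized aspect embeddings
    """
    sims = video_vecs @ query_vec   # (n_videos, 5) cosine similarities
    return sims.max(axis=1)         # align the query to its best-matching aspect

def top_k(query_vec, video_vecs, k=10):
    """Return indices of the k highest-scoring videos."""
    scores = max_pooled_scores(query_vec, video_vecs)
    return np.argsort(-scores)[:k]
```

Because the five aspect vectors per video are precomputed offline, a query costs one embedding call plus a single matrix product over the index, which is what keeps this approach feasible at million-video scale where a per-pair reranker is not.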

Strategy             | Caption Source | Caption Type       | Scalable | Time/Query | Hit@1 | Hit@5 | Hit@10 | MRR@10

Off-the-shelf models
Gemini Embedding 2   |       –        |         –
Qwen3-VL Embedding   |       –        |         –
Qwen3-VL Reranker    |       –        |         –
Qwen3 Embedding      |    Qwen3-VL    |   Single-Caption
Qwen3 Embedding      |    Qwen3-VL    | Structured-Caption

Using Qwen3-VL-SFT captions (ours)
Qwen3 Embedding      |  Qwen3-VL-SFT  |   Single-Caption
Qwen3 Embedding      |  Qwen3-VL-SFT  | Structured-Caption
Table 1. Retrieval on the Moodio-T2V benchmark. Our 5-aspect structured-caption retrieval (highlighted) outperforms embedding baselines (Gemini Embedding 2, Qwen3-VL Embedding) and matches Qwen3-VL Reranker quality orders of magnitude faster, since Reranker requires a forward pass per (video, query) pair and is infeasible at million-video scale. Numbers will be filled in upon release.

Data · CameraBench-Pro 225 cinematic primitives across 17 skill categories

The largest professionally annotated cinematic video understanding dataset to date — a 4× expansion of CameraBench (NeurIPS '25 Spotlight) from camera motion to the full language of cinema. Co-designed with 100+ professional filmmakers over a year.

225 cinematic primitives across 17 skill categories
1.5M+ professional annotations over 10K+ videos
100+ filmmakers in a year-long collaboration
50K benchmark queries (EN + ZH, keywords → scripts)

Figure 5. Excerpts from the CameraBench-Pro taxonomy. Left: camera-side primitives (lens distortion, Dutch angle, camera angle, camera height, focus & depth of field). Right: scene-side primitives (point of view, shot types, shot transitions, subject framing, playback speed, shot size, overlays).

Evaluation · Moodio-T2V Benchmark 50K bilingual queries, simulated from real creator search behavior

Prior text-to-video benchmarks (MSR-VTT, LSMDC, DiDeMo) use short, stylistically uniform English queries with one-to-one query-video mappings. Moodio-T2V instead spans single keywords through 400-word cinematic descriptions across four framings (concise / question / request / verbose) and two languages — with pooled relevance sets so multiple videos can satisfy the same intent.

Figure 6. The Moodio-T2V query simulator. Each held-out video is decomposed into a grounded 5-aspect caption, sampled into structured query contents, then rendered into eight surface forms (concise / question / request / verbose × EN / ZH) — yielding 50K diverse queries.
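
The framing-by-language cross-product behind the eight surface forms can be sketched as follows (the `surface_forms` helper and its record fields are hypothetical; the actual simulator renders full natural-language queries from each sampled content):

```python
from itertools import product

FRAMINGS = ["concise", "question", "request", "verbose"]
LANGS = ["en", "zh"]

def surface_forms(query_content):
    """Expand one sampled query content into 4 framings x 2 languages = 8 forms."""
    return [
        {"framing": framing, "lang": lang, "content": query_content}
        for framing, lang in product(FRAMINGS, LANGS)
    ]
```

Holding the underlying content fixed while varying only framing and language is what lets the benchmark measure retrieval robustness to surface variation, with pooled relevance sets shared across all eight renderings of the same intent.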

Part of a broader effort

Moodio is one pillar of our research program on precise video language for professional video understanding and generation. It builds on CameraBench and shares its annotation platform and post-training recipes with CHAI.

🚀 We are actively advancing Moodio with richer taxonomies, stronger video understanding models, and a growing corpus of professional references.

We welcome collaborations and funding opportunities with researchers and practitioners in video understanding, cinematic generation, and multimodal agents for professional-level video content.

If you're interested in accessing improved data or models — or in partnering on the Moodio platform — please reach out at zhiqiulin98@gmail.com, try the live demo at app.moodio.art, or open a GitHub Issue.

From a blank prompt to a cinematic promo

Two end-to-end reference-first journeys from real Moodio users — different domains, same workflow.

Example 1 · Museum exhibition A VR tour of ancient Babylon

Figure 7. A museum exhibition designer searches "entering a new world with VR glasses" to find references (Chat-to-Inspire), then asks Moodio what makes the selected clip feel cinematic — learning that a first-person POV, a dolly-forward through a tunnel, and heavy motion blur converge toward a vanishing point (Chat-to-Learn). Finally, he reuses that POV and warp-tunnel structure but rebuilds the tunnel walls from archaeological excavation layers, bursting into the Babylonian Empire at its peak (Chat-to-Generate).

Example 2 · Pet industry marketing A short product video for dog treats

Figure 8. A pet industry marketing specialist wants to create short product videos that make dog treats look irresistible but lacks a professional crew. (Chat-to-Inspire) The user searches for clips of pets going crazy for treats and selects a high-energy reference. (Chat-to-Learn) The user asks what cinematic effects are used; the agent explains the rack focus with shallow depth of field shifting from the treat in the foreground to the dog in the background, combined with a speed ramp from slow motion to normal speed as the dog catches the treat. (Chat-to-Generate) The user applies the same camera angle and effects to their own treat image, adding more dogs entering the frame and a camera roll for extra energy.

Open release Everything we built

We will release the full pipeline: CameraBench-Pro dataset, the Moodio-T2V benchmark, structured retrieval models, the annotation platform, aggregated anonymized user logs, and all evaluation code. Try the live system at app.moodio.art.

BibTeX
@misc{rao2026moodio,
  title={Moodio: Making Anyone a Professional Video Studio},
  author={Ryan Rao and Ruihuang Yang and George Liu
          and Chancharik Mitra and Siyuan Cen and Yuhan Huang
          and Shihang Zhu and Jiaxi Li and Ruojin Li
          and Hewei Wang and Yu Tong Tiffany Ling and Yili Han
          and Yilun Du and Graham Neubig and Deva Ramanan
          and Zhiqiu Lin},
  year={2026},
  note={Under review}
}
Author Contributions Who built what

Moodio is a deeply collaborative project. Below is a condensed contribution map; component leads reflect substantial sustained effort beyond the listed tasks. Many components also depend on the concurrent CHAI work that powers Moodio's data, captioning, and reward modeling. Ryan Rao*, Ruihuang Yang*, and George Liu* contributed equally as co-first authors.

Retrieval System
Five-aspect structured retrieval with max-cosine scoring, the user-query simulator and 50K bilingual benchmark, pooled relevance set construction, and the side-by-side user study against Shotdeck.
Leads: Ryan Rao*, George Liu*
Moodio Platform
End-to-end Next.js full-stack platform, GPT-5.3-backed agent orchestrating the three-stage workflow, twelve-model generation integration, asset management, video assembly, and A/B testing infrastructure.
Lead: Ruihuang Yang*  ·  Tag Calibration & Corpus: Ryan Rao*, Zhiqiu Lin, George Liu*
Taxonomy & Captioning Policy
225-primitive cinematic taxonomy across 17 skill categories, decision rules, and the five-aspect captioning framework (subject · scene · motion · spatial · camera) with 35 fine-grained subaspects.
Leads: Zhiqiu Lin, Yuhan Huang  ·  Contributors: Yu Tong Tiffany Ling, Siyuan Cen, Shihang Zhu, Yili Han
Annotation Operations
Labeling/captioning platform engineering, annotator recruitment and training (600+ applicants, six-round qualification exam), and execution of 1.5M+ primitive labels and ~20K caption triplets.
Lead: Zhiqiu Lin  ·  QC Co-Leads: Yuhan Huang, Shihang Zhu  ·  Core Annotators: Hewei Wang, Jiaxi Li, Ruojin Li, Isaac Li, Irene Pi
Video Understanding Models
Primitive classifier (Qwen3-VL-8B fine-tuned on 1.2M VQA samples), structured caption generator (SFT/DPO/RLHF-V/GSPO), critique-quality ablations, reward modeling, and inference-time scaling.
Leads: Chancharik Mitra, George Liu*
Reference-Guided Generation
Re-captioning and Wan2.2-T2V-14B LoRA fine-tuning that supports 400-word professional prompts, with human Likert evaluation showing substantial gains over base Wan and zero-shot-caption baselines.
Lead: Siyuan Cen  ·  Contributor: Zhiqiu Lin
User Research
Interviews with active Moodio users and Shotdeck professionals; design of the 5-user side-by-side study; qualitative feedback on long-form queries, motion search, and cinematic technique retrieval.
Lead: Yili Han  ·  Contributors: Yu Tong Tiffany Ling, Zhiqiu Lin
Writing & Figures
Drafting of the Moodio and CHAI papers, appendices, taxonomy definitions, and annotation tutorials; design of all teaser figures, workflow diagrams, taxonomy visualizations, and UI screenshots.
Writing Lead: Zhiqiu Lin  ·  Figures Lead: Yu Tong Tiffany Ling  ·  Contributors: All team members
Research Vision
Reframing AI video creation from prompt engineering to learning by watching; grounding professional video understanding in a precise shared cinematic language.
Driver: Zhiqiu Lin  ·  Primary Advisor: Deva Ramanan  ·  Additional Advisors: Graham Neubig, Yilun Du