We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
"We must perceive in order to move, but we must also move in order to perceive." (James J. Gibson)
Humans perceive the visual world through movement. Motion parallax enables precise depth perception essential for navigating the physical world. Similarly, camera motion is crucial for modern vision techniques that process videos of dynamic scenes. For example, Structure-from-Motion (SfM) and Simultaneous Localization and Mapping (SLAM) methods must first estimate camera motion (pose trajectory) to reconstruct the scenes in 4D. Likewise, without understanding camera motion, video-language models (VLMs) would not fully perceive, reason about, or generate video dynamics.
Understanding camera motion comes naturally to humans because we intuitively grasp the "invisible subject" -- the camera operator who shapes the video's viewpoint, framing, and narrative. For example, in a video tracking a child's first steps, one can feel a parent's joy and excitement through their handheld, shaky movement. Professional cinematographers and filmmakers use camera motion as a tool to enhance visual storytelling and amplify the emotional impact of their shots. Hitchcock's iconic dolly zoom moves the camera forward while zooming out, keeping the subject's framing constant while warping the background to create the impression of vertigo. In Jurassic Park (1993), Spielberg uses a slow upward tilt and rightward pan to evoke a sense of awe as the protagonists (and the audience) first see the dinosaurs. In Inception (2010), Nolan uses a camera roll to mirror shifting gravity, blurring the line between dream and reality. Similarly, game developers use camera movement to enhance player immersion. In The Legend of Zelda: Breath of the Wild (2017), a smooth pedestal-up shot transitions from the character's viewpoint to a breathtaking aerial view, hinting at the journey ahead. Even amateur photographers use camera motion as a tool; for example, selfie videos allow one to play the role of both the cinematographer and the subject.
Understanding camera motion requires capturing both its geometry (e.g., trajectory) and semantics (e.g., shot intent and filming context). To enable human-like perception of camera motion through data-driven approaches, we collaborate with vision researchers and professional cinematographers to develop a precise taxonomy of camera motion primitives in an iterative process – sourcing diverse internet videos, annotating them, and refining the taxonomy to address missing labels and improve consensus.
Our final taxonomy includes three reference frames (object-, ground-, and camera-centric) and defines key motion types, including translation (e.g., upward), rotation (e.g., roll clockwise), intrinsic changes (e.g., zoom-in), circular motion (e.g., arcing), steadiness (e.g., shaky), and tracking shots (e.g., side-tracking):
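To make the label space concrete, here is a minimal Python sketch of how the taxonomy's primitives could be organized; the enum names and the exact primitive lists are illustrative assumptions rather than the dataset's released schema.

from enum import Enum


class ReferenceFrame(Enum):
    """Reference frames against which motion is described."""
    OBJECT_CENTRIC = "object-centric"
    GROUND_CENTRIC = "ground-centric"
    CAMERA_CENTRIC = "camera-centric"


class MotionFamily(Enum):
    """Top-level families of camera-motion primitives."""
    TRANSLATION = "translation"      # e.g., upward, forward
    ROTATION = "rotation"            # e.g., pan left, roll clockwise
    INTRINSIC_CHANGE = "intrinsic"   # e.g., zoom-in (focal-length change)
    CIRCULAR = "circular"            # e.g., arcing around a subject
    STEADINESS = "steadiness"        # e.g., smooth vs. shaky
    TRACKING = "tracking"            # e.g., side-tracking a moving subject


# Illustrative (partial) primitive lists per family.
PRIMITIVES = {
    MotionFamily.TRANSLATION: ["forward", "backward", "upward", "downward", "leftward", "rightward"],
    MotionFamily.ROTATION: ["pan-left", "pan-right", "tilt-up", "tilt-down", "roll-clockwise", "roll-counterclockwise"],
    MotionFamily.INTRINSIC_CHANGE: ["zoom-in", "zoom-out"],
    MotionFamily.CIRCULAR: ["arc-left", "arc-right"],
    MotionFamily.STEADINESS: ["static", "smooth", "shaky"],
    MotionFamily.TRACKING: ["lead-tracking", "follow-tracking", "side-tracking"],
}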
To precisely annotate complex camera motion in real-world videos, we refine our taxonomy iteratively with feedback from film industry experts over months. We also design a hybrid label-then-caption framework. Annotators first determine whether the motion is clear and consistent. For clear motion, they directly classify each motion primitive (e.g., pan-left, no-tilt). For ambiguous or conflicting motion, they label only the aspects they are confident about, leave the rest as "I am not sure," and then provide a natural language description (e.g., "The camera first pans left, then right" or "The background is too dark to perceive any movement"). They are also encouraged to describe why the camera moves, e.g., to follow a subject or enhance immersion.

To ensure quality, we conduct a large-scale human study with over 100 participants from diverse backgrounds, finding that professional cinematographers consistently outperform non-experts. This inspires us to implement detailed guidelines and a multi-stage training program to improve the accuracy of both novice and expert annotators.
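As an illustration of what the label-then-caption workflow produces, a hypothetical annotation record is sketched below; the field names, the UNSURE sentinel, and the example values are assumptions for exposition, not the released annotation format.

from dataclasses import dataclass, field
from typing import Dict, Optional

# Sentinel for the "I am not sure" option in the label-then-caption workflow.
UNSURE = None


@dataclass
class CameraMotionAnnotation:
    """One annotator's record for a single video clip (illustrative schema)."""
    video_id: str
    # Per-primitive binary labels: True / False, or UNSURE (None) when the
    # annotator is not confident about that aspect of the motion.
    labels: Dict[str, Optional[bool]] = field(default_factory=dict)
    # Free-form caption used for ambiguous or conflicting motion and for the
    # intent behind the movement (e.g., following a subject).
    caption: str = ""


# Example: a clear pan-left with no tilt; leftward translation left as unsure.
example = CameraMotionAnnotation(
    video_id="clip_0001",
    labels={"pan-left": True, "tilt-up": False, "move-left": UNSURE},
    caption="The camera pans left to follow the cyclist; any sideways "
            "translation is hard to judge against the dark background.",
)
print(example.labels["pan-left"])  # True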
We introduce CameraBench, a large-scale dataset with over 150K binary labels and captions across ~3,000 videos spanning diverse types, genres, POVs, capture devices, and post-production effects (e.g., nature, films, games, 2D/3D, real/synthetic, GoPro, drone shots, etc.). We showcase example annotations below:
These annotations allow us to evaluate and improve the performance of SfM methods and VLMs on a wide range of tasks (video-text retrieval, video captioning, video QA, etc.) that require both geometric and semantic understanding of camera motion. We show example video QA tasks below:
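As a concrete, hedged example of how such binary primitive labels could back a QA-style evaluation, the sketch below scores yes/no answers per primitive; the data layout and the plain accuracy metric are illustrative choices, not necessarily the benchmark's official protocol.

from typing import Dict, List


def per_primitive_accuracy(
    gold: List[Dict[str, bool]],  # per video: primitive -> ground-truth label
    pred: List[Dict[str, bool]],  # per video: answer to "does the camera <primitive>?"
) -> Dict[str, float]:
    """Accuracy of yes/no answers for each motion primitive."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for g, p in zip(gold, pred):
        for primitive, label in g.items():
            if primitive not in p:  # skip questions the model did not answer
                continue
            total[primitive] = total.get(primitive, 0) + 1
            correct[primitive] = correct.get(primitive, 0) + int(p[primitive] == label)
    return {k: correct.get(k, 0) / total[k] for k in total}


# Toy example: two videos, two primitives.
gold = [{"pan-left": True, "zoom-in": False}, {"pan-left": False, "zoom-in": True}]
pred = [{"pan-left": True, "zoom-in": True}, {"pan-left": False, "zoom-in": True}]
print(per_primitive_accuracy(gold, pred))  # {'pan-left': 1.0, 'zoom-in': 0.5}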
We highlight the following key findings. For example, SfM methods such as MegaSAM and COLMAP can fail on tracking shots and low-parallax scenes, as illustrated below:
Left: a lead-tracking shot where the camera moves backward as the subject walks forward. Because the subject's framing stays unchanged and the background lacks distinct textures, MegaSAM fails to detect the camera translation and COLMAP crashes. Right: a roll-clockwise shot in a low-parallax scene where both MegaSAM and COLMAP fail to converge and output random trajectories with nonexistent motion.
Lastly, we show that Qwen-2.5-VL fine-tuned on our high-quality dataset generates more accurate camera motion captions than state-of-the-art generative VLMs such as GPT-4o and Gemini-2.5-Pro.
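For readers who want to try this kind of captioning themselves, below is a minimal inference sketch following the publicly documented Qwen2.5-VL usage with Hugging Face transformers and qwen-vl-utils; the checkpoint name, video path, and prompt are placeholders, and this is not the fine-tuned model or the evaluation code from CameraBench.

# Sketch: caption the camera motion of one clip with a Qwen2.5-VL checkpoint.
# Assumes `pip install transformers accelerate qwen-vl-utils` and a local video file.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

CKPT = "Qwen/Qwen2.5-VL-7B-Instruct"  # placeholder; swap in a fine-tuned checkpoint
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    CKPT, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(CKPT)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4"},
        {"type": "text", "text": "Describe the camera motion in this video."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])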
@article{camerabench,
  title={CameraBench: Towards Understanding Camera Motions in Any Video},
  author={Lin, Zhiqiu and Cen, Siyuan and Jiang, Daniel and Karhade, Jay and Wang, Hewei and Mitra, Chancharik and Ling, Yu Tong Tiffany and Huang, Yuhan and Liu, Sifan and Chen, Mingyu and Zawar, Rushikesh and Bai, Xue and Du, Yilun and Gan, Chuang and Ramanan, Deva},
  journal={arXiv preprint arXiv:2504.15376},
  year={2025}
}