CameraBench: Towards Understanding Camera Motions in Any Video

1CMU, 2UMass Amherst, 3USC, 4Emerson, 5Adobe, 6Harvard, 7MIT-IBM

Abstract

We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.

Why CameraBench?

We must perceive in order to move, but we must also move in order to perceive. (J.J. Gibson)

Humans perceive the visual world through movement. Motion parallax enables precise depth perception essential for navigating the physical world. Similarly, camera motion is crucial for modern vision techniques that process videos of dynamic scenes. For example, Structure-from-Motion (SfM) and Simultaneous Localization and Mapping (SLAM) methods must first estimate camera motion (pose trajectory) to reconstruct the scenes in 4D. Likewise, without understanding camera motion, video-language models (VLMs) would not fully perceive, reason about, or generate video dynamics.
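To make the intrinsics-versus-extrinsics distinction concrete (e.g., zoom-in versus translating forward, which the abstract notes novices often confuse), here is a minimal pinhole-projection sketch. It is illustrative only; the focal lengths and scene points are made up and are not taken from CameraBench.

import numpy as np

def project(points, f, cam_z=0.0):
    """Pinhole projection of 3D points (N, 3) with focal length f,
    for a camera at (0, 0, cam_z) looking down +Z."""
    rel = points - np.array([0.0, 0.0, cam_z])
    return f * rel[:, :2] / rel[:, 2:3]

# A near subject and a far background point (made-up coordinates, in meters).
scene = np.array([[0.5, 0.0, 4.0],    # subject
                  [0.5, 0.0, 40.0]])  # background

base = project(scene, f=1.0)
zoom = project(scene, f=2.0)              # zoom-in: intrinsics change, camera stays put
dolly = project(scene, f=1.0, cam_z=2.0)  # translate forward: extrinsics change, f fixed

# Zooming magnifies subject and background equally; moving forward magnifies the
# near subject far more than the background -- the parallax cue that tells them apart.
print("zoom magnification  (subject, background):", zoom[:, 0] / base[:, 0])
print("dolly magnification (subject, background):", dolly[:, 0] / base[:, 0])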

Understanding camera motion comes naturally to humans because we intuitively grasp the "invisible subject" -- the camera operator who shapes the video's viewpoint, framing, and narrative. For example, in a video tracking a child's first steps, one can feel a parent's joy and excitement through their handheld, shaky movement. Professional cinematographers and filmmakers use camera motion as a tool to enhance visual storytelling and amplify the emotional impact of their shots. Hitchcock's iconic dolly zoom moves the camera forward while zooming out, maintaining the subject's framing while altering the background to create the impression of vertigo. In Jurassic Park (1993), Spielberg uses a slow upward tilt and rightward pan to evoke a sense of awe as the protagonists (and the audience) first see the dinosaurs. In Inception (2010), Nolan uses a camera roll to mirror shifting gravity, blurring the line between dream and reality. Similarly, game developers use camera movement to enhance player immersion. In The Legend of Zelda: Breath of the Wild (2017), a smooth pedestal-up shot transitions from the character's viewpoint to a breathtaking aerial view, hinting at the journey ahead. Even amateur photographers use camera motion as a tool; for example, selfie videos allow one to play the role of both the cinematographer and the subject.

Examples of camera movements

A Taxonomy of Camera Motion Primitives


Understanding camera motion requires capturing both its geometry (e.g., trajectory) and semantics (e.g., shot intent and filming context). To enable human-like perception of camera motion through data-driven approaches, we collaborate with vision researchers and professional cinematographers to develop a precise taxonomy of camera motion primitives in an iterative process – sourcing diverse internet videos, annotating them, and refining the taxonomy to address missing labels and improve consensus.

Our final taxonomy includes three reference frames (object-, ground-, and camera-centric) and defines key motion types, including translation (e.g., upward), rotation (e.g., roll clockwise), intrinsic changes (e.g., zoom-in), circular motion (e.g., arcing), steadiness (e.g., shaky), and tracking shots (e.g., side-tracking):

Taxonomy of camera motion primitives
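As a rough illustration of how such a taxonomy might be encoded for annotation or evaluation tooling, here is a hypothetical Python sketch. The label names below are only the examples mentioned on this page; the actual CameraBench taxonomy is richer, and its exact label set is not reproduced here.

from enum import Enum

class ReferenceFrame(Enum):
    OBJECT_CENTRIC = "object"
    GROUND_CENTRIC = "ground"
    CAMERA_CENTRIC = "camera"

# One binary question per motion primitive, grouped by motion type
# (illustrative names only, not the benchmark's full label set).
PRIMITIVES = {
    "translation":      ["move-up", "move-down", "move-forward", "move-backward"],
    "rotation":         ["pan-left", "pan-right", "tilt-up", "tilt-down",
                         "roll-clockwise", "roll-counterclockwise"],
    "intrinsic-change": ["zoom-in", "zoom-out"],
    "circular-motion":  ["arc-left", "arc-right"],
    "steadiness":       ["steady", "shaky"],
    "tracking":         ["side-tracking", "lead-tracking", "tail-tracking"],
}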

Scaling High-Quality Human Annotations

To precisely annotate complex camera motion in real-world videos, we refine our taxonomy iteratively with feedback from film industry experts over several months. We also design a hybrid label-then-caption framework. Annotators first determine whether the motion is clear and consistent. For clear motion, they directly classify each motion primitive (e.g., pan-left, no-tilt). For ambiguous or conflicting motion, they label only the aspects they are confident about, mark the rest as "I am not sure," and then add a natural language description (e.g., "The camera first pans left, then right" or "The background is too dark to perceive any movement"). They are also encouraged to describe why the camera moves (e.g., to follow a subject or enhance immersion). To ensure quality, we conduct a large-scale human study with over 100 participants from diverse backgrounds, finding that professional cinematographers consistently outperform non-experts. This inspires us to implement detailed guidelines and a multi-stage training program to improve the accuracy of both novice and expert annotators.

Annotation pipeline
Training results
Our training program improves the accuracy of both expert and non-expert annotators by 10–15%.
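A minimal, hypothetical schema for the label-then-caption protocol described above might look as follows; the field names and values are illustrative, not the dataset's actual storage format.

from dataclasses import dataclass, field
from typing import Dict, Literal

Label = Literal["yes", "no", "unsure"]  # "unsure" corresponds to "I am not sure"

@dataclass
class CameraMotionAnnotation:
    video_id: str
    motion_is_clear: bool                 # clear/consistent vs. ambiguous/conflicting motion
    labels: Dict[str, Label] = field(default_factory=dict)  # per-primitive binary labels
    caption: str = ""                     # free-form description, incl. why the camera moves

example = CameraMotionAnnotation(
    video_id="example_0001",
    motion_is_clear=False,
    labels={"pan-left": "yes", "pan-right": "yes", "zoom-in": "unsure"},
    caption="The camera first pans left, then right, to follow the subject.",
)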

CameraBench

We introduce CameraBench, a large-scale dataset with over 150K binary labels and natural-language captions across ~3,000 videos spanning diverse types, genres, POVs, capture devices, and post-production effects (e.g., nature, films, games, 2D/3D, real/synthetic, GoPro, drone shots, etc.). We showcase example annotations below:

Image illustrating Tags

These annotations allow us to evaluate and improve the performance of SfMs and VLMs on a wide range of tasks (video-text retrieval, video captioning, video QA, etc.) that require both geometric and semantic understanding of camera motion. We show example video QA tasks below:

Image illustrating Video QA 1
Image illustrating Video QA 2
Image illustrating Video QA 3
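The mapping from expert labels to these tasks is mechanical. As a hypothetical illustration (the templates and label names are ours, not the benchmark's), a single binary primitive label can back both a video QA pair and a positive/negative caption pair for retrieval:

def to_eval_items(video_id: str, primitive: str, present: bool):
    motion = primitive.replace("-", " ")
    question = {  # video QA: binary question with ground-truth answer
        "video": video_id,
        "question": f"Does the camera {motion} in this video?",
        "answer": "Yes" if present else "No",
    }
    retrieval = {  # video-text retrieval: matching caption vs. hard negative
        "video": video_id,
        "positive": f"The camera {'does' if present else 'does not'} {motion}.",
        "negative": f"The camera {'does not' if present else 'does'} {motion}.",
    }
    return question, retrieval

qa, ret = to_eval_items("example_0001", "pan-left", present=True)
print(qa)   # {'video': 'example_0001', 'question': 'Does the camera pan left in this video?', 'answer': 'Yes'}
print(ret)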

SfMs vs. VLMs on CameraBench

We highlight the following key findings:

  • Recent learning-based SfM/SLAM methods like MegaSAM and CuT3R achieve superior performance across most motion primitives, significantly outperforming classic methods like COLMAP. Nonetheless, SfMs are still far from solving this task. We show failure cases of SfM methods below:
  • Image illustrating Performance of SfMs and VLMs

    Left: A lead-tracking shot where the camera moves backward as the subject walks forward. Due to unchanged subject framing and lack of distinct background textures, MegaSAM fails to detect camera translation and COLMAP crashes.

    Right: A roll-clockwise shot in a low-parallax scene where both MegaSAM and COLMAP fail to converge and output random trajectories with nonexistent motion.

  • Although generative VLMs (evaluated using VQAScore) are weaker than SfM/SLAM, they generally outperform discriminative VLMs that use CLIPScore/ITMScore. Furthermore, they are able to capture the semantic primitives that depend on scene content, while SfMs struggle to do so. Motivated by this, we apply supervised fine-tuning (SFT) to a generative VLM (Qwen2.5-VL) on a separately annotated training set of ~1,400 videos. We show that simple SFT on small-scale (yet high-quality) data significantly boosts performance by 1–2x, matching the state-of-the-art MegaSAM in overall AP (see the scoring sketch after this list).
  • Image illustrating Performance of SfMs and VLMs
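The comparison above can be cast as binary classification per (video, primitive) pair: each model assigns a confidence that the primitive is present (for generative VLMs, a VQAScore-style probability of answering "Yes"), and average precision is computed per primitive against the expert labels. Below is a minimal sketch of that scoring step; the scores and labels are toy placeholders, not CameraBench results.

import numpy as np
from sklearn.metrics import average_precision_score

def per_primitive_ap(labels: dict, scores: dict) -> dict:
    """labels[p] and scores[p] are aligned arrays over the same videos:
    binary ground truth and model confidence for primitive p."""
    return {p: average_precision_score(labels[p], scores[p]) for p in labels}

# Toy example with two primitives over five videos.
labels = {"pan-left": np.array([1, 0, 1, 0, 0]),
          "zoom-in":  np.array([0, 0, 1, 1, 0])}
scores = {"pan-left": np.array([0.9, 0.2, 0.7, 0.4, 0.1]),
          "zoom-in":  np.array([0.3, 0.1, 0.8, 0.6, 0.5])}

ap = per_primitive_ap(labels, scores)
print(ap, "mean AP:", sum(ap.values()) / len(ap))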

Motion Augmented Video Captioning

Lastly, we show that Qwen2.5-VL fine-tuned on our high-quality dataset generates more accurate camera motion captions than state-of-the-art generative VLMs such as GPT-4o and Gemini-2.5-Pro.

Video caption example 1
Video caption result 1
Video caption example 2
Video caption result 2
Video caption example 3
Video caption result 3
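For reference, the snippet below sketches how such captions can be generated with the off-the-shelf Qwen2.5-VL-Instruct checkpoint via Hugging Face transformers and qwen_vl_utils, following the model card's usage pattern; a fine-tuned CameraBench checkpoint would be loaded in place of the base model id, and the video path is a placeholder.

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # swap in fine-tuned weights here
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "path/to/video.mp4", "fps": 1.0},  # placeholder path
    {"type": "text", "text": "Describe the camera motion in this video."},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])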

BibTeX

@article{camerabench,
  title={CameraBench: Towards Understanding Camera Motions in Any Video},
  author={Lin, Zhiqiu and Cen, Siyuan and Jiang, Daniel and Karhade, Jay and Wang, Hewei and Mitra, Chancharik and Ling, Yu Tong Tiffany and Huang, Yuhan and Liu, Sifan and Chen, Mingyu and Zawar, Rushikesh and Bai, Xue and Du, Yilun and Gan, Chuang and Ramanan, Deva},
  journal={arXiv preprint arXiv:2504.15376},
  year={2025},
}