Improving VideoLLMs' Camera Motion Understanding with 3D Geometric Information

Geometry-Guided Camera Motion Understanding in VideoLLMs

Mar 13, 2026 • Haoan Feng, Sri Harsha Musunuri, Guan-Ming Su

TL;DR Highlight

A training-free pipeline that addresses VideoLLMs' poor recognition of camera motions such as pan, tilt, and dolly by injecting camera cues extracted from a 3D geometry model into the prompt.

Who Should Read

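ML engineers building AI for video analysis or for film and media content understanding. Developers who want to use a VideoLLM to automatically analyze shooting techniques or camera work.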

Core Mechanics

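Most current open-source VideoLLMs, such as Qwen2.5-VL and InternVL, recognize camera motion (pan, tilt, dolly, etc.) at roughly random-guess accuracy (25%).
An analysis of the internals of Qwen2.5-VL's ViT (its image encoder) shows that camera motion information is present in shallow layers but fades in deeper layers: as the encoder is optimized for semantic alignment, geometric information gets diluted.
Camera tokens are extracted with VGGT (a 1.2B-parameter 3D geometry transformer), a lightweight Transformer classifier predicts the camera motion, and the prediction is injected into the VideoLLM as a structured prompt, with no additional fine-tuning.
Newly released: CameraMotionDataset, built from Unreal Engine 5 synthetic data (12,274 one-second segments with 15 atomic motion labels), and the VQA benchmark CameraMotionVQA.
Distilling VGGT into a lightweight Q-Former lowers accuracy by about 8% but raises throughput 5.3x and cuts peak memory to about 39% of the original, giving a usable trade-off between efficiency and accuracy.
Prepending camera motion labels to the prompt (a per-second motion header) leads the VideoLLM to produce cinematographer-style descriptions that clearly state direction and temporal structure.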
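
Because the dataset is rendered with explicit camera control, atomic labels can in principle be derived from the known camera poses. The sketch below shows one way such a pose-to-label rule might look for a single one-second segment. It is not the paper's labeling code: the `SegmentDelta` type, the axis and sign conventions, the thresholds, and label names that do not appear elsewhere in this summary (such as `roll-cw` or `crane-up`) are all assumptions, and arc motions are left out because they need more than a single pose delta.

```python
from dataclasses import dataclass


# Hypothetical per-segment camera delta in the camera's own frame.
# Assumed conventions: +dx right, +dy up, +dz forward;
# +yaw turns right, +pitch tilts up, +roll rotates clockwise.
@dataclass
class SegmentDelta:
    dx: float = 0.0     # meters
    dy: float = 0.0
    dz: float = 0.0
    yaw: float = 0.0    # degrees
    pitch: float = 0.0
    roll: float = 0.0


TRANSLATION_AXES = {"dx", "dy", "dz"}

# (attribute, label if positive, label if negative). Label names are assumed.
RULES = [
    ("yaw", "pan-right", "pan-left"),
    ("pitch", "tilt-up", "tilt-down"),
    ("roll", "roll-cw", "roll-ccw"),
    ("dz", "dolly-in", "dolly-out"),
    ("dx", "truck-right", "truck-left"),
    ("dy", "crane-up", "crane-down"),
]


def pose_to_labels(delta, trans_eps=0.05, rot_eps=1.0):
    """Map one segment's pose change to atomic labels (illustrative thresholds)."""
    labels = []
    for attr, positive, negative in RULES:
        value = getattr(delta, attr)
        eps = trans_eps if attr in TRANSLATION_AXES else rot_eps
        if value > eps:
            labels.append(positive)
        elif value < -eps:
            labels.append(negative)
    # No axis exceeded its threshold: treat the segment as static.
    return labels or ["static"]


print(pose_to_labels(SegmentDelta(yaw=-5.0, pitch=3.0)))
print(pose_to_labels(SegmentDelta()))
```

Each axis can emit at most one direction, so a rule of this shape never produces two opposing labels for the same segment.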

Evidence

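VGGT + constraint-based classifier: Instance Accuracy 0.738, Macro-F1 0.87, Weighted-F1 0.92, whereas most off-the-shelf VideoLLMs sit near random (25%).
VGGT-Q-Former distilled model: 8.72M parameters; compared with VGGT (1.2B), throughput rises from 4.39 to 23.36 samples/s (5.3x) and peak memory drops from 23,649 MB to 9,202 MB (about 39%).
Pose-to-label automatic labeling: 93% accuracy under human verification (on a sample of 720 segments).
Q-Former probing experiment: camera information is most recoverable from the 7th block (a shallow layer) of Qwen2.5-VL, and performance declines at greater depth.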
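
To make the three classifier metrics concrete, here is a small sketch of how they could be computed for multi-label predictions. This summary does not define Instance Accuracy; the code assumes it means an exact match between the predicted and true label sets, which would explain why it comes out lower than the F1 scores. The toy labels are made up for illustration.

```python
def instance_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label set matches exactly (assumed definition)."""
    hits = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def label_f1(y_true, y_pred, label):
    """F1 for a single label, treating it as a binary present/absent task."""
    tp = sum(label in t and label in p for t, p in zip(y_true, y_pred))
    fp = sum(label not in t and label in p for t, p in zip(y_true, y_pred))
    fn = sum(label in t and label not in p for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_weighted_f1(y_true, y_pred):
    """Macro-F1 averages labels equally; Weighted-F1 weights each label by its support."""
    labels = sorted({label for t in y_true for label in t})
    f1 = {label: label_f1(y_true, y_pred, label) for label in labels}
    support = {label: sum(label in t for t in y_true) for label in labels}
    macro = sum(f1.values()) / len(labels)
    weighted = sum(f1[l] * support[l] for l in labels) / sum(support.values())
    return macro, weighted


# Toy data: the second sample misses the rarer "tilt-up" label.
y_true = [["pan-left"], ["pan-left", "tilt-up"], ["static"], ["pan-left"]]
y_pred = [["pan-left"], ["pan-left"], ["static"], ["pan-left"]]

print(instance_accuracy(y_true, y_pred))
print(macro_and_weighted_f1(y_true, y_pred))
```

In the toy data the only error is on a rare label, so Weighted-F1 ends up above Macro-F1, the same ordering as the reported 0.92 versus 0.87.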

How to Apply

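When generating video descriptions with a VideoLLM, extract per-second camera motion with VGGT beforehand and prepend it to the prompt in the form 'Per-second camera motion: [pan-left, static, tilt-up, ...]'; this improves camera awareness without fine-tuning.
In an automatic metadata pipeline for film or video content, applying the paper's taxonomy of 15 atomic camera motions (pan/tilt/dolly/truck/crane/roll/arc/static) together with its incompatibility constraints (opposite directions cannot co-occur) yields consistent labels.
If the full VGGT (1.2B) is too heavy, swap in the Q-Former-based distilled model: it reuses the VideoLLM's existing vision features, so camera recognition runs about 5x faster without an extra 3D backbone.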
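
The incompatibility constraint can be applied as a post-processing step on per-label scores. The sketch below is one simple way to do it, not the paper's classifier: the list of opposing pairs, the 0.5 threshold, the keep-the-higher-score rule, and the treatment of `static` as a fallback are all assumptions made for illustration.

```python
# Hypothetical opposing pairs; the paper's full constraint set is not listed in this summary.
INCOMPATIBLE_PAIRS = [
    ("pan-left", "pan-right"),
    ("tilt-up", "tilt-down"),
    ("dolly-in", "dolly-out"),
    ("truck-left", "truck-right"),
]


def decode_with_constraints(scores, threshold=0.5):
    """Turn {label: probability} into a consistent label list for one second."""
    active = {l for l, s in scores.items() if l != "static" and s >= threshold}
    for a, b in INCOMPATIBLE_PAIRS:
        if a in active and b in active:
            # Both directions fired: drop the lower-scoring one.
            active.discard(a if scores[a] < scores[b] else b)
    if not active:
        return ["static"]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(active, key=lambda label: (-scores[label], label))


def build_motion_header(per_second_scores):
    """Format decoded labels as the per-second motion header used in the prompt."""
    parts = [" and ".join(decode_with_constraints(s)) for s in per_second_scores]
    return "Per-second camera motion: [" + ", ".join(parts) + "]"


scores = [
    {"pan-left": 0.9, "pan-right": 0.6, "tilt-up": 0.7, "static": 0.1},
    {"pan-left": 0.2, "static": 0.8},
]
print(build_motion_header(scores))
```

Resolving conflicts before formatting keeps contradictory text such as "pan-left and pan-right" out of the header that the VideoLLM reads.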

Code Example

# Structured prompt template for injecting camera motion information into a VideoLLM.
# The per-second camera motions predicted with VGGT are inserted into the prompt.

per_second_motions = [
    "pan-left",              # second 1
    "static",                # second 2
    "pan-right",             # second 3
    "pan-left and tilt-up",  # second 4 (compound motion)
    "static",                # second 5
]

motion_header = "Per-second camera motion: [" + ", ".join(per_second_motions) + "]"

# Placeholder sampling settings (not from the paper); set these to match your frame sampler.
fps = 2                            # frames sampled per second
N = fps * len(per_second_motions)  # number of frames passed to the model

prompt = f"""Here are {N} consecutive video frames.
They are evenly sampled at a frame rate of {fps} FPS.
{motion_header}
Describe this video using the filmmaker's language, highlighting the lighting,
framing, video composition, and especially camera usage that connects
different frames. For example: "At the beginning, <video content>; then
<camera motion>, <video content>; ...; finally, <camera motion>, <video
content>". Make your description in a paragraph."""

# Prompt for evaluating on the CameraMotionVQA benchmark
vqa_prompt = """<video>
Identify the camera motion depicted in the video using standard cinematographic terminology.
Options:
(A) pan-left
(B) dolly-in and pan-right
(C) static
(D) tilt-up and truck-left
"""

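For benchmark-style evaluation, the multiple-choice prompt can be built from an option list and the model's reply mapped back to a letter. The helpers below are a hypothetical sketch, not the benchmark's official scoring code: `build_vqa_prompt`, `parse_choice`, and `accuracy` are names invented here, and the parsing rule (a parenthesized letter anywhere, or a reply that is just one letter) is a deliberately conservative assumption.

```python
import re
import string

QUESTION = (
    "Identify the camera motion depicted in the video "
    "using standard cinematographic terminology."
)


def build_vqa_prompt(options):
    """Format a multiple-choice question with lettered options."""
    lines = ["<video>", QUESTION, "Options:"]
    lines += [f"({letter}) {opt}" for letter, opt in zip(string.ascii_uppercase, options)]
    return "\n".join(lines) + "\n"


def parse_choice(response, num_options):
    """Return the chosen option letter, or None if the reply is ambiguous."""
    valid = string.ascii_uppercase[:num_options]
    match = re.search(rf"\(([{valid}])\)", response)
    if match:
        return match.group(1)
    bare = response.strip().rstrip(".").upper()
    return bare if len(bare) == 1 and bare in valid else None


def accuracy(responses, answers, num_options=4):
    """Share of replies whose parsed letter equals the answer key."""
    correct = sum(parse_choice(r, num_options) == a for r, a in zip(responses, answers))
    return correct / len(answers)


options = ["pan-left", "dolly-in and pan-right", "static", "tilt-up and truck-left"]
print(build_vqa_prompt(options))
print(accuracy(["The answer is (B).", "c", "I am not sure"], ["B", "C", "A"]))
```

With four options, uniform guessing gives an expected accuracy of 1/4, which is the 25% random baseline quoted for off-the-shelf VideoLLMs.
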
Terminology

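VideoLLM: A large language model that understands video. It takes multiple frames as input and can describe in text what is happening in the video or answer questions about it.

3DFM (3D Foundation Model): A large pretrained model that infers 3D structure, camera position, depth, and similar properties from photos or video. VGGT is a representative example and can extract camera parameters in a single forward pass.

VGGT: Short for Visual Geometry Grounded Transformer. A 1.2B-parameter model that looks at images or video and predicts camera pose, depth maps, 3D points, and more in one pass.

ViT (Vision Transformer): An image encoder that splits an image into small patches and processes them with a Transformer. It is the backbone that extracts frame features in a VideoLLM.

Q-Former: An architecture proposed in BLIP-2. A lightweight bridge module in which a small number of learnable query tokens pull only the needed information out of a large set of vision features, much like an antenna tuned to pick up one particular frequency.

Multi-label classification: A classification setup in which several labels can apply to one input at the same time. Example: if the camera rotates left while also tilting up, the two labels 'pan-left + tilt-up' are assigned together.

Structured prompting: A technique that gives an LLM a structured format (for example, a list of camera motions in time order) instead of free-form text, to steer it toward more accurate output.

Knowledge Distillation: A method that trains a small, fast 'student' model to imitate a large, strong 'teacher' model. Here, the camera recognition ability of VGGT (1.2B) is compressed into a Q-Former student (8.72M parameters).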

Original Abstract

Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of benchmarking, diagnosis, and injection. We curate CameraMotionDataset, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark--CameraMotionVQA. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward a camera-aware VideoLLM and VLA system. The dataset and benchmark is publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.