When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
TL;DR Highlight
SeeUnsafe: a GPT-4o-based MLLM agent framework that automatically classifies traffic accidents from CCTV footage and identifies involved objects.
Who Should Read
Backend/ML engineers building automated analysis pipelines for large-scale CCTV footage in traffic safety or smart city platforms. Developers wanting to apply multimodal LLMs to real video analysis.
Core Mechanics
- Works around MLLMs' long-video limitation by splitting footage into short clips, classifying each clip, and merging the clip-level results with 'severity-based aggregation'
- Uses GroundingDINO (open-vocabulary detection) and Segment Anything (object segmentation) to add bounding boxes as visual prompts, enabling GPT-4o to pinpoint accident-related objects
- BLEU/ROUGE still give high scores when 'pedestrian' is confused with 'cyclist', making them inappropriate for traffic safety evaluation; the paper proposes a new metric, IMS (Information Matching Score)
- GPT-4o-based SeeUnsafe achieved 76.31% classification accuracy and 51.47% visual grounding success — outperforming vanilla GPT-4o (71.49%) and GPT-4o mini (58.23%)
- Structured output format (Video Class / Scene Context / Object Description / Justification) enables database indexing without post-processing
- In a night-vision case, visual prompts (bounding-box overlays) actually hurt performance; the prompting strategy needs to adapt to input quality
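The clip-splitting and severity-based aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the severity ordering, label names, and the stubbed clip classifier (a GPT-4o call in SeeUnsafe) are assumptions.

```python
# Sketch of severity-based aggregation over per-clip classifications.
# Assumption: three accident classes ordered by severity; the actual
# SeeUnsafe label set and clip classifier (GPT-4o) may differ.
SEVERITY = {"no_accident": 0, "near_miss": 1, "accident": 2}

def split_into_clips(frames, clip_len=3):
    """Split a frame sequence into fixed-length clips."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]

def aggregate_by_severity(clip_labels):
    """Return the most severe label among the per-clip classifications."""
    return max(clip_labels, key=lambda label: SEVERITY[label])

# Usage: classify each clip (stubbed here), then take the worst case.
labels = ["no_accident", "accident", "near_miss"]
print(aggregate_by_severity(labels))  # -> accident
```

Taking the maximum over a severity ordering makes the video-level label conservative: one accident-bearing clip is enough to flag the whole video.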
Evidence
- SeeUnsafe (GPT-4o) classification accuracy 76.31% vs GPT-4o vanilla 71.49% vs GPT-4o mini vanilla 58.23% vs VideoCLIP 27.71%
- Visual grounding success rate: 51.47% of 136 videos; 87.5% of 88 videos with valid masks
- ROUGE stays above 0.90 even when pedestrian/cyclist are confused, yet BLEU drops 21.3% for a mere location error; the two metrics penalize errors inconsistently
- Night vs daytime: night 42.11% (no VP) vs daytime 68.18% (no VP); adding VP at night caused drop to 36.84%
How to Apply
- For bulk CCTV pipeline: split video into 3 clips of 3 frames each, classify each with GPT-4o, implement severity aggregation selecting the most severe class as final label.
- For accident object tracking: run GroundingDINO on first frame to detect person/car/cyclist, track with SAM across frames, overlay only bounding boxes and pass to GPT-4o.
- For LLM-based response quality evaluation: copy the IMS prompt (Prompt 5), configure a GPT-4o evaluation agent, and average 3 runs at temperature=0.5 for more reliable scores.
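The evaluation step in the last bullet, averaging three judge runs for stability, can be sketched as below. `score_with_ims` is a stand-in for the GPT-4o judge call built from the paper's Prompt 5 (not reproduced here); the function names are illustrative.

```python
import statistics

def score_with_ims(response, ground_truth):
    """Placeholder for one GPT-4o judge call at temperature=0.5.
    A real implementation would send the IMS prompt (Prompt 5) with both
    texts and parse the returned score in [0, 1]."""
    raise NotImplementedError("wire up the GPT-4o evaluation agent here")

def ims(response, ground_truth, runs=3, scorer=score_with_ims):
    """Average several stochastic judge runs for a more reliable score."""
    return statistics.mean(scorer(response, ground_truth) for _ in range(runs))

# Usage with a deterministic stub in place of the GPT-4o judge:
stub = lambda resp, gt: 0.8
print(ims("model answer", "reference", scorer=stub))  # -> 0.8
```

Averaging repeated runs at a nonzero temperature smooths out the run-to-run variance of an LLM judge at the cost of extra API calls.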
Code Example
Original Abstract
The increasing availability of traffic videos functioning on a 24/7/365 time scale has the great potential of increasing the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing footage from hundreds, if not thousands, of traffic cameras in a 24/7/365 working protocol still remains an extremely challenging task, as current vision-based approaches primarily focus on extracting raw information, such as vehicle trajectories or individual object detection, but require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow to a more interactive, conversational approach. This shift significantly enhances processing throughput by automating complex tasks like video classification and visual grounding, while improving adaptability by enabling seamless adjustments to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths and a novel multimodal prompt to generate structured responses for review and evaluation to enable fine-grained visual grounding. We introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and enables visual grounding by building upon off-the-shelf MLLMs. Our code will be made publicly available upon acceptance.