When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
TL;DR Highlight
SeeUnsafe: a GPT-4o-based MLLM agent framework that automatically classifies traffic accidents from CCTV footage and identifies involved objects.
Who Should Read
Backend/ML engineers building automated analysis pipelines for large-scale CCTV footage in traffic safety or smart city platforms. Developers wanting to apply multimodal LLMs to real video analysis.
Core Mechanics
- Works around MLLMs' long-video limitation by splitting footage into short clips, classifying each clip, and merging the clip-level results with 'severity-based aggregation'
- Uses GroundingDINO (open-vocabulary detection) and Segment Anything (object segmentation) to add bounding boxes as visual prompts, enabling GPT-4o to pinpoint accident-related objects
- BLEU/ROUGE still give high scores when 'pedestrian' is confused with 'cyclist', making them inappropriate for traffic safety evaluation; the paper proposes a new metric, IMS (Information Matching Score)
- GPT-4o-based SeeUnsafe achieved 76.31% classification accuracy and 51.47% visual grounding success — outperforming vanilla GPT-4o (71.49%) and GPT-4o mini (58.23%)
- Structured output format (Video Class / Scene Context / Object Description / Justification) enables database indexing without post-processing
- In a night-vision case, visual prompts (bounding-box overlays) actually hurt performance; the prompting strategy needs to adapt to input quality
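The clip-splitting and severity-based aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the severity ordering, label names, and the stubbed clip classifier (a GPT-4o call in SeeUnsafe) are assumptions.

```python
# Sketch of severity-based aggregation over per-clip classifications.
# Assumption: three accident classes ordered by severity; the actual
# SeeUnsafe label set and clip classifier (GPT-4o) may differ.
SEVERITY = {"no_accident": 0, "near_miss": 1, "accident": 2}

def split_into_clips(frames, clip_len=3):
    """Split a frame sequence into fixed-length clips."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]

def aggregate_by_severity(clip_labels):
    """Return the most severe label among the per-clip classifications."""
    return max(clip_labels, key=lambda label: SEVERITY[label])

# Usage: classify each clip (stubbed here), then take the worst case.
labels = ["no_accident", "accident", "near_miss"]
print(aggregate_by_severity(labels))  # -> accident
```

Taking the maximum over a severity ordering makes the video-level label conservative: one accident-bearing clip is enough to flag the whole video.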
Evidence
- SeeUnsafe (GPT-4o) classification accuracy 76.31% vs GPT-4o vanilla 71.49% vs GPT-4o mini vanilla 58.23% vs VideoCLIP 27.71%
- Visual grounding success rate: 51.47% of 136 videos; 87.5% of 88 videos with valid masks
- ROUGE stays above 0.90 even when pedestrian/cyclist are confused, yet BLEU drops 21.3% for a mere location error; the two metrics penalize errors inconsistently
- Night vs daytime: night 42.11% (no VP) vs daytime 68.18% (no VP); adding VP at night caused drop to 36.84%
How to Apply
- For bulk CCTV pipeline: split video into 3 clips of 3 frames each, classify each with GPT-4o, implement severity aggregation selecting the most severe class as final label.
- For accident object tracking: run GroundingDINO on first frame to detect person/car/cyclist, track with SAM across frames, overlay only bounding boxes and pass to GPT-4o.
- For LLM-based response quality evaluation: copy the IMS prompt (Prompt 5), configure a GPT-4o evaluation agent, and average 3 runs at temperature=0.5 for more reliable scores.
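The evaluation step in the last bullet, averaging three judge runs for stability, can be sketched as below. `score_with_ims` is a stand-in for the GPT-4o judge call built from the paper's Prompt 5 (not reproduced here); the function names are illustrative.

```python
import statistics

def score_with_ims(response, ground_truth):
    """Placeholder for one GPT-4o judge call at temperature=0.5.
    A real implementation would send the IMS prompt (Prompt 5) with both
    texts and parse the returned score in [0, 1]."""
    raise NotImplementedError("wire up the GPT-4o evaluation agent here")

def ims(response, ground_truth, runs=3, scorer=score_with_ims):
    """Average several stochastic judge runs for a more reliable score."""
    return statistics.mean(scorer(response, ground_truth) for _ in range(runs))

# Usage with a deterministic stub in place of the GPT-4o judge:
stub = lambda resp, gt: 0.8
print(ims("model answer", "reference", scorer=stub))  # -> 0.8
```

Averaging repeated runs at a nonzero temperature smooths out the run-to-run variance of an LLM judge at the cost of extra API calls.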
Code Example
Original Abstract
The increasing availability of traffic videos functioning on a 24/7/365 time scale has the great potential of increasing the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing footage from hundreds, if not thousands, of traffic cameras in a 24/7/365 working protocol still remains an extremely challenging task, as current vision-based approaches primarily focus on extracting raw information, such as vehicle trajectories or individual object detection, but require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow to a more interactive, conversational approach. This shift significantly enhances processing throughput by automating complex tasks like video classification and visual grounding, while improving adaptability by enabling seamless adjustments to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths and a novel multimodal prompt to generate structured responses for review and evaluation to enable fine-grained visual grounding. We introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and enables visual grounding by building upon off-the-shelf MLLMs. Our code will be made publicly available upon acceptance.