Two-Stage Vision Transformer-Based Framework for Anomaly Detection and Classification in Surveillance Videos

  • Mahedi Hasan
  • Jubair Ahmed Nabin
  • Naeem Mia
  • Fahim Shakil Tamim
  • Suzad Mohammad
  • Dipta Mohon Das

Research output: Chapter in Book/Report/Conference proceeding · Conference contribution · peer-review

Abstract

Anomaly detection and classification play a vital role in maintaining public safety and security. An automated system for anomaly detection and classification can reduce human effort, cost, and time. We propose a two-stage classifier pipelined within a single model for anomaly detection and classification. The first stage is a Convolutional Neural Network (CNN)-based binary classifier that determines whether an event is anomalous or normal. If the first classifier finds the event anomalous, it is passed to the second stage of our single-pipeline model. The second-stage classifier is a Vision Transformer (ViT)-based architecture that further classifies an anomalous event into a specific anomaly category. This research utilizes the UCF Crime dataset, which, being quite large, requires a significant amount of computational resources and time to process. We also propose a keyframe extraction algorithm to reduce computational cost and time: it identifies and selects only the relevant frames of a video and discards redundant and irrelevant frames. The proposed methodology combines a CNN and a ViT to extract spatial-temporal features from complex scenes and classify them. The proposed model achieves 98% accuracy for the binary classification module and 95% accuracy for multi-class classification. Furthermore, the proposed keyframe extraction algorithm significantly reduces processing time and computational resources, requiring only 20 ms of processing time per video. These outcomes suggest that the proposed model can outperform traditional methods for anomaly detection and classification. However, highly correlated and vast amounts of data create problems such as overfitting and increase the complexity of the model.
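The two-stage control flow described above, together with the keyframe-filtering idea, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the abstract does not specify the keyframe selection criterion, so a simple mean-frame-difference rule is assumed here, and the CNN and ViT stages are represented as pluggable callables (`binary_clf`, `multiclass_clf` are hypothetical names).

```python
import numpy as np

def extract_keyframes(frames, threshold=30.0):
    """Keep a frame only if its mean absolute pixel difference from the
    last kept frame exceeds `threshold`; near-duplicate frames are dropped.
    NOTE: frame differencing is an assumed criterion for illustration; the
    paper's actual keyframe extraction algorithm may differ."""
    if not frames:
        return []
    keyframes = [frames[0]]
    for frame in frames[1:]:
        diff = np.mean(np.abs(frame.astype(np.float32)
                              - keyframes[-1].astype(np.float32)))
        if diff > threshold:
            keyframes.append(frame)
    return keyframes

def two_stage_classify(frames, binary_clf, multiclass_clf):
    """Stage 1: a binary classifier (CNN in the paper) decides anomalous
    vs. normal. Stage 2: only anomalous events reach the multi-class head
    (a ViT in the paper), which names the specific anomaly category."""
    if not binary_clf(frames):       # stage 1: normal -> stop early
        return "normal"
    return multiclass_clf(frames)    # stage 2: specific anomaly class
```

A usage example with stub classifiers: `two_stage_classify(frames, lambda f: True, lambda f: "fighting")` routes the event through both stages and returns `"fighting"`, while a stage-1 result of normal short-circuits the pipeline without invoking the second model.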

Original language: English
Title of host publication: 2025 International Conference on Electrical, Computer and Communication Engineering, ECCE 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350357509
DOIs
Publication status: Published - 2025
Externally published: Yes
Event: 2025 International Conference on Electrical, Computer and Communication Engineering, ECCE 2025 - Chittagong, Bangladesh
Duration: 13 Feb 2025 - 15 Feb 2025

Publication series

Name: 2025 International Conference on Electrical, Computer and Communication Engineering, ECCE 2025

Conference

Conference: 2025 International Conference on Electrical, Computer and Communication Engineering, ECCE 2025
Country/Territory: Bangladesh
City: Chittagong
Period: 13/02/25 - 15/02/25

Keywords

  • Anomaly Detection and Classification
  • CNN
  • Keyframe Extraction
  • Vision Transformers (ViT)
