TY - GEN
T1 - Two-Stage Vision Transformer-Based Framework for Anomaly Detection and Classification in Surveillance Videos
AU - Hasan, Mahedi
AU - Nabin, Jubair Ahmed
AU - Mia, Naeem
AU - Tamim, Fahim Shakil
AU - Mohammad, Suzad
AU - Das, Dipta Mohon
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Anomaly detection and classification play a vital role in maintaining public safety and security. An automated system for anomaly detection and classification can reduce human effort, cost, and time. We propose a two-stage classifier pipelined within a single model for anomaly detection and classification. The first stage is a Convolutional Neural Network (CNN)-based binary classifier that determines whether an event is anomalous or normal. If the first classifier labels an event anomalous, the event passes to the second stage of our single-pipeline model: a Vision Transformer (ViT)-based architecture that further classifies the anomalous event into specific anomaly categories. This research utilized the UCF Crime dataset, which, being quite large, requires a significant amount of computational resources and processing time. We therefore also propose a keyframe extraction algorithm that reduces computational cost and time by identifying and selecting only the relevant frames of a video and discarding redundant and irrelevant ones. The proposed methodology combines a Convolutional Neural Network (CNN) and a Vision Transformer (ViT) to extract spatio-temporal features from complex scenes and classify them. The proposed model achieves 98% accuracy for the binary classification module and 95% accuracy for multi-class classification. Furthermore, the proposed keyframe extraction algorithm significantly reduces processing time and computational resources, requiring only 20 ms per video. These results suggest that the proposed model can outperform traditional methods for anomaly detection and classification. However, highly correlated and vast amounts of data create problems such as overfitting and increase the complexity of the model.
AB - Anomaly detection and classification play a vital role in maintaining public safety and security. An automated system for anomaly detection and classification can reduce human effort, cost, and time. We propose a two-stage classifier pipelined within a single model for anomaly detection and classification. The first stage is a Convolutional Neural Network (CNN)-based binary classifier that determines whether an event is anomalous or normal. If the first classifier labels an event anomalous, the event passes to the second stage of our single-pipeline model: a Vision Transformer (ViT)-based architecture that further classifies the anomalous event into specific anomaly categories. This research utilized the UCF Crime dataset, which, being quite large, requires a significant amount of computational resources and processing time. We therefore also propose a keyframe extraction algorithm that reduces computational cost and time by identifying and selecting only the relevant frames of a video and discarding redundant and irrelevant ones. The proposed methodology combines a Convolutional Neural Network (CNN) and a Vision Transformer (ViT) to extract spatio-temporal features from complex scenes and classify them. The proposed model achieves 98% accuracy for the binary classification module and 95% accuracy for multi-class classification. Furthermore, the proposed keyframe extraction algorithm significantly reduces processing time and computational resources, requiring only 20 ms per video. These results suggest that the proposed model can outperform traditional methods for anomaly detection and classification. However, highly correlated and vast amounts of data create problems such as overfitting and increase the complexity of the model.
KW - Anomaly Detection and Classification
KW - CNN
KW - Keyframe Extraction
KW - Vision Transformers (ViT)
UR - https://www.scopus.com/pages/publications/105007760566
U2 - 10.1109/ECCE64574.2025.11013374
DO - 10.1109/ECCE64574.2025.11013374
M3 - Conference contribution
AN - SCOPUS:105007760566
T3 - 2025 International Conference on Electrical, Computer and Communication Engineering, ECCE 2025
BT - 2025 International Conference on Electrical, Computer and Communication Engineering, ECCE 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 International Conference on Electrical, Computer and Communication Engineering, ECCE 2025
Y2 - 13 February 2025 through 15 February 2025
ER -