Video anomaly detection with long-and-short-term time series correlations
2024, Vol. 29, No. 7: 1998-2010
Print publication date: 2024-07-16
DOI: 10.11834/jig.230406
Zhu Xinrui, Qian Xiaoyan, Shi Yuzhou, Tao Xudong, Li Zhiyu. 2024. Video anomaly detection with long-and-short-term time series correlations. Journal of Image and Graphics, 29(07): 1998-2010
Objective
Multiple instance learning (MIL) is a powerful tool for weakly supervised video abnormal event detection. Abnormal events are typically sparse, sudden, and locally continuous. However, current MIL methods do not fully consider the relationships between instances and ignore the temporal correlations between video segments, so normal and abnormal segments cannot be adequately separated. To address this problem, a two-stage abnormal detection network with long-and-short-term time series correlation is proposed.
Method
The first stage is a long-and-short-term time series correlated abnormal detection network (long-and-short-term correlated MIL abnormal detection framework, LSC-transMIL), which applies the Transformer structure to the MIL method and adds local and global temporal attention mechanisms, strengthening the time series correlation of consecutive video segments while learning the spatial semantic correlations between different segments. The second stage builds an abnormal detection network based on a spatiotemporal attention mechanism: the abnormal scores generated in the first stage are used as fine-grained pseudo-labels, the detection network is trained with a pseudo-label strategy, and the backbone network is fine-tuned to improve the network's adaptability.
Result
Experiments on two large public datasets compared the model with similar methods. The two-stage abnormal detection model achieves an area under the curve (AUC) of 82.88% on UCF-crime and 96.34% on ShanghaiTech, improvements of 1.58% and 0.58%, respectively, over comparable two-stage methods. Ablation experiments demonstrate the effectiveness of the time-series-aware Transformer module and of the long-and-short-term attention.
Conclusion
This paper applies the Transformer to time series MIL and adds long-and-short-term attention to highlight the differences between local abnormal events and normal events, effectively detecting abnormal events in videos.
Objective
Video anomaly detection has been applied in many fields such as manufacturing, traffic management, and security monitoring. However, detailed annotation of video data is labor-intensive and cumbersome. Consequently, many researchers have turned to weakly supervised learning methods to address this issue. Unlike supervised methods, weakly supervised learning requires only video-level labels in the training stage, which greatly reduces the labeling workload; frame-level annotation is needed only for the test set. Multiple instance learning (MIL) has been recognized as a powerful tool for weakly supervised video abnormal event detection. Abnormal behavior in video is highly correlated with video context information. The traditional MIL method uses a 3D convolutional network to extract video features and a ranking loss function, into which sparsity and temporal smoothness constraints are introduced to integrate temporal information into the ranking model. However, introducing temporal information only through the loss function is insufficient. Using a temporal convolutional network to extract video context information further enhances the video anomaly detection network, but such a global injection of temporal information still cannot sufficiently separate abnormal video clips from normal ones. Therefore, attention-based MIL builds time-enhancing networks to learn motion features while using the attention mechanism to incorporate temporal information into the ranking model; the learned attention weights help better distinguish between abnormal and normal video clips. The spatiotemporal fusion graph network constructs spatial similarity graphs and temporal continuity graphs separately for video segments and fuses them into a spatiotemporal fusion graph, strengthening the spatiotemporal correlations among video segments and ultimately enhancing the accuracy of abnormal behavior detection. The multiple instance self-training framework uses pseudo-label training, an effective strategy for improving model quality in weakly supervised learning: it constructs a two-stage training network in which pseudo-labels produced by the first-stage MIL guide the training of the second-stage self-guided attention feature extractor, providing a general recipe for improving model quality. However, these approaches do not fully exploit temporal correlations, because the feature representations of the instances are not fused with neighboring and global features. Abnormal events often exhibit sparsity, suddenness, and local continuity, and the insufficient temporal correlation between video segments leads to inadequate separation of normal and abnormal segments. To address this issue, this paper proposes a two-stage abnormal detection network with long-and-short-term time series association.
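As a point of reference, the traditional MIL ranking objective with sparsity and temporal-smoothness constraints (Sultani et al., 2018) can be sketched in a few lines of plain Python; the coefficient values, function name, and toy scores below are illustrative assumptions, not taken from this paper.

```python
def mil_ranking_loss(pos_bag, neg_bag, lam1=8e-5, lam2=8e-5, margin=1.0):
    """Hinge ranking loss between the top-scoring instance of an anomalous
    bag and of a normal bag, plus temporal smoothness and sparsity
    constraints on the anomalous bag (illustrative coefficients)."""
    hinge = max(0.0, margin - max(pos_bag) + max(neg_bag))
    # Temporal smoothness: adjacent segment scores should change slowly.
    smooth = sum((a - b) ** 2 for a, b in zip(pos_bag, pos_bag[1:]))
    # Sparsity: only a few segments of an anomalous video are anomalous.
    sparse = sum(pos_bag)
    return hinge + lam1 * smooth + lam2 * sparse

# Toy segment scores for one anomalous and one normal video.
pos = [0.1, 0.9, 0.8, 0.2]   # anomalous bag with one clear anomaly
neg = [0.1, 0.2, 0.1, 0.1]   # normal bag
loss = mil_ranking_loss(pos, neg)
```

Only the highest-scoring instance per bag enters the hinge term, which is what lets the loss operate with video-level labels alone.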
Method
The first stage is a long-and-short-term time series association abnormal detection network (LSC-transMIL) that applies the Transformer structure to MIL. It consists of two layers, each containing a local temporal sequence correlation attention module and a global instance correlation attention module. The former learns temporal relationships between each instance and its neighboring instances, while the latter focuses on the association between each instance and global information. Combining the local and global attention mechanisms establishes meaningful correlations among instances and highlights the distinctions between local and global features in the video, making it easier to distinguish abnormal video segments from normal ones. This module generates new instance features, which are fed into the ranking model to produce video abnormal scores and pseudo-labels. In the second stage, an abnormal detection network based on a spatiotemporal attention mechanism is constructed. The SlowFast backbone network extracts video features, and the slow- and fast-pathway features are weighted and fused with spatiotemporal attention: the slow branch attends to the spatiotemporal information of the video frames through a spatiotemporal attention module, the fast branch is guided toward temporal information through a time-dimension attention module, and the two branch features are then concatenated to obtain the final video features. The abnormal scores generated in the first stage serve as fine-grained pseudo-labels to train the abnormal event detection network with a pseudo-labeling strategy. Furthermore, the backbone network is fine-tuned to enhance the adaptive capability of the abnormal event detection network.
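The local and global attention modules described above can be illustrated with a minimal single-head self-attention sketch over toy two-dimensional segment features: a sliding-window mask stands in for local temporal attention, and an unmasked pass for global attention. The window size, feature values, and function names are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]  # masked -inf entries become 0.0
    s = sum(es)
    return [e / s for e in es]

def attention(feats, window=None):
    """Single-head self-attention over a sequence of segment features.
    With `window` set, each segment attends only to neighbours within
    +/- `window` steps (local temporal attention); with window=None it
    attends to every segment (global attention)."""
    d = len(feats[0])
    out = []
    for i, q in enumerate(feats):
        scores = []
        for j, k in enumerate(feats):
            if window is not None and abs(i - j) > window:
                scores.append(float("-inf"))  # outside the local window
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        w = softmax(scores)
        out.append([sum(w[j] * feats[j][t] for j in range(len(feats)))
                    for t in range(d)])
    return out

# Four toy segment features: two "normal-like", two "anomaly-like".
segs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
local_out = attention(segs, window=1)   # short-term correlation
global_out = attention(segs)            # long-term correlation
```

The local pass keeps each output close to its neighbours, while the global pass mixes in distant, dissimilar segments; combining both views is what sharpens the local/global contrast the method relies on.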
Result
Extensive experiments on two large-scale public datasets (UCF-crime and ShanghaiTech) compared the proposed two-stage abnormal detection model with similar methods. The model achieves area under the curve (AUC) scores of 82.88% and 96.34% on UCF-crime and ShanghaiTech, respectively, improvements of 1.58% and 0.58% over other two-stage methods. Thorough ablation experiments on the two datasets compared the proposed LSC-transMIL with the traditional MIL method and the attention MIL method under three backbone networks, demonstrating the effectiveness of LSC-transMIL. Qualitative and quantitative analyses of the ablations on global attention versus combined local-global attention prove the effectiveness of combining the two, and heat maps visualize the roles of local and global temporal correlation.
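The AUC metric reported above can be computed at frame level without any library, using the standard rank-based (Mann-Whitney) formulation; the function name and toy scores are illustrative, not code from the paper.

```python
def frame_auc(scores, labels):
    """Frame-level ROC AUC via the Mann-Whitney rank formulation: the
    probability that a randomly chosen anomalous frame scores higher
    than a randomly chosen normal frame (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 * (p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy frame anomaly scores and ground-truth labels (1 = anomalous frame).
scores = [0.9, 0.8, 0.3, 0.2, 0.7]
labels = [1, 0, 0, 1, 0]
auc = frame_auc(scores, labels)  # 0.5 for this toy example
```

An AUC of 0.5 corresponds to chance-level ranking; the 82.88% and 96.34% figures mean the detector ranks anomalous frames above normal ones with that probability.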
Conclusion
This paper applies the Transformer to time series-based MIL and introduces long-and-short-term attention to highlight the differences between local abnormal events and normal events. The proposed two-stage abnormal detection network uses the abnormal scores generated in the first stage as pseudo-labels, trains a network built on the SlowFast backbone and spatiotemporal attention modules, and fine-tunes the backbone to enhance the network's adaptive capability. The approach effectively improves the accuracy of abnormal event detection.
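The pseudo-label training idea restated here can be sketched minimally: stage-1 segment scores become binary segment labels that supervise the stage-2 detector with a cross-entropy loss. The thresholding rule, function names, and toy numbers are assumptions for illustration; the paper's exact labeling scheme may differ.

```python
import math

def make_pseudo_labels(stage1_scores, thresh=0.5):
    """Turn stage-1 segment anomaly scores into fine-grained binary
    pseudo-labels for the stage-2 detector (simple thresholding is an
    assumption here, not necessarily the paper's rule)."""
    return [1 if s >= thresh else 0 for s in stage1_scores]

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one segment prediction."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

stage1_scores = [0.05, 0.92, 0.88, 0.10]    # from the stage-1 MIL network
pseudo = make_pseudo_labels(stage1_scores)  # [0, 1, 1, 0]
stage2_preds = [0.2, 0.7, 0.6, 0.3]         # stage-2 detector outputs
loss = sum(bce(p, y) for p, y in zip(stage2_preds, pseudo)) / len(pseudo)
```

Because the pseudo-labels are per-segment rather than per-video, the stage-2 network (and its fine-tuned backbone) receives far finer supervision than the video-level labels the pipeline started from.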
Keywords: anomaly detection; Transformer; spatio-temporal attention; multiple instance learning (MIL); weakly supervised
Abbas Z K and Al-Ani A A. 2022. A comprehensive review for video anomaly detection on videos//Proceedings of 2022 International Conference on Computer Science and Software Engineering (CSASE). Duhok, Iraq: IEEE: #9759598 [DOI: 10.1109/CSASE51777.2022.9759598]
Carbonneau M A, Cheplygina V, Granger E and Gagnon G. 2018. Multiple instance learning: a survey of problem characteristics and applications. Pattern Recognition, 77: 329-353 [DOI: 10.1016/j.patcog.2017.10.009]
Carreira J and Zisserman A. 2017. Quo vadis, action recognition? A new model and the kinetics dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4724-4733 [DOI: 10.1109/CVPR.2017.502]
Feichtenhofer C, Fan H Q, Malik J and He K M. 2019. SlowFast networks for video recognition//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 6201-6210 [DOI: 10.1109/ICCV.2019.00630]
Feng J C, Hong F T and Zheng W S. 2021. MIST: multiple instance self-training framework for video anomaly detection//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 14004-14013 [DOI: 10.1109/CVPR46437.2021.01379]
Gong Y L, Wang C, Dai X M, Yu S H, Xiang L H and Wu J F. 2022. Multi-scale continuity-aware refinement network for weakly supervised video anomaly detection//Proceedings of 2022 IEEE International Conference on Multimedia and Expo (ICME). Taipei, China: IEEE: 1-6 [DOI: 10.1109/ICME52920.2022.9860012]
Ilse M, Tomczak J M and Welling M. 2018. Attention-based deep multiple instance learning//Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: JMLR: 2127-2136
Li S, Liu F and Jiao L C. 2022. Self-training multi-sequence learning with Transformer for weakly supervised video anomaly detection//Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 1395-1403 [DOI: 10.1609/aaai.v36i2.20028]
Liang J F, Li T, Yang J Q, Li Y N, Fang Z W and Yang F. 2023. Video anomaly detection by fusing self-attention and autoencoder. Journal of Image and Graphics, 28(4): 1029-1040 [DOI: 10.11834/jig.211147]
Liu K and Ma H D. 2019. Exploring background-bias for anomaly detection in surveillance videos//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM: 1490-1499 [DOI: 10.1145/3343031.3350998]
Liu W, Luo W X, Lian D Z and Gao S H. 2018. Future frame prediction for anomaly detection - a new baseline//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6536-6545 [DOI: 10.1109/CVPR.2018.00684]
Liu Y, Liu J, Zhao M Y, Li S and Song L. 2022. Collaborative normality learning framework for weakly supervised video anomaly detection. IEEE Transactions on Circuits and Systems II: Express Briefs, 69(5): 2508-2512 [DOI: 10.1109/TCSII.2022.3161061]
Ma H L and Zhang L Y. 2022. Attention-based framework for weakly supervised video anomaly detection. The Journal of Supercomputing, 78(6): 8409-8429 [DOI: 10.1007/s11227-021-04190-9]
Majhi S, Dash R and Sa P K. 2020. Temporal pooling in inflated 3DCNN for weakly-supervised video anomaly detection//Proceedings of the 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT). Kharagpur, India: IEEE: 1-6 [DOI: 10.1109/ICCCNT49239.2020.9225378]
Shao Z C, Bian H, Chen Y, Wang Y F, Zhang J, Ji X Y and Zhang Y B. 2021. TransMIL: Transformer based correlated multiple instance learning for whole slide image classification [EB/OL]. [2023-06-28]. https://arxiv.org/pdf/2106.00908.pdf
Shi X S, Xing F Y, Xie Y P, Zhang Z Z, Cui L and Yang L. 2020. Loss-based attention for deep multiple instance learning//Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 5742-5749 [DOI: 10.1609/aaai.v34i04.6030]
Sultani W, Chen C and Shah M. 2018. Real-world anomaly detection in surveillance videos//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6479-6488 [DOI: 10.1109/CVPR.2018.00678]
Sun C, Jia Y D, Hu Y and Wu Y W. 2020. Scene-aware context reasoning for unsupervised abnormal event detection in videos//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 184-192 [DOI: 10.1145/3394171.3413887]
Wan B Y, Fang Y M, Xia X and Mei J J. 2020. Weakly supervised video anomaly detection via center-guided discriminative learning//Proceedings of 2020 IEEE International Conference on Multimedia and Expo (ICME). London, UK: IEEE: #9102722 [DOI: 10.1109/ICME46284.2020.9102722]
Wang Z G and Zhang Y J. 2020. Anomaly detection in surveillance videos: a survey. Journal of Tsinghua University (Science and Technology), 60(6): 518-529 [DOI: 10.16511/j.cnki.qhdxxb.2020.22.008]
Wang Z W, She Q and Smolic A. 2021. ACTION-Net: multipath excitation for action recognition//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 13209-13218 [DOI: 10.1109/CVPR46437.2021.01301]
Wang Z M, Zou Y X and Zhang Z M. 2020. Cluster attention contrast for video anomaly detection//Proceedings of the 28th ACM International Conference on Multimedia. Seattle, USA: ACM: 2463-2471 [DOI: 10.1145/3394171.3413529]
Wei X S and Zhou Z H. 2016. An empirical study on image bag generators for multi-instance learning. Machine Learning, 105(2): 155-198 [DOI: 10.1007/s10994-016-5560-1]
Zach C, Pock T and Bischof H. 2007. A duality based approach for realtime TV-L1 optical flow//Proceedings of the 29th DAGM Symposium on Pattern Recognition. Heidelberg, Germany: Springer: 214-223 [DOI: 10.1007/978-3-540-74936-3_22]
Zaheer M Z, Mahmood A, Khan M H, Segu M, Yu F and Lee S I. 2022. Generative cooperative learning for unsupervised video anomaly detection//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 14724-14734 [DOI: 10.1109/CVPR52688.2022.01433]
Zhang J G, Qing L and Miao J. 2019. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection//Proceedings of 2019 IEEE International Conference on Image Processing (ICIP). Taipei, China: IEEE: 4030-4034 [DOI: 10.1109/ICIP.2019.8803657]
Zhong J X, Li N N, Kong W J, Liu S, Li T H and Li G. 2019. Graph convolutional label noise cleaner: train a plug-and-play action classifier for anomaly detection//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1237-1246 [DOI: 10.1109/CVPR.2019.00133]
Zhou H, Zhan Y Z and Mao Q R. 2021. Video anomaly detection based on space-time fusion graph network learning. Journal of Computer Research and Development, 58(1): 48-59 [DOI: 10.7544/issn1000-1239202120200264]
Zhu Y and Newsam S. 2019. Motion-aware feature for improved video anomaly detection [EB/OL]. [2023-06-28]. https://arxiv.org/pdf/1907.10211.pdf