Infrared-visible image object detection algorithm using feature dynamic selection
2024, Vol. 29, No. 8: 2350-2363
Print publication date: 2024-08-16
DOI: 10.11834/jig.230495
Xu Ke, Liu Xinpu, Wang Hanyun, Wan Jianwei, Guo Yulan. 2024. Infrared-visible image object detection algorithm using feature dynamic selection. Journal of Image and Graphics, 29(08): 2350-2363
Objective
Object detection algorithms based on the fusion of visible and infrared dual-modal images are an effective means of handling object detection tasks in complex scenes. However, the feature fusion process in existing dual-modal detection algorithms suffers from two major problems. First, the fusion strategy is relatively simple: element-wise addition or concatenation of features leads to poor fusion quality. Second, the algorithm structure contains only a feature fusion stage and lacks a feature selection stage, so useful features cannot be exploited efficiently. To address these problems, we propose a visible-infrared image fusion object detection algorithm based on dynamic feature selection.
Method
The proposed algorithm contains two novel modules, a dynamic fusion layer and a dynamic selection layer. The dynamic fusion layer is embedded in the backbone network and uses a Transformer structure to fuse the multi-source image feature maps multiple times, enriching the feature representation. The dynamic selection layer is embedded in the neck network and applies three attention mechanisms to enhance the multi-scale feature maps and select useful features.
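The abstract does not spell out the internal structure of the dynamic fusion layer, so the following is only a minimal sketch, assuming a standard multi-head self-attention applied jointly to the visible and infrared tokens; the module name, head count, and residual/normalization arrangement are illustrative assumptions rather than the paper's exact design.

```python
# Minimal sketch of a Transformer-style cross-modal fusion block (assumption:
# joint self-attention over concatenated visible/infrared tokens).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feat_vis: torch.Tensor, feat_ir: torch.Tensor):
        # feat_vis, feat_ir: (B, C, H, W) feature maps from the two backbone branches
        b, c, h, w = feat_vis.shape
        tokens = torch.cat([
            feat_vis.flatten(2).transpose(1, 2),  # (B, H*W, C) visible tokens
            feat_ir.flatten(2).transpose(1, 2),   # (B, H*W, C) infrared tokens
        ], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)  # every token attends to both modalities
        fused = self.norm(fused + tokens)             # residual connection + layer norm
        vis_out, ir_out = fused.split(h * w, dim=1)   # split tokens back per modality
        return (vis_out.transpose(1, 2).reshape(b, c, h, w),
                ir_out.transpose(1, 2).reshape(b, c, h, w))

# Example: fuse 256-channel feature maps of size 40 x 40 from both branches.
# fusion = CrossModalFusion(256)
# vis, ir = fusion(torch.randn(2, 256, 40, 40), torch.randn(2, 256, 40, 40))
```

Because every token attends to tokens of both modalities, each branch's output already mixes visible and infrared information; applying such a block at several backbone stages is one way to realize the repeated fusion described above.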
Result
The proposed algorithm is evaluated on three public datasets: FLIR, LLVIP (visible-infrared paired dataset for low-light vision), and VEDAI (vehicle detection in aerial imagery), and compared with several feature fusion strategies in terms of mean average precision (mAP). Relative to the baseline model, the mAP50 score improves by 1.3%, 0.6%, and 3.9%, the mAP75 score improves by 4.6%, 2.6%, and 7.5%, and the mAP score improves by 3.2%, 2.1%, and 3.1% on the three datasets, respectively. Ablation experiments on the relevant structures further verify the effectiveness of the proposed algorithm.
Conclusion
The proposed visible-infrared image fusion object detection algorithm based on dynamic feature selection effectively fuses the feature information of the visible and infrared image modalities and improves object detection performance.
Objective
In recent years, object detection algorithms that fuse visible and infrared dual-modal images have received considerable attention as an effective approach to object detection in complex scenes. The pipeline of an object detection algorithm can be roughly divided into three stages. The first stage is feature extraction, which extracts features from the input data. Next, the extracted features are fed into the neck network for multi-scale feature fusion. Finally, the fused features are passed to the detection head to produce the detection results. Dual-modal detection algorithms follow the same pipeline to localize and classify objects; the difference is that traditional object detection operates on single-modal visible images, whereas dual-modal detection operates on paired visible and infrared images. A dual-modal detection algorithm aims to exploit information from the infrared and visible images simultaneously, merging them to obtain more comprehensive and accurate target information and thereby improving the accuracy and robustness of detection. Traditional fusion methods include pixel-level fusion and feature-level fusion. Pixel-level fusion applies a simple weighted overlay to the two images, which enhances the contrast and edge information of the targets, whereas feature-level fusion extracts features from the infrared and visible images and combines them to strengthen the representation of the targets. However, the feature fusion process of existing dual-modal detection algorithms faces two major issues. First, the fusion methods employed are relatively simple, typically element-wise addition or concatenation of features, which yields unsatisfactory fusion results and limits the performance of subsequent detection. Second, the algorithm structure focuses solely on feature fusion and neglects the crucial feature selection process, which results in inefficient utilization of valuable features.
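As a point of reference for the two traditional strategies described above, a tiny sketch of pixel-level fusion (weighted overlay of registered images) and feature-level fusion (element-wise addition or channel concatenation) is given below; the function names and the weight alpha are illustrative assumptions.

```python
# Sketch of the two traditional fusion strategies (illustrative only).
import torch

def pixel_level_fusion(img_vis: torch.Tensor, img_ir: torch.Tensor, alpha: float = 0.5):
    # Weighted overlay of registered visible and infrared images of the same size.
    return alpha * img_vis + (1.0 - alpha) * img_ir

def feature_level_fusion(feat_vis: torch.Tensor, feat_ir: torch.Tensor):
    # Element-wise addition and channel concatenation of per-branch feature maps.
    added = feat_vis + feat_ir                             # (B, C, H, W)
    concatenated = torch.cat([feat_vis, feat_ir], dim=1)   # (B, 2C, H, W)
    return added, concatenated
```

These fixed, content-independent operations are exactly the "relatively simple" fusion methods criticized above, which motivates the dynamic, learned fusion and selection modules proposed in this work.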
Method
In this study, we introduce a visible and infrared image fusion object detection algorithm that employs dynamic feature selection to address the two issues mentioned above. Overall, we enhance the conventional YOLOv5 detector by modifying its backbone, neck, and detection head. We adopt CSPDarkNet53 as the backbone, with identical structures for the visible and infrared branches. The algorithm incorporates two innovative modules: a dynamic fusion layer and a dynamic selection layer. The dynamic fusion layer is embedded in the backbone network and utilizes the Transformer structure to fuse the multi-source image feature maps at multiple stages, enriching the feature representation. The dynamic selection layer is embedded in the neck network and applies three attention mechanisms (i.e., scale, spatial, and channel) to enhance the multi-scale feature maps and select useful features; these mechanisms are implemented with SENet and deformable convolutions. Following standard practice in object detection, we use the YOLOv5 detection head to generate detection results. The loss function used for training is the sum of the bounding box regression loss, classification loss, and confidence loss, implemented with the generalized intersection over union, cross-entropy, and squared-error functions, respectively.
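The full layout of the dynamic selection layer is given in the paper itself; the sketch below only illustrates, for a single pyramid level, how the three kinds of attention could be combined, with the channel branch following the SENet squeeze-and-excitation pattern and the spatial branch using torchvision's deformable convolution. All module names, the reduction ratio, and the kernel size are assumptions.

```python
# Illustrative sketch of scale / spatial / channel attention on one feature level.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DynamicSelection(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention (SENet-style squeeze-and-excitation).
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        # Spatial attention via a 3 x 3 deformable convolution whose offsets are
        # predicted from the input feature map (2 offsets per kernel position).
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(channels, channels, 3, padding=1)
        # Scale attention: one scalar weight for this pyramid level.
        self.scale = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                   nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.se(x)                  # re-weight channels
        x = self.deform(x, self.offset(x))  # sample informative spatial locations
        return x * self.scale(x)            # weight the whole pyramid level
```

In the neck, one such module would be applied to each of the multi-scale feature maps before they are passed to the detection head.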
Result
In this study, we validate the proposed algorithm through experiments on three publicly available datasets: FLIR, the visible-infrared paired dataset for low-light vision (LLVIP), and vehicle detection in aerial imagery (VEDAI). We use the mean average precision (mAP) for evaluation. Compared with the baseline model that fuses features by element-wise addition, our algorithm achieves improvements of 1.3%, 0.6%, and 3.9% in mAP50 and 4.6%, 2.6%, and 7.5% in mAP75 on the respective datasets. It also improves mAP by 3.2%, 2.1%, and 3.1%, effectively reducing missed detections and false alarms. Moreover, we conduct ablation experiments on the two innovative modules, the dynamic fusion layer and the dynamic selection layer. The complete model, which incorporates both layers, achieves the best performance on all three test datasets, validating the effectiveness of the proposed algorithm. We also compare model size and computational efficiency against state-of-the-art algorithms; the experiments show that our algorithm significantly improves detection performance with only a slight increase in parameters and computation. Furthermore, we visualize the attention weight matrices of the three dynamic fusion layers in the backbone to better reveal their mechanism. The visual analysis confirms that the dynamic fusion layer effectively integrates the feature information from the visible and infrared images.
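The abstract does not state how the metrics are computed; assuming a standard COCO-style evaluation with ground truth and detections exported in COCO JSON format (the file names below are placeholders), mAP, mAP50, and mAP75 can be obtained with pycocotools as follows.

```python
# COCO-style evaluation sketch; annotation/detection file names are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations.json")            # ground-truth boxes (COCO format)
coco_dt = coco_gt.loadRes("detections.json")  # detector outputs (COCO format)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# stats[0] = mAP (IoU 0.50:0.95), stats[1] = mAP50, stats[2] = mAP75
map_all, map_50, map_75 = evaluator.stats[:3]
print(f"mAP={map_all:.3f}  mAP50={map_50:.3f}  mAP75={map_75:.3f}")
```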
Conclusion
In this study, we propose a visible and infrared image fusion object detection algorithm based on a dynamic feature selection strategy. The algorithm incorporates two innovative modules: a dynamic fusion layer and a dynamic selection layer. Extensive experiments demonstrate that the algorithm effectively integrates feature information from the visible and infrared image modalities and enhances object detection performance. However, the proposed algorithm slightly increases computational complexity and requires the input visible and infrared images to be pre-registered, which limits some application scenarios. Lightweight fusion modules and algorithms capable of processing unregistered visible-infrared image pairs will be the focus of future research on multimodal fusion object detection.
infrared image; object detection; attention mechanism; feature fusion; deep neural network
Bie Q, Wang X, Xu X, Zhao Q J, Wang Z, Chen J and Hu R M. 2023. Visible-infrared cross-modal pedestrian detection: a summary. Journal of Image and Graphics, 28(5): 1287-1307 [DOI: 10.11834/jig.220670]
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A and Zagoruyko S. 2020. End-to-end object detection with Transformers//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 213-229 [DOI: 10.1007/978-3-030-58452-8_13]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale [EB/OL]. [2023-07-19]. https://arxiv.org/pdf/2010.11929v1.pdf
Fang Q Y, Han D P and Wang Z K. 2022. Cross-modality fusion Transformer for multispectral object detection [EB/OL]. [2023-07-19]. https://arxiv.org/pdf/2111.00273.pdf
Fang Q Y and Wang Z K. 2022. Cross-modality attentive feature fusion for object detection in multispectral remote sensing imagery. Pattern Recognition, 130: #108786 [DOI: 10.1016/j.patcog.2022.108786]
Fu H L, Wang S X, Duan P H, Xiao C Y, Dian R W, Li S T and Li Z Y. 2023. LRAF-Net: long-range attention fusion network for visible-infrared object detection. IEEE Transactions on Neural Networks and Learning Systems: #3266452 [DOI: 10.1109/TNNLS.2023.3266452]
Guo M H, Cai J X, Liu Z N, Mu T J, Martin R R and Hu S M. 2021. PCT: point cloud Transformer. Computational Visual Media, 7(2): 187-199 [DOI: 10.1007/s41095-021-0229-5]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7132-7141 [DOI: 10.1109/CVPR.2018.00745]
Jia X Y, Zhu C, Li M Z, Tang W Q and Zhou W L. 2021. LLVIP: a visible-infrared paired dataset for low-light vision//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops. Montreal, Canada: IEEE: 3489-3497 [DOI: 10.1109/ICCVW54120.2021.00389]
Li C Y, Song D, Tong R F and Tang M. 2018. Multispectral pedestrian detection via simultaneous detection and segmentation [EB/OL]. [2023-07-19]. http://arxiv.org/pdf/1808.04818.pdf
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 936-944 [DOI: 10.1109/CVPR.2017.106]
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P and Zitnick C L. 2014. Microsoft COCO: common objects in context//Proceedings of the 13th European Conference on Computer Vision. Zürich, Switzerland: Springer: 740-755 [DOI: 10.1007/978-3-319-10602-1_48]
Liu S, Qi L, Qin H F, Shi J P and Jia J Y. 2018. Path aggregation network for instance segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8759-8768 [DOI: 10.1109/CVPR.2018.00913]
Liu X P, Ma Y X, Xu K, Wan J W and Guo Y L. 2022. Multi-scale Transformer based point cloud completion network. Journal of Image and Graphics, 27(2): 538-549 [DOI: 10.11834/jig.210510]
Qiu D F, Hu X Y, Liang P W, Liu X M and Jiang J J. 2023. A deep progressive infrared and visible image fusion network. Journal of Image and Graphics, 28(1): 156-165 [DOI: 10.11834/jig.220319]
Razakarivony S and Jurie F. 2016. Vehicle detection in aerial imagery: a small target detection benchmark. Journal of Visual Communication and Image Representation, 34: 187-203 [DOI: 10.1016/j.jvcir.2015.11.002]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Saga F's team. Free FLIR thermal dataset for algorithm training [DB/OL]. [2023-07-19]. https://www.flir.com/oem/adas/adas-dataset-form/
Sun X H, Guan Z and Wang X. 2023. Vision Transformer for fusing infrared and visible images in groups. Journal of Image and Graphics, 28(1): 166-178 [DOI: 10.11834/jig.220515]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang Y, Guizilini V C, Zhang T Y, Wang Y L, Zhao H and Solomon J. 2022. DETR3D: 3D object detection from multi-view images via 3D-to-2D queries//Proceedings of the 5th Conference on Robot Learning. London, UK: [s.n.]: 180-191
Zhang H, Fromont E, Lefevre S and Avignon B. 2021. Guided attentive feature fusion for multispectral pedestrian detection//Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 72-80 [DOI: 10.1109/WACV48630.2021.00012]
Zhang L, Liu Z Y, Zhang S F, Yang X, Qiao H, Huang K Z and Hussain A. 2019. Cross-modality interactive attention network for multispectral pedestrian detection. Information Fusion, 50: 20-29 [DOI: 10.1016/j.inffus.2018.09.015]
Zhu X Z, Hu H, Lin S and Dai J F. 2019. Deformable ConvNets V2: more deformable, better results//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9300-9308 [DOI: 10.1109/CVPR.2019.00953]