Multi-object tracking using adaptive-IoU loss and hierarchical association
2024, Vol. 29, No. 7, Pages 1970-1983
Print publication date: 2024-07-16
DOI: 10.11834/jig.230390
Guo Wen, Liu Qigui, Ding Xinmiao. 2024. Multi-object tracking using adaptive-IoU loss and hierarchical association. Journal of Image and Graphics, 29(07):1970-1983
Objective
To address identity switches caused by ambiguous pedestrian features and the loss of tracking accuracy caused by occlusion between objects in complex scenes, we propose the AIoU-Tracker multi-object tracking algorithm.
Method
First, we design a dedicated AIoU (adaptive intersection over union) regression loss function for the detection head of the backbone network. It evaluates bounding boxes in terms of overlap area, center-point distance, and aspect ratio, which mitigates the identity switches caused by weakly discriminative, ambiguous pedestrian features. Second, we propose a simple and effective hierarchical association strategy: after high-score and low-score detection boxes are associated separately, the embedding information around the detection boxes that failed to associate is exploited for a further round of association, which improves association accuracy under occlusion.
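The abstract does not specify the exact form of the AIoU loss, but since it penalizes overlap area, center-point distance, and aspect ratio, a minimal sketch in the spirit of the DIoU/CIoU family (Zheng et al., 2020) could look like the following; the box format, the epsilon constants, and the trade-off weight alpha are illustrative assumptions rather than the paper's definition.

import math
import torch

def aiou_style_loss(pred, target, eps=1e-7):
    # pred, target: (N, 4) tensors of boxes in (x1, y1, x2, y2) format.
    # 1) Overlap-area term: standard IoU.
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # 2) Center-point distance term, normalized by the squared diagonal of
    #    the smallest box enclosing both the prediction and the target.
    center_dist = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 \
                + ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # 3) Aspect-ratio consistency term (same form as in CIoU).
    wp = (pred[:, 2] - pred[:, 0]).clamp(min=eps)
    hp = (pred[:, 3] - pred[:, 1]).clamp(min=eps)
    wt = (target[:, 2] - target[:, 0]).clamp(min=eps)
    ht = (target[:, 3] - target[:, 1]).clamp(min=eps)
    v = (4.0 / math.pi ** 2) * (torch.atan(wt / ht) - torch.atan(wp / hp)) ** 2
    with torch.no_grad():
        alpha = v / (1.0 - iou + v + eps)  # hypothetical trade-off weight

    return (1.0 - iou + center_dist / diag + alpha * v).mean()

Intuitively, the IoU term rewards overlap, the distance term keeps pulling the centers together even when the boxes do not overlap, and the aspect-ratio term discourages boxes with a distorted shape.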
Result
In a series of comparative experiments on the MOT16 dataset, the proposed AIoU-Tracker raises HOTA (higher order tracking accuracy) from 58.3% to 59.8%, IDF1 (ID F1 score) from 72.6% to 73.1%, and MOTA (multi-object tracking accuracy) from 69.3% to 74.4% relative to FairMOT; on the MOT17 dataset, HOTA rises from 59.3% to 59.9% and IDF1 from 72.3% to 72.9%.
Conclusion
The proposed feature-balancing tracking method achieves a better balance among the bounding-box size, heat-map, and center-point offset features during training and testing, which makes the multi-object tracking results more accurate.
Objective
Multiple object tracking (MOT) is a mainstream task in computer vision. It aims to estimate the tracklets of multiple objects in videos and has important applications in autonomous driving, human-computer interaction, and human activity recognition. A large number of methods focus on improving tracking performance on top of given detection results, often with the help of re-identification (Re-ID) features. Re-ID-based trackers fall into two categories: separate detection and embedding (SDE) models and joint detection and embedding (JDE) models. An SDE tracker optimizes the detection model and the Re-ID model separately, which prevents it from running in real time. A JDE tracker performs object detection while simultaneously outputting the object locations and appearance embeddings needed for the subsequent association step, and therefore runs considerably faster. However, JDE trackers still suffer from identity switches caused by ambiguous pedestrian features and from degraded tracking accuracy caused by occlusion between objects in complex scenes. We propose an adaptive intersection-over-union (AIoU)-Tracker multi-object tracking algorithm to address these issues.
Method
First, we use the detection head of the backbone network to design a dedicated AIoU regression loss function that measures bounding boxes in terms of overlap area, center-point distance, and aspect ratio. This loss alleviates the identity switches caused by ambiguous pedestrian features. Second, we propose a simple and effective hierarchical association method: high-score and low-score detection boxes are first associated separately, and the embedding information around the detection boxes that failed to associate is then exploited for a further Re-ID-based round of matching, which improves the association accuracy of multi-object tracking under occlusion.
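The three-level matching just described can be sketched as follows; the score threshold, the matching thresholds, the dictionary-style detections, and the three cost callables (an appearance-embedding cost, an IoU cost, and a cost built from embeddings sampled around unmatched boxes) are illustrative assumptions, not the paper's exact implementation.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match(cost, max_cost):
    # Hungarian matching on a 2-D cost matrix; pairs whose cost exceeds
    # max_cost are rejected. Returns matched index pairs plus the
    # unmatched row (track) and column (detection) indices.
    if cost.size == 0:
        return [], list(range(cost.shape[0])), list(range(cost.shape[1]))
    rows, cols = linear_sum_assignment(cost)
    pairs = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    used_r = {r for r, _ in pairs}
    used_c = {c for _, c in pairs}
    unmatched_r = [r for r in range(cost.shape[0]) if r not in used_r]
    unmatched_c = [c for c in range(cost.shape[1]) if c not in used_c]
    return pairs, unmatched_r, unmatched_c

def hierarchical_associate(tracks, detections, emb_cost, iou_cost,
                           neighbor_emb_cost, score_thr=0.6,
                           thr1=0.4, thr2=0.5, thr3=0.4):
    # Level 1: tracks vs. high-score detections, appearance-based cost.
    high = [d for d in detections if d["score"] >= score_thr]
    low = [d for d in detections if d["score"] < score_thr]
    level1, un_tracks, un_high = match(emb_cost(tracks, high), thr1)

    # Level 2: tracks left over from level 1 vs. low-score detections,
    # using IoU only (their appearance is usually unreliable).
    rem_tracks = [tracks[i] for i in un_tracks]
    level2, un_tracks2, un_low = match(iou_cost(rem_tracks, low), thr2)

    # Level 3: detections that failed both levels are compared with the
    # remaining tracks using embeddings sampled around their boxes.
    rem_tracks2 = [rem_tracks[i] for i in un_tracks2]
    rem_dets = [high[j] for j in un_high] + [low[j] for j in un_low]
    level3, _, _ = match(neighbor_emb_cost(rem_tracks2, rem_dets), thr3)

    # Indices in level2/level3 are local to each level's candidate lists.
    return level1, level2, level3

In practice, the cost matrices would typically be built from the predicted track positions and the Re-ID embeddings produced by the JDE network, with each cost function returning a NumPy array of shape (number of tracks, number of detections).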
We utilize a variant of the DLA-34 network as the backbone and initialize it with model parameters trained on the common objects in context (COCO) dataset. The experiments are conducted on Ubuntu 16.04 with 64 GB of memory and a GTX2080Ti GPU, with CUDA 10.2. We train the model using the Adam optimizer for 30 epochs with an initial learning rate of 10^-4, decayed to 10^-5 after 20 epochs, and a batch size of 16. Standard data augmentation, including rotation, scaling, and color jittering, is applied. The input images are resized to 1 088 × 608 pixels, and the feature map resolution is 272 × 152 pixels. We evaluate our approach on the MOT Challenge benchmark, specifically the MOT16 and MOT17 datasets. Training draws on several datasets: CrowdHuman and the MIX dataset (ETH, CityPersons, CUHK-SYSU, Caltech, and PRW). The ETH and CityPersons datasets provide only bounding box annotations, so we train only the detection branch on them. The Caltech, MOT17, CUHK-SYSU, and PRW datasets provide both bounding box positions and identity annotations, allowing both branches to be trained. To ensure a fair comparison, we remove the videos that overlap between the ETH dataset and the MOT17 test set. The CrowdHuman dataset contains only bounding box annotations, so we perform self-supervised training on it. To evaluate tracking performance, we use several well-defined metrics, including higher order tracking accuracy (HOTA), multi-object tracking accuracy (MOTA), ID F1 score (IDF1), false positives, false negatives, and the number of identity switches (IDs). MOTA primarily assesses the detection branch, IDF1 evaluates identity preservation and thus focuses on association performance, and HOTA provides a comprehensive evaluation of both detection and data association.
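For reference, the standard definitions of these metrics in the MOT literature (they are not specific to this paper) are:

\[
\mathrm{MOTA} = 1 - \frac{\sum_t\left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}},
\qquad
\mathrm{HOTA}_{\alpha} = \sqrt{\mathrm{DetA}_{\alpha}\cdot\mathrm{AssA}_{\alpha}},
\]

where FN, FP, IDSW, and GT are counted per frame t; IDTP, IDFP, and IDFN are identity-level true positives, false positives, and false negatives; and HOTA is the geometric mean of detection accuracy (DetA) and association accuracy (AssA), averaged over localization thresholds α.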
Result
The performance of our method is compared with that of existing methods on two datasets. 1) On the MOT16 dataset, our method achieves a HOTA value of 59.8%, an improvement of 1.5% over FairMOT; a MOTA value of 74.4%, an improvement of 5.1% over FairMOT; and an IDF1 value of 73.1%, an improvement of 0.5% over FairMOT. 2) On the MOT17 dataset, the HOTA value is 59.9%, an improvement of 0.6% over FairMOT, and the IDF1 value is 72.9%, an improvement of 0.6% over FairMOT. In addition, we conduct ablation studies on the MOT17 dataset to verify the effectiveness of the different components of our method; these studies show that the proposed method clearly outperforms competing multi-object trackers. In the ablation studies, the added AIoU regression loss function reduces the number of identity switches. We also visualize the predicted Re-ID feature extraction positions, the bounding-box size feature, the heat-map feature, and the center-point offset feature; the visualizations show that our method is more robust than FairMOT. Moreover, the hierarchical association method makes association more robust: for example, an object that has been occluded for two frames can still be re-associated with its original ID.
Conclusion
The proposed feature-balancing tracking method achieves a better balance among the bounding-box size feature, heat-map feature, and center-point offset feature during training and testing, which leads to more accurate multi-object tracking. In this study, we propose two improvements to the FairMOT framework. First, we design an AIoU regression loss module to optimize the detection branch, enabling it to optimize targets based on the current optimal distance and to extract more accurate appearance features. Second, we optimize the Re-ID branch with a hierarchical association strategy module that uses three-level matching to enhance the tracker's association performance. Experimental results demonstrate significant improvements on the MOT17 dataset, with HOTA increasing to 59.9%, IDF1 to 72.9%, and MOTA to 70.8%. However, a competition issue remains between the detection and Re-ID branches of the JDE tracking model, which can lower MOTA. Future research will focus on investigating this competition in the JDE tracking model.
Keywords: multi-object tracking (MOT); data association; regression loss; feature balance; hierarchical association method
Bewley A, Ge Z Y, Ott L, Ramos F and Upcroft B. 2016. Simple online and realtime tracking//Proceedings of 2016 IEEE International Conference on Image Processing (ICIP). Phoenix, USA: IEEE: 3464-3468 [DOI: 10.1109/ICIP.2016.7533003]
Bochinski E, Eiselein V and Sikora T. 2017. High-speed tracking-by-detection without using image information//Proceedings of the 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Lecce, Italy: IEEE: 1-6 [DOI: 10.1109/AVSS.2017.8078516]
Cai J R, Xu M Z, Li W, Xiong Y J, Xia W, Tu Z W and Soatto S. 2022. MeMOT: multi-object tracking with memory//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA: IEEE: 8080-8090 [DOI: 10.1109/CVPR52688.2022.00792]
Chan S X, Jia Y W, Zhou X L, Bai C, Chen S Y and Zhang X Q. 2022. Online multiple object tracking using joint detection and embedding network. Pattern Recognition, 130: #108793 [DOI: 10.1016/j.patcog.2022.108793]
Dollár P, Wojek C, Schiele B and Perona P. 2009. Pedestrian detection: a benchmark//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 304-311 [DOI: 10.1109/CVPR.2009.5206631]
Duan K W, Bai S, Xie L X, Qi H G, Huang Q M and Tian Q. 2019. CenterNet: keypoint triplets for object detection//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 6568-6577 [DOI: 10.1109/ICCV.2019.00667]
Ess A, Leibe B, Schindler K and van Gool L. 2008. A mobile vision system for robust multi-person tracking//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, USA: IEEE: 1-8 [DOI: 10.1109/CVPR.2008.4587581]
Han S, Huang P, Wang H, Yu E, Liu D and Pan X. 2022. MAT: motion-aware multi-object tracking. Neurocomputing, 476: 75-86 [DOI: 10.1016/j.neucom.2021.12.104]
He K M, Gkioxari G, Dollár P and Girshick R. 2017. Mask R-CNN//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 2980-2988 [DOI: 10.1109/ICCV.2017.322]
Huang G, Liu S C, van der Maaten L and Weinberger K Q. 2018. CondenseNet: an efficient DenseNet using learned group convolutions//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2752-2761 [DOI: 10.1109/CVPR.2018.00291]
Li W, Zhao R, Xiao T and Wang X G. 2014. DeepReID: deep filter pairing neural network for person re-identification//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 152-159 [DOI: 10.1109/CVPR.2014.27]
Liang C, Zhang Z P, Zhou X, Li B, Zhu S Y and Hu W M. 2022. Rethinking the competition between detection and ReID in multiobject tracking. IEEE Transactions on Image Processing, 31: 3182-3196 [DOI: 10.1109/TIP.2022.3165376]
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 936-944 [DOI: 10.1109/CVPR.2017.106]
Luo W H, Xing J L, Milan A, Zhang X Q, Liu W and Kim T K. 2021. Multiple object tracking: a literature review. Artificial Intelligence, 293: #103448 [DOI: 10.1016/j.artint.2020.103448]
Milan A, Leal-Taixé L, Reid I, Roth S and Schindler K. 2016. MOT16: a benchmark for multi-object tracking [EB/OL]. [2023-11-01]. https://arxiv.org/pdf/1603.00831.pdf
Park Y, Dang L M, Lee S, Han D and Moon H. 2021. Multiple object tracking in deep learning approaches: a survey. Electronics, 10(19): #2406 [DOI: 10.3390/electronics10192406]
Peng J L, Wang C G, Wan F B, Wu Y, Wang Y B, Tai Y, Wang C J, Li J L, Huang F Y and Fu Y W. 2020. Chained-tracker: chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 145-161 [DOI: 10.1007/978-3-030-58548-8_9]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I and Savarese S. 2019. Generalized intersection over union: a metric and a loss for bounding box regression//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 658-666 [DOI: 10.1109/CVPR.2019.00075]
Shan C B, Wei C B, Deng B, Huang J Q, Hua X S, Cheng X L and Liang K W. 2020. Tracklets predicting based adaptive graph tracking [EB/OL]. [2023-11-01]. https://arxiv.org/pdf/2010.09015.pdf
Shao S, Zhao Z J, Li B X, Xiao T T, Yu G, Zhang X Y and Sun J. 2018. CrowdHuman: a benchmark for detecting human in a crowd [EB/OL]. [2023-11-01]. https://arxiv.org/pdf/1805.00123.pdf
Sun P Z, Cao J K, Jiang Y, Zhang R F, Xie E Z, Yuan Z H, Wang C H and Luo P. 2021a. TransTrack: multiple object tracking with Transformer [EB/OL]. [2023-11-01]. https://arxiv.org/pdf/2012.15460.pdf
Tokmakov P, Li J, Burgard W and Gaidon A. 2021. Learning to track with object permanence//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 10840-10849 [DOI: 10.1109/ICCV48922.2021.01068]
Voigtlaender P, Krause M, Osep A, Luiten J, Sekar B B G, Geiger A and Leibe B. 2019. MOTS: multi-object tracking and segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 7934-7943 [DOI: 10.1109/CVPR.2019.00813]
Wang Y X, Kitani K and Weng X S. 2021. Joint object detection and multi-object tracking with graph neural networks//Proceedings of 2021 IEEE International Conference on Robotics and Automation (ICRA). Xi’an, China: IEEE: 13708-13715 [DOI: 10.1109/ICRA48506.2021.9561110]
Wang Z D, Zheng L, Liu Y X, Li Y L and Wang S J. 2020. Towards real-time multi-object tracking//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 107-122 [DOI: 10.1007/978-3-030-58621-8_7]
Wojke N, Bewley A and Paulus D. 2017. Simple online and realtime tracking with a deep association metric//Proceedings of 2017 IEEE International Conference on Image Processing (ICIP). Beijing, China: IEEE: 3645-3649 [DOI: 10.1109/ICIP.2017.8296962]
Xiao T, Li S, Wang B C, Lin L and Wang X G. 2017. Joint detection and identification feature learning for person search//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 3376-3385 [DOI: 10.1109/CVPR.2017.360]
Xu Y H, Ban Y T, Delorme G, Gan C, Rus D and Alameda-Pineda X. 2023. TransCenter: Transformers with dense representations for multiple-object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6): 7820-7835 [DOI: 10.1109/TPAMI.2022.3225078]
Yang F, Chang X, Sakti S, Wu Y and Nakamura S. 2021. ReMOT: a model-agnostic refinement for multiple object tracking. Image and Vision Computing, 106: #104091 [DOI: 10.1016/j.imavis.2020.104091]
Yue Y Y, Xu D, He K J and Zhang H. 2023. An adaptive occlusion-aware multiple targets tracking algorithm for low viewpoint. Journal of Image and Graphics, 28(2): 441-457 [DOI: 10.11834/jig.210853]
Zeng F G, Dong B, Zhang Y A, Wang T C, Zhang X Y and Wei Y C. 2022. MOTR: end-to-end multiple-object tracking with Transformer//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 659-675 [DOI: 10.1007/978-3-031-19812-0_38]
Zhang S S, Benenson R and Schiele B. 2017. CityPersons: a diverse dataset for pedestrian detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 4457-4465 [DOI: 10.1109/CVPR.2017.474]
Zhang Y F, Sun P Z, Jiang Y, Yu D D, Weng F C, Yuan Z H, Luo P, Liu W Y and Wang X G. 2022. ByteTrack: multi-object tracking by associating every detection box//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 1-21 [DOI: 10.1007/978-3-031-20047-2_1]
Zhang Y F, Wang C Y, Wang X G, Zeng W J and Liu W Y. 2021. FairMOT: on the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129(11): 3069-3087 [DOI: 10.1007/s11263-021-01513-4]
Zheng L, Zhang H H, Sun S Y, Chandraker M, Yang Y and Tian Q. 2017. Person re-identification in the wild//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 3346-3355 [DOI: 10.1109/CVPR.2017.357]
Zheng L Y, Tang M, Chen Y Y, Zhu G B, Wang J Q and Lu H Q. 2021. Improving multiple object tracking with single object tracking//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 2453-2462 [DOI: 10.1109/CVPR46437.2021.00248]
Zheng Z H, Wang P, Liu W, Li J Z, Ye R G and Ren D W. 2020. Distance-IoU loss: faster and better learning for bounding box regression//Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 12993-13000 [DOI: 10.1609/aaai.v34i07.6999]
Zhou X Y, Koltun V and Krähenbühl P. 2020. Tracking objects as points//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 474-490 [DOI: 10.1007/978-3-030-58548-8_28]