多帧时空注意力引导的半监督视频分割
Multiframe spatiotemporal attention-guided semisupervised video segmentation
2024, Vol. 29, No. 5, Pages 1233-1251
Print publication date: 2024-05-16
DOI: 10.11834/jig.230606
罗思涵, 袁夏, 梁永顺. 2024. 多帧时空注意力引导的半监督视频分割. 中国图象图形学报, 29(05):1233-1251
Luo Sihan, Yuan Xia, Liang Yongshun. 2024. Multiframe spatiotemporal attention-guided semisupervised video segmentation. Journal of Image and Graphics, 29(05):1233-1251
Objective
Traditional semisupervised video segmentation mostly relies on optical-flow-based methods to model the feature association between key frames and the current frame. However, optical flow is prone to errors caused by occlusion, special textures, and similar conditions, which leads to problems in multiframe fusion. To fuse multiframe features better, this paper extracts the appearance feature information of the first frame and the positional information of the adjacent key frames, and fuses the features through a Transformer and an improved PAN (path aggregation network) module, thereby learning and fusing multiframe features based on multiframe spatiotemporal attention.
Method
The multiframe spatiotemporal attention-guided semisupervised video segmentation method consists of two parts: video preprocessing (i.e., an appearance feature extraction network and a current-frame feature extraction network) and feature fusion based on a Transformer and an improved PAN module. The specific steps are as follows: 1) construct an appearance information feature extraction network to extract the appearance information of the first frame; 2) construct a current-frame feature extraction network and fuse the features of the current frame with those of the first frame through a Transformer module, so that the appearance information of the first frame guides the extraction of the current frame's features; 3) perform local feature matching between the mask maps of several adjacent frames and the feature map of the current frame, and select the frames whose positional information is most strongly correlated with the current frame as adjacent key frames that guide the extraction of the current frame's positional information; 4) fuse deep and shallow semantic information through an improved PAN feature aggregation module. A high-level sketch of this pipeline is given below.
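To make the four-stage pipeline concrete, here is a minimal, hypothetical PyTorch-style sketch of how the stages could be wired together. Module names, channel sizes, and the use of nn.Transformer are illustrative assumptions rather than the authors' released code, and the keyframe-selection step is omitted for brevity.

```python
# Hypothetical wiring of the pipeline described above (not the authors' code).
import torch
import torch.nn as nn

class MultiFrameVOS(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # 1) appearance-feature network for the first (annotated) frame
        self.appearance_encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.BatchNorm2d(dim), nn.SiLU())
        # 2) feature network for the current frame
        self.current_encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.BatchNorm2d(dim), nn.SiLU())
        # Transformer that lets first-frame appearance features guide the current frame
        self.fusion = nn.Transformer(d_model=dim, nhead=8, num_encoder_layers=2,
                                     num_decoder_layers=2, batch_first=True)
        # 4) segmentation head standing in for the improved PAN decoder
        self.head = nn.Conv2d(dim, 1, 1)

    def forward(self, first_frame, current_frame):
        f1 = self.appearance_encoder(first_frame)    # B x C x H x W
        fc = self.current_encoder(current_frame)
        b, c, h, w = fc.shape
        tokens_1 = f1.flatten(2).transpose(1, 2)     # B x HW x C (encoder memory)
        tokens_c = fc.flatten(2).transpose(1, 2)     # B x HW x C (decoder queries)
        fused = self.fusion(tokens_1, tokens_c)      # current frame attends to first frame
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        return self.head(fused)                      # coarse mask logits
```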
Result
On the DAVIS (densely annotated video segmentation)-2016 dataset, the proposed algorithm achieves $J$ and $F$ scores of 81.5% and 80.9%, and on the DAVIS-2017 dataset the scores are 78.4% and 77.9%, all better than the comparison methods. The algorithm runs at 22 frames/s, ranking second in the comparison experiments, 1.6% lower than the PLM (pixel-level matching) algorithm. Competitive results are also obtained on the YouTube-VOS (video object segmentation) dataset, where the average of $J$ and $F$ reaches 71.2%, ahead of the comparison methods.
Conclusion
The multiframe spatiotemporal attention-guided semisupervised video segmentation algorithm can effectively fuse global and local information while segmenting the target object, reducing the loss of detail; it maintains high efficiency while effectively improving the accuracy of semisupervised video segmentation.
Objective
Video object segmentation (VOS) aims to provide high-quality segmentation of target object instances throughout an input video sequence, producing pixel-level masks of the target objects and thereby finely separating the targets from the background. Compared with tasks such as object tracking and detection, which operate at the bounding-box level (selecting targets with rectangular boxes), VOS offers pixel-level accuracy, which is more conducive to locating the target precisely and outlining the details of its edges. Depending on the supervision information provided, VOS can be divided into three scenarios: semisupervised VOS, interactive VOS, and unsupervised VOS. In this study, we focus on the semisupervised task. In semisupervised VOS, a pixel-level annotated mask of the first frame of the video is provided, and subsequent prediction frames can fully utilize this annotated mask to assist in computing their segmentation results. With the development of deep neural network technology, current semisupervised VOS methods are mostly based on deep learning. These methods can be divided into three categories: detection-, matching-, and propagation-based methods. Detection-based segmentation algorithms treat VOS as an image object segmentation task without considering the temporal association of the video, assuming that only a strong frame-level object detector and segmenter is needed to segment the target frame by frame. Matching-based works typically segment video objects by calculating pixel-level matching scores or semantic feature matching scores between the template frame and the current prediction frame. Propagation-based methods propagate the multiframe feature information preceding the prediction frame to the prediction frame and calculate the correlation between the prediction-frame features and the previous-frame features to represent video context information. This context information locates the key areas of the entire video and can guide single-frame image segmentation. Motion-based propagation methods come in two types: one introduces optical flow to train the VOS model, and the other learns deep target features from the previous frame's target mask and refines the target mask in the current frame. Existing semisupervised video segmentation mostly relies on optical flow to model the feature association between key frames and the current frame. However, optical flow is prone to errors under occlusion, special textures, and other conditions, leading to problems in multiframe fusion. To integrate multiframe features better, this study extracts the appearance feature information of the first frame and the positional information of the adjacent keyframes and fuses these features through a Transformer and an improved path aggregation network (PAN) module, thereby learning and integrating features based on multiframe spatiotemporal attention.
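As background for the matching-based category mentioned above, the sketch below shows one generic way pixel-level matching scores between a template frame and the current frame can be computed with cosine similarity. The function name, shapes, and the max-over-foreground scoring rule are illustrative assumptions, not a specific published model.

```python
# Illustrative pixel-level matching between template-frame and current-frame
# features (generic cosine-similarity matching; assumed, not a specific method).
import torch
import torch.nn.functional as F

def matching_scores(template_feat, current_feat, template_mask):
    """template_feat, current_feat: B x C x H x W feature maps (same size assumed);
    template_mask: B x 1 x H x W binary mask of the annotated object."""
    B, C, H, W = template_feat.shape
    t = F.normalize(template_feat.flatten(2), dim=1)      # B x C x HW
    c = F.normalize(current_feat.flatten(2), dim=1)       # B x C x HW
    sim = torch.bmm(c.transpose(1, 2), t)                 # B x HW_cur x HW_tmp
    fg = template_mask.flatten(2).squeeze(1).bool()       # B x HW_tmp
    # score of each current-frame pixel = best similarity to any foreground pixel
    sim_fg = sim.masked_fill(~fg.unsqueeze(1), float('-inf'))
    scores = sim_fg.max(dim=2).values                     # B x HW_cur
    return scores.view(B, 1, H, W)
```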
Method
In this study, we propose a semisupervised VOS method that fuses features with a Transformer mechanism, integrating multiframe appearance feature information and positional feature information. Specifically, the algorithm is divided into the following steps. 1) Appearance information feature extraction network: first, we construct an appearance information feature extraction network. This module is modified from CSPDarknet53 and consists of CBS (convolution, batch normalization, and SiLU) modules, cross stage partial residual network (CSPRes) modules, residual spatial pyramid pooling (ResSPP) modules, and receptive field enhancement and pyramid pooling (REP) modules. The first frame of the video serves as the input and is passed through three CBS modules to obtain the shallow features $\boldsymbol{F}_s$. These features are then processed through six CSPRes modules, followed by a ResSPP module, and finally another CBS module to produce the output $\boldsymbol{F}_d$, representing the appearance information extracted from the first frame of the video. 2) Current frame feature extraction network: we then build a network to extract features from the current frame. This network comprises three cascaded CBS modules, which extract the current frame's feature information. Simultaneously, the Transformer feature fusion module, which consists of an encoder and a decoder, merges the features of the current frame with those of the first frame so that the appearance information from the first frame guides the extraction of feature information from the current frame. 3) Local feature matching: with the aid of the mask maps from several adjacent frames and the feature map of the current frame, local feature matching is performed. This process identifies the frames whose positional information correlates strongly with the current frame and treats them as nearby keyframes, which are then used to guide the extraction of positional information from the current frame. 4) Enhanced PAN feature aggregation module: finally, the input feature maps are passed through a spatial pyramid pooling (SPP) module that contains max-pooling layers of sizes $3 \times 3$, $5 \times 5$, and $9 \times 9$. The improved PAN structure then fuses the features across different layers: the feature maps undergo a concatenation operation, which integrates deep semantic information with shallow semantic information. By integrating these steps, the proposed method aims to improve the accuracy and robustness of VOS. Minimal sketches of the CBS block, the keyframe selection in step 3), and the SPP-based fusion in step 4) are given below.
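A minimal sketch of the CBS building block named in step 1), assuming the standard convolution + batch normalization + SiLU composition stated in the text; the kernel size and stride are illustrative.

```python
# CBS block: convolution + batch normalization + SiLU (kernel/stride are illustrative).
import torch.nn as nn

def cbs(in_ch, out_ch, k=3, s=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=s, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(inplace=True),
    )
```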
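For step 3), a hedged sketch of one way the neighbouring keyframes could be chosen: the masked features of each candidate past frame are correlated with the current-frame features, and the top-scoring frames are kept. The scoring rule and function name here are assumptions, not the paper's exact criterion.

```python
# Assumed frame-selection heuristic for step 3 (not the paper's exact rule).
import torch
import torch.nn.functional as F

def select_keyframes(current_feat, past_feats, past_masks, k=2):
    """current_feat: C x H x W; past_feats: N x C x H x W; past_masks: N x 1 x H x W."""
    cur = F.normalize(current_feat.flatten(1), dim=0)      # C x HW, unit-norm per position
    scores = []
    for feat, mask in zip(past_feats, past_masks):
        masked = feat * mask                               # keep the object region only
        ref = F.normalize(masked.flatten(1), dim=0)        # C x HW
        # global correlation between the current frame and the masked past frame
        scores.append((cur * ref).sum())
    scores = torch.stack(scores)                           # N
    topk = scores.topk(min(k, len(scores))).indices        # indices of the chosen keyframes
    return topk, scores
```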
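For step 4), a minimal sketch of an SPP module with the stated $3 \times 3$, $5 \times 5$, and $9 \times 9$ max-pooling layers, followed by a PAN-style concatenation of deep and shallow features. Channel sizes and the 1x1 reduction convolution are illustrative assumptions.

```python
# SPP (3x3 / 5x5 / 9x9 max pooling) plus concatenation of deep and shallow features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPP(nn.Module):
    def __init__(self, kernel_sizes=(3, 5, 9)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes)

    def forward(self, x):
        # concatenate the input with its multi-scale pooled versions
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)

def fuse_deep_shallow(deep, shallow, reduce_conv):
    """Upsample deep semantic features and concatenate them with shallow features."""
    deep_up = F.interpolate(deep, size=shallow.shape[-2:], mode='nearest')
    return reduce_conv(torch.cat([deep_up, shallow], dim=1))

# usage sketch (shapes and channel counts are illustrative)
spp = SPP()
deep = torch.randn(1, 256, 16, 16)
shallow = torch.randn(1, 128, 32, 32)
reduce_conv = nn.Conv2d(256 * 4 + 128, 256, 1)   # SPP outputs 4x256 channels here
fused = fuse_deep_shallow(spp(deep), shallow, reduce_conv)
```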
Result
In the experiments, the proposed method required neither online fine-tuning nor postprocessing. Our algorithm was compared with 10 current mainstream methods on the DAVIS-2016 and DAVIS-2017 datasets and with five methods on the YouTube-VOS dataset. On the DAVIS-2016 dataset, the algorithm achieved a region similarity score $J$ of 81.5% and a contour accuracy score $F$ of 80.9%, an improvement of 1.2% over the best-performing comparison method. On the DAVIS-2017 dataset, the $J$ and $F$ scores reached 78.4% and 77.9%, respectively, an improvement of 1.3% over the best-performing comparison method. The running speed of our algorithm is 22 frames/s, ranking second, 1.6% lower than the pixel-level matching (PLM) algorithm. On the YouTube-VOS dataset, competitive results were also achieved, with the average of the $J$ and $F$ scores reaching 71.2%, surpassing all comparison methods.
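For reference, the region similarity $J$ is the Jaccard index (intersection over union) between the predicted and ground-truth masks, and $F$ is a boundary F-score combining boundary precision and recall. A simplified sketch follows; the official DAVIS toolkit uses a more careful boundary-matching procedure, so this boundary computation is an approximation.

```python
# Simplified versions of the DAVIS-style metrics referenced above.
import numpy as np

def region_similarity(pred, gt):
    """J = |pred AND gt| / |pred OR gt| for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def boundary_f(pred, gt):
    """Simplified F: F-score of boundary pixels (exact-overlap version)."""
    def boundary(m):
        m = m.astype(bool)
        interior = np.zeros_like(m)
        # a pixel is interior if it and all four neighbours are foreground
        interior[1:-1, 1:-1] = (m[1:-1, 1:-1] & m[:-2, 1:-1] & m[2:, 1:-1]
                                & m[1:-1, :-2] & m[1:-1, 2:])
        return m & ~interior
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 or bg.sum() == 0:
        return float(bp.sum() == bg.sum())
    precision = (bp & bg).sum() / bp.sum()
    recall = (bp & bg).sum() / bg.sum()
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```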
Conclusion
The semisupervised video segmentation algorithm guided by multiframe spatiotemporal attention effectively integrates global and local information while segmenting target objects, which reduces the loss of detail. It maintains high efficiency while effectively improving the accuracy of semisupervised video segmentation.
Keywords: video object segmentation (VOS); feature extraction network; appearance feature information; spatiotemporal attention; feature aggregation
Bao L C, Wu B Y and Liu W. 2018. CNN in MRF: video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Salt Lake City, USA: IEEE: 5977-5986 [DOI: 10.1109/CVPR.2018.00626]
Bhat G, Lawin F J, Danelljan M, Robinson A, Felsberg M, Van Gool L and Timofte R. 2020. Learning what to learn for video object segmentation//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 777-794 [DOI: 10.1007/978-3-030-58536-5_46]
Bochkovskiy A, Wang C Y and Liao H Y M. 2020. YOLOv4: optimal speed and accuracy of object detection [EB/OL]. [2023-07-21]. https://arxiv.org/pdf/2004.10934.pdf
Bolme D S, Beveridge J R, Draper B A and Lui Y M. 2010. Visual object tracking using adaptive correlation filters//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 2544-2550 [DOI: 10.1109/CVPR.2010.5539960]
Caelles S, Maninis K K, Pont-Tuset J, Leal-Taixé L, Cremers D and Van Gool L. 2017. One-shot video object segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 5320-5329 [DOI: 10.1109/CVPR.2017.565]
Chen X, Li Z X, Yuan Y, Yu G, Shen J X and Qi D L. 2020. State-aware tracker for real-time video object segmentation//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9381-9390 [DOI: 10.1109/CVPR42600.2020.00940]
Chen Y H, Pont-Tuset J, Montes A and Van Gool L. 2018. Blazingly fast video object segmentation with pixel-wise metric learning//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1189-1198 [DOI: 10.1109/CVPR.2018.00130]
Dosovitskiy A, Fischer P, Ilg E, Häusser P, Hazirbas C, Golkov V, Smagt P V D, Cremers D and Brox T. 2015. FlowNet: learning optical flow with convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 2758-2766 [DOI: 10.1109/ICCV.2015.316]
Fan H, Lin L T, Yang F, Chu P, Deng G, Yu S J, Bai H X, Xu Y, Liao C Y and Ling H B. 2019. LaSOT: a high-quality benchmark for large-scale single object tracking//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5369-5378 [DOI: 10.1109/CVPR.2019.00552]
Girgensohn A, Boreczky J, Chiu P, Doherty J, Foote J, Golovchinsky G, Uchihashi S and Wilcox L. 2000. A semi-automatic approach to home video editing//Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology. San Diego, USA: ACM: 81-89 [DOI: 10.1145/354401.354415]
He K M, Gkioxari G, Dollár P and Girshick R. 2017. Mask R-CNN//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 2980-2988 [DOI: 10.1109/ICCV.2017.322]
He K M, Zhang X Y, Ren S Q and Sun J. 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9): 1904-1916 [DOI: 10.1109/TPAMI.2015.2389824]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Hu Y T, Huang J B and Schwing A G. 2018. VideoMatch: matching based video object segmentation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 56-73 [DOI: 10.1007/978-3-030-01237-3_4]
Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A and Brox T. 2017. FlowNet 2.0: evolution of optical flow estimation with deep networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1647-1655 [DOI: 10.1109/CVPR.2017.179]
Khoreva A, Benenson R, Ilg E, Brox T and Schiele B. 2017. Lucid data dreaming for object tracking. International Journal of Computer Vision, 127(9): 1175-1197 [DOI: 10.1007/s11263-019-01164-6]
LeCun Y, Bottou L, Bengio Y and Haffner P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278-2324 [DOI: 10.1109/5.726791]
Li H, Liu K H, Liu J J and Zhang X Y. 2021. Multitask framework for video object tracking and segmentation combined with multi-scale interframe information. Journal of Image and Graphics, 26(1): 101-112 [DOI: 10.11834/jig.200519]
Li X X and Loy C C. 2018. Video object segmentation with joint re-identification and attention-aware mask propagation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 93-110 [DOI: 10.1007/978-3-030-01219-9_6]
Lin J, Gan C and Han S. 2019. TSM: temporal shift module for efficient video understanding//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 7082-7092 [DOI: 10.1109/ICCV.2019.00718]
Lin T Y, Dollár P, Girshick R, He K, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2117-2125 [DOI: 10.1109/CVPR.2017.106]
Liu S, Qi L, Qin H F, Shi J P and Jia J Y. 2018. Path aggregation network for instance segmentation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 8759-8768 [DOI: 10.1109/CVPR.2018.00913]
Liu Y T, Zhang K H, Fan J Q and Liu Q S. 2022. Spatiotemporal feature fusion network based multi-objects tracking and segmentation. Journal of Image and Graphics, 27(11): 3257-3266 [DOI: 10.11834/jig.210417]
Luiten J, Voigtlaender P and Leibe B. 2019. PReMVOS: proposal-generation, refinement and merging for video object segmentation//Proceedings of the 14th Asian Conference on Computer Vision (ACCV). Perth, Australia: Springer: 565-580 [DOI: 10.1007/978-3-030-20870-7_35]
Maninis K K, Caelles S, Chen Y, Pont-Tuset J, Leal-Taixé L, Cremers D and Van Gool L. 2019. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(6): 1515-1530 [DOI: 10.1109/TPAMI.2018.2838670]
Oh S W, Lee J Y, Sunkavalli K and Kim S J. 2018. Fast video object segmentation by reference-guided mask propagation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7376-7385 [DOI: 10.1109/CVPR.2018.00770]
Oh S W, Lee J Y, Xu N and Kim S J. 2019. Video object segmentation using space-time memory networks//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9225-9234 [DOI: 10.1109/ICCV.2019.00932]
Perazzi F, Khoreva A, Benenson R, Schiele B and Sorkine-Hornung A. 2017. Learning video object segmentation from static images//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 3491-3500 [DOI: 10.1109/CVPR.2017.372]
Perazzi F, Pont-Tuset J, McWilliams B, Van Gool L, Gross M and Sorkine-Hornung A. 2016. A benchmark dataset and evaluation methodology for video object segmentation//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 724-732 [DOI: 10.1109/CVPR.2016.85]
Pont-Tuset J, Perazzi F, Caelles S, Arbeláez P, Sorkine-Hornung A and Van Gool L. 2017. The 2017 DAVIS challenge on video object segmentation [EB/OL]. [2023-07-21]. https://arxiv.org/pdf/1704.00675.pdf
Redmon J and Farhadi A. 2017. YOLO9000: better, faster, stronger//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6517-6525 [DOI: 10.1109/CVPR.2017.690]
Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement [EB/OL]. [2023-07-21]. https://arxiv.org/pdf/1804.02767.pdf
Ren S C, Liu W X, Liu Y T, Chen H X, Han G Q and He S F. 2021. Reciprocal transformations for unsupervised video object segmentation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 15430-15439 [DOI: 10.1109/CVPR46437.2021.01520]
Shin Yoon J, Rameau F, Kim J, Lee S, Shin S and So Kweon I. 2017. Pixel-level matching for video object segmentation using convolutional neural networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2186-2195 [DOI: 10.1109/ICCV.2017.238]
Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition//Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA: ICLR: 1-14 [DOI: 10.48550/arXiv.1409.1556]
Tan M X and Le Q V. 2019. EfficientNet: rethinking model scaling for convolutional neural networks//Proceedings of the 36th International Conference on Machine Learning. Long Beach, USA: PMLR: 6105-6114
Voigtlaender P, Chai Y N, Schroff F, Adam H, Leibe B and Chen L C. 2019. FEELVOS: fast end-to-end embedding learning for video object segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9473-9482 [DOI: 10.1109/CVPR.2019.00971]
Voigtlaender P and Leibe B. 2017. Online adaptation of convolutional neural networks for video object segmentation//Proceedings of the British Machine Vision Conference (BMVC). London, UK: BMVA Press: 1-16 [DOI: 10.5244/C.31.116]
Wang S Y, Hou Z Q, Wang N, Li F C, Pu L and Ma S G. 2021. Video object segmentation algorithm based on adaptive template updating and multi-feature fusion. Opto-Electronic Engineering, 48(10): #210193 [DOI: 10.12086/oee.2021.210193]
Wang Z Q, Xu J, Liu L, Zhu F and Shao L. 2019. RANet: ranking attention network for fast video object segmentation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE: 3977-3986 [DOI: 10.1109/ICCV.2019.00408]
Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany: Springer: 3-19 [DOI: 10.1007/978-3-030-01234-2_1]
Wu C Y, Feichtenhofer C, Fan H Q, He K M, Krahenbuhl P and Girshick R. 2019. Long-term feature banks for detailed video understanding//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 284-293 [DOI: 10.1109/CVPR.2019.00037]
Xiao T, Li S, Wang B C, Lin L and Wang X G. 2017. Joint detection and identification feature learning for person search//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 3376-3385 [DOI: 10.1109/CVPR.2017.360]
Xie H Z, Yao H X, Zhou S C, Zhang S P and Sun W X. 2021. Efficient regional memory network for video object segmentation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 1286-1295 [DOI: 10.1109/CVPR46437.2021.00134]
Xu N, Yang L J, Fan Y C, Yue D C, Liang Y C, Yang J C and Huang T. 2018. YouTube-VOS: a large-scale video object segmentation benchmark [EB/OL]. [2023-07-21]. https://arxiv.org/pdf/1809.03327.pdf
Yang Z X, Wei Y C and Yang Y. 2020. Collaborative video object segmentation by foreground-background integration//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 332-348 [DOI: 10.1007/978-3-030-58558-7_20]
Yu Y, Yuan J L, Mittal G, Li F X and Chen M. 2022. BATMAN: bilateral attention Transformer in motion-appearance neighboring space for video object segmentation//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 612-629 [DOI: 10.1007/978-3-031-19818-2_35]
Zhou T F, Wang S Z, Zhou Y, Yao Y Z, Li J W and Shao L. 2020. Motion-attentive transition for zero-shot video object segmentation//Proceedings of the AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 13066-13073 [DOI: 10.1609/aaai.v34i07.7008]
Zhou Z K, Mao K G, Pei W J, Wang H P, Wang Y W and He Z Y. 2023. Reliability-hierarchical memory network for scribble-supervised video object segmentation [EB/OL]. [2023-09-07]. https://arxiv.org/pdf/2303.14384.pdf
Zolfaghari M, Singh K and Brox T. 2018. ECO: efficient convolutional network for online video understanding//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 713-730 [DOI: 10.1007/978-3-030-01216-8_43]