Multiscale feature fusion and cross-guidance for few-shot semantic segmentation
2024, Vol. 29, No. 5: 1265-1276
Print publication date: 2024-05-16
DOI: 10.11834/jig.230550
Guo Jing, Wang Fei. 2024. Multiscale feature fusion and cross-guidance for few-shot semantic segmentation. Journal of Image and Graphics, 29(05):1265-1276
Objective
Few-shot semantic segmentation is a fundamental and challenging task in computer vision. It aims to use a small number of annotated support samples to guide the segmentation of unknown objects in a query image. Compared with traditional semantic segmentation, few-shot methods alleviate two problems that limit the practical application of segmentation technology: the high cost of per-pixel annotation and the weak generalization of trained models to novel classes. Existing few-shot semantic segmentation methods mainly adopt a meta-learning architecture with a dual-branch network, in which the support branch receives the support images and their per-pixel ground-truth masks, the query branch receives the new image to be segmented, and both branches share the same semantic classes. Valuable information extracted from the support images guides the segmentation of unknown novel classes in the query images. However, instances of the same semantic class may vary in appearance and scale, so information extracted solely from the support branch is often insufficient to guide query segmentation. Although some researchers have attempted to improve performance through bidirectional guidance, existing bidirectional guidance models rely heavily on the pseudo masks predicted by the query branch at an intermediate stage: if the initial query predictions are poor, the shared semantics generalize weakly, which hurts segmentation performance.
Method
A multiscale feature fusion and cross-guidance network for few-shot semantic segmentation is proposed to alleviate these problems by constructing information interaction between the support and query branches. First, a pair of pretrained backbone networks with shared weights is used as feature extractors to map the support and query images into the same deep feature space, and the low-level, intermediate-level, and high-level features they output are fused at multiple scales into a multiscale feature set, which enriches the semantic information of the features and enhances the reliability of the feature representation. Second, with the help of the ground-truth mask of the support branch, the fused support features are decomposed into target-related foreground feature maps and task-irrelevant background feature maps. Then, a feature interaction module is designed on the basis of the cross-attention mechanism; it establishes information interaction between the target-related foreground feature maps of the support branch and the entire query feature map, promoting interactivity between branches while enhancing the expressiveness of task-related features. In addition, a masked average pooling strategy is applied to the interactive feature map to generate a target foreground prototype set, and a background prototype set is generated from the support background feature map. Finally, the cosine similarity measure is used to compute similarity values between the support features and the prototype sets and between the query features and the prototype sets, and the corresponding mask is generated from the maximum similarity value at each position.
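The prototype-generation and matching steps described above (mask-based decomposition of support features, masked average pooling, and parameter-free cosine matching) can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the authors' implementation; all function names and toy data are invented for exposition.

```python
import numpy as np

def masked_average_pooling(feat, mask):
    """Pool a (C, H, W) feature map into a (C,) prototype over positions where mask == 1."""
    return (feat * mask[None]).sum(axis=(1, 2)) / (mask.sum() + 1e-8)

def cosine_similarity_map(feat, proto):
    """Per-position cosine similarity between (C, H, W) features and a (C,) prototype."""
    fn = feat / (np.linalg.norm(feat, axis=0, keepdims=True) + 1e-8)
    pn = proto / (np.linalg.norm(proto) + 1e-8)
    return np.einsum("chw,c->hw", fn, pn)

def predict_mask(query_feat, fg_proto, bg_proto):
    """Assign each position to the prototype with the higher similarity (1 = foreground)."""
    sims = np.stack([cosine_similarity_map(query_feat, bg_proto),
                     cosine_similarity_map(query_feat, fg_proto)])
    return sims.argmax(axis=0)

# Toy 1-shot episode: 8-channel fused features on a 4x4 grid.
rng = np.random.default_rng(0)
support_feat = rng.normal(size=(8, 4, 4))
support_mask = np.zeros((4, 4))
support_mask[1:3, 1:3] = 1.0  # ground-truth foreground region of the support image

# Decompose the support features into foreground and background prototypes.
fg_proto = masked_average_pooling(support_feat, support_mask)
bg_proto = masked_average_pooling(support_feat, 1.0 - support_mask)

# Segment the query features by nearest-prototype cosine similarity.
query_feat = rng.normal(size=(8, 4, 4))
query_pred = predict_mask(query_feat, fg_proto, bg_proto)  # (4, 4) binary mask
```

Because the metric step is parameter-free, only the backbone and the interaction module carry learnable weights; the same matching routine serves both the support and query branches.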
Result
Experimental results on the classic PASCAL-5i (pattern analysis, statistical modeling and computational learning) dataset show that when Visual Geometry Group (VGG-16), residual neural network (ResNet-50), and ResNet-101 are used as backbone networks, the proposed few-shot semantic segmentation model achieves mean intersection over union (mIoU) scores of 50.2%/53.2%/57.1% and foreground and background intersection over union (FB-IoU) scores of 68.3%/69.4%/72.3% in the one-way one-shot task, and mIoU scores of 52.9%/55.7%/59.7% and FB-IoU scores of 69.7%/72.5%/74.6% in the one-way five-shot task. Results on the more challenging COCO-20i (common objects in context) dataset show that with the same backbones the proposed model achieves mIoU scores of 23.9%/35.1%/36.4% and FB-IoU scores of 60.1%/62.4%/64.1% in the one-way one-shot task, and mIoU scores of 32.5%/37.3%/38.3% and FB-IoU scores of 64.2%/66.2%/66.7% in the one-way five-shot task. The performance gains of the proposed model on both datasets are competitive.
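For reference, the two reported metrics can be computed as below: mIoU averages the per-class IoU over the evaluated classes, while FB-IoU averages the foreground and background IoU. The 2x2 masks are an invented toy example, not data from the paper.

```python
import numpy as np

def iou(pred, gt, cls):
    """Intersection over union for one class label."""
    p, g = pred == cls, gt == cls
    union = np.logical_or(p, g).sum()
    return float(np.logical_and(p, g).sum() / union) if union else 0.0

def mean_iou(pred, gt, classes):
    """mIoU: IoU averaged over the given class labels."""
    return float(np.mean([iou(pred, gt, c) for c in classes]))

def fb_iou(pred, gt):
    """FB-IoU: average of foreground (label 1) and background (label 0) IoU."""
    return 0.5 * (iou(pred, gt, 1) + iou(pred, gt, 0))

# Toy 2x2 masks: foreground IoU = 2/3, background IoU = 1/2.
gt = np.array([[0, 1], [1, 1]])
pred = np.array([[0, 1], [0, 1]])
```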
Conclusion
Compared with current mainstream few-shot semantic segmentation models, the proposed model achieves higher mIoU and FB-IoU in both the one-way one-shot and one-way five-shot tasks, with a remarkable improvement in overall performance. Further validation shows that feature interaction between the support and query branches effectively improves the model's ability to locate and segment unknown novel classes in query images, and that a joint loss across the two branches promotes information flow between the dual-branch features, enhances the reliability of the prototype representation, and aligns the cross-branch prototype sets.
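The cross-branch joint loss mentioned above can be illustrated with a hedged sketch: prototypes pooled from one branch are used to segment the other branch's image, and a per-pixel cross-entropy over temperature-scaled cosine similarities penalizes misalignment. The function names, the temperature `tau`, and the toy data are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def cosine_sim_map(feat, proto):
    """Per-position cosine similarity between (C, H, W) features and a (C,) prototype."""
    fn = feat / (np.linalg.norm(feat, axis=0, keepdims=True) + 1e-8)
    pn = proto / (np.linalg.norm(proto) + 1e-8)
    return np.einsum("chw,c->hw", fn, pn)

def seg_loss(feat, fg_proto, bg_proto, gt_mask, tau=0.1):
    """Per-pixel cross-entropy over temperature-scaled cosine similarities."""
    logits = np.stack([cosine_sim_map(feat, bg_proto),
                       cosine_sim_map(feat, fg_proto)]) / tau
    logits -= logits.max(axis=0, keepdims=True)  # numerically stable softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    picked = np.where(gt_mask.astype(bool), log_prob[1], log_prob[0])
    return float(-picked.mean())

# Toy episode: prototypes pooled from the top/bottom halves of the feature map.
rng = np.random.default_rng(1)
support_feat = rng.normal(size=(8, 4, 4))
support_mask = np.zeros((4, 4))
support_mask[:2] = 1.0
fg_proto = support_feat[:, :2].mean(axis=(1, 2))
bg_proto = support_feat[:, 2:].mean(axis=(1, 2))

# A joint objective would also swap roles, adding a term where prototypes
# pooled from the query branch must segment the support image.
loss = seg_loss(support_feat, fg_proto, bg_proto, support_mask)
```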
Keywords: few-shot semantic segmentation; multiscale feature fusion; cross-branch cross-guidance; feature interaction; masked average pooling