面向虚拟视点绘制空洞填充的渐进式迭代网络
Progressive iteration network for hole filling in virtual view rendering
2024年29卷第7期 页码: 1948-1959
纸质出版日期: 2024-07-16
DOI: 10.11834/jig.230290
刘家希, 周洋, 林坤, 殷海兵, 唐向宏. 2024. 面向虚拟视点绘制空洞填充的渐进式迭代网络. 中国图象图形学报, 29(07):1948-1959
Liu Jiaxi, Zhou Yang, Lin Kun, Yin Haibing, Tang Xianghong. 2024. Progressive iteration network for hole filling in virtual view rendering. Journal of Image and Graphics, 29(07):1948-1959
目的
基于深度图像的绘制(depth image based rendering,DIBR)是合成虚拟视点图像的关键技术,但在绘制过程中虚拟视图会出现裂纹和空洞问题。针对传统算法导致大面积空洞区域像素混叠和模糊的问题,将深度学习模型应用于虚拟视点绘制空洞填充领域,提出了面向虚拟视点绘制空洞填充的渐进式迭代网络。
方法
首先,使用部分卷积对大面积空洞进行渐进修复。然后采用U-Net网络作为主干对空洞区域进行编解码操作,同时嵌入知识一致注意力模块加强网络对有效特征的利用。接着通过加权合并方法来融合每次渐进式迭代生成的特征图,保护早期特征不被破坏。最后结合上下文特征传播损失提高网络匹配过程中的鲁棒性。
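下面给出部分卷积层的一个极简示意实现（基于 PyTorch 的假设性代码，并非本文或相关文献的原始实现，类名与参数均为示例），用以说明“仅用空洞周围的有效像素做卷积、并在每次迭代后收缩空洞掩膜”的渐进修复思路。

```python
# 部分卷积(partial convolution)的最小示意实现(非本文原始代码):
# 卷积仅在掩膜标记的有效像素上归一化, 并在每次前向传播后收缩空洞掩膜,
# 对应文中"对大面积空洞进行渐进修复"的思路。
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # 掩膜更新用的全 1 固定卷积核, 不参与训练
        self.register_buffer(
            "weight_mask", torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # mask: (N,1,H,W), 1 表示有效像素, 0 表示空洞
        with torch.no_grad():
            valid = F.conv2d(mask, self.weight_mask,
                             stride=self.stride, padding=self.padding)
        out = self.conv(x * mask)                      # 只用有效像素参与卷积
        scale = self.weight_mask.numel() / valid.clamp(min=1e-8)
        bias = self.conv.bias.view(1, -1, 1, 1)
        out = (out - bias) * scale + bias              # 按窗口内有效像素个数重新归一化
        new_mask = (valid > 0).float()                 # 空洞边界向内收缩一圈
        return out * new_mask, new_mask

# 用法示意: 掩膜在多次迭代中逐步收缩, 直至空洞被完全覆盖
x = torch.randn(1, 3, 64, 64)
mask = torch.ones(1, 1, 64, 64); mask[..., 20:44, 20:44] = 0
pconv = PartialConv2d(3, 16)
feat, mask = pconv(x, mask)
print(float(mask.mean()))
```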
结果
在微软实验室提供的2个多视点3D(three-dimension)视频序列以及4个3D-HEVC(3D high efficiency video coding)序列上进行定量与定性评估实验,以峰值信噪比(peak signal-to-noise ratio,PSNR)和结构相似性(structural similarity,SSIM)作为指标。实验结果表明,本文算法在主观和客观上均优于已有方法。相比于性能第2的模型,在Ballet、Breakdancers、Lovebird1和Poznan_Street数据集上,本文算法的PSNR提升了1.302 dB、1.728 dB、0.068 dB和0.766 dB,SSIM提升了0.007、0.002、0.002和0.033;在Newspaper和Kendo数据集中,PSNR提升了0.418 dB和0.793 dB,SSIM提升了0.011和0.007。同时进行消融实验验证了本文方法的有效性。
结论
本文提出的渐进式迭代网络模型,解决了虚拟视点绘制空洞填充领域中传统算法过程烦琐和前景纹理渗透严重的问题,取得了极具竞争力的填充结果。
Objective
Depth image-based rendering (DIBR) makes full use of the depth information in a reference image and combines the color image and depth information organically, which is faster and less complex than general rendering methods. DIBR has therefore been selected by ISO as the primary virtual view rendering technology for 3D multimedia video. The principal difficulty in virtual view rendering is that 3D warping of the reference view exposes background that was previously occluded by the foreground, so certain areas of the virtual view appear as holes because no pixel values are mapped to them. Finding an effective way to fill these missing regions is a critical challenge in virtual view rendering. Traditional algorithms fill the holes mainly through spatial-domain or temporal-domain consistency. Filtering can effectively remove cracks and some holes but cannot handle large-area holes. Patch-based methods can fill large-area holes, but the process is tedious, the amount of data is excessive, and the best-matching patch is often located inaccurately, so foreground texture may be incorrectly filled into hole regions that belong to the background. Temporal-domain consistency methods reconstruct the missing background with various background models and reproject the foreground to the virtual viewpoint, which reduces computational complexity and increases adaptability to the scene. However, a moving-camera scene contains both stationary and moving objects, so parts of the foreground are easily modeled as background, resulting in the mixing of foreground and background pixels. Therefore, we apply a deep learning model to hole filling in virtual view rendering and propose a progressive iteration network that addresses the pixel blending and blurring produced by traditional algorithms in large hole regions.
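As a concrete illustration of the 3D warping step described above, the following minimal sketch forward-warps a reference view with its depth map and marks the disoccluded pixels that remain as holes. It assumes a rectified, horizontally translated camera pair; the function and parameter names are illustrative and not taken from the paper.

```python
# Minimal sketch (not the paper's code) of DIBR forward warping for a
# rectified, horizontally shifted camera pair: each reference pixel is
# shifted by its disparity d = f * B / Z; target positions that receive
# no pixel remain marked as holes (the disocclusions this paper fills).
import numpy as np

def dibr_forward_warp(color, depth, focal, baseline):
    """color: (H, W, 3) uint8; depth: (H, W) metric depth in the same units
    as `baseline`.  Returns the warped view and a boolean hole mask."""
    h, w = depth.shape
    virtual = np.zeros_like(color)
    filled = np.zeros((h, w), dtype=bool)
    disparity = np.round(focal * baseline / np.maximum(depth, 1e-6)).astype(int)
    # Process far-to-near so nearer pixels overwrite farther ones (z-ordering).
    order = np.argsort(depth, axis=None)[::-1]
    ys, xs = np.unravel_index(order, depth.shape)
    for y, x in zip(ys, xs):
        xt = x - disparity[y, x]          # shift toward the virtual viewpoint
        if 0 <= xt < w:
            virtual[y, xt] = color[y, x]
            filled[y, xt] = True
    holes = ~filled                        # disoccluded background = holes
    return virtual, holes

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    color = rng.integers(0, 255, (64, 96, 3), dtype=np.uint8)
    depth = rng.uniform(1.0, 5.0, (64, 96))
    view, holes = dibr_forward_warp(color, depth, focal=500.0, baseline=0.05)
    print("hole ratio:", holes.mean())
```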
Method
In this study, a progressive iteration network based on a convolutional neural network is built. The network mainly consists of a knowledge consistent attention module, a contextual feature propagation loss, and a weighted merging module. First, partial convolutions are used in the initial stage of the network to progressively repair large-area holes. Each partial convolution operates only on the valid pixels around the hole region, and the updated mask is carried into the next iteration, where the hole shrinks further; this is beneficial for extracting reliable shallow features. Then, a U-Net backbone encodes and decodes the hole regions, and skip connections cascade shallow and deep information to compensate for missing details. To make better use of valid features, we embed a knowledge consistent attention module. This module computes the attention score by weighting the current score with the score obtained in the previous iteration, which establishes correlation between patches across successive iterations and effectively avoids the foreground-background pixel blending seen in traditional algorithms. A contextual feature propagation loss is used together with the attention module in the progressive iteration network. It complements the knowledge consistent attention module by reducing the difference between the reconstructed features in the encoder and decoder, which improves the robustness of the matching process, and it uses an auxiliary image as guidance to generate semantically consistent patches for filling background holes. Furthermore, we employ a pre-trained Visual Geometry Group 16-layer network (VGG-16) as a feature extractor so that the L1 loss, perceptual loss, style loss, and smoothing loss jointly guide the model, enhancing the resemblance between the reference and target views. Finally, the feature maps produced in each progressive iteration are fused by a weighted merging approach: a soft weight map is learned and combined with the per-iteration feature maps so that the merged output preserves the original feature information, protects early features from being corrupted, and thus prevents gradient erosion.
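The sketch below illustrates, under simplifying assumptions, the two ideas most specific to this design: blending the current attention score with the score from the previous progressive iteration (knowledge consistent attention) and fusing the per-iteration feature maps with a learned soft weight map (weighted merging). It is a hypothetical PyTorch rendering for illustration only, not the authors' implementation; the blending factor and module names are assumptions.

```python
# Illustrative sketch (assumptions, not the authors' code) of knowledge
# consistent attention across progressive iterations and soft weighted merging.
import torch
import torch.nn as nn
import torch.nn.functional as F

def knowledge_consistent_attention(feat, prev_score=None, lam=0.5):
    """feat: (N, C, H, W). Returns attended features and the blended score
    so it can be carried into the next progressive iteration."""
    n, c, h, w = feat.shape
    flat = F.normalize(feat.flatten(2), dim=1)           # (N, C, H*W)
    score = torch.einsum("nci,ncj->nij", flat, flat)      # cosine similarity
    score = F.softmax(score, dim=-1)
    if prev_score is not None:                            # knowledge consistency:
        score = lam * score + (1.0 - lam) * prev_score    # reuse earlier scores
    out = torch.einsum("nij,ncj->nci", score, feat.flatten(2))
    return out.view(n, c, h, w), score

class WeightedMerge(nn.Module):
    """Fuses the feature maps of all progressive iterations with learned
    per-pixel soft weights, so early features are not simply overwritten."""
    def __init__(self, channels, n_iters):
        super().__init__()
        self.to_weight = nn.Conv2d(channels * n_iters, n_iters, kernel_size=1)

    def forward(self, feats):                             # list of (N, C, H, W)
        stacked = torch.cat(feats, dim=1)
        weights = F.softmax(self.to_weight(stacked), dim=1)   # (N, T, H, W)
        merged = sum(w.unsqueeze(1) * f
                     for w, f in zip(weights.unbind(dim=1), feats))
        return merged

# Usage sketch: carry the attention score across iterations, then merge.
feat = torch.randn(1, 8, 16, 16)
outputs, score = [], None
for _ in range(3):
    feat, score = knowledge_consistent_attention(feat, score)
    outputs.append(feat)
merged = WeightedMerge(8, 3)(outputs)
print(merged.shape)
```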
Result
The experiments were evaluated quantitatively and qualitatively on two multi-view 3D video sequences provided by Microsoft Labs and four 3D high efficiency video coding (3D-HEVC) sequences, with peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics; a set of hole masks suited to virtual view rendering was collected for training. The experimental results demonstrate that our model yields the most reasonable images in terms of subjective perceptual quality. Compared with the second-best model, it improves PSNR by 1.302 dB, 1.728 dB, 0.068 dB, and 0.766 dB and SSIM by 0.007, 0.002, 0.002, and 0.033 on the Ballet, Breakdancers, Lovebird1, and Poznan_Street datasets, respectively. Similarly, on the Newspaper and Kendo datasets, PSNR increases by 0.418 dB and 0.793 dB and SSIM by 0.011 and 0.007, respectively. In addition, a series of ablation experiments verifies the effectiveness of each component of our model, including the knowledge consistent attention module, the contextual feature propagation loss, the weighted merging module, and the number of iterations.
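For reference, the two reported metrics can be computed as in the self-contained sketch below, in which SSIM is simplified to global statistics (standard evaluations, presumably including this paper's, use the 11×11 Gaussian-windowed SSIM); it is an illustration, not the evaluation code used in the paper.

```python
# PSNR from mean squared error, and a simplified single-window SSIM.
import numpy as np

def psnr(ref, test, peak=255.0):
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def ssim_global(ref, test, peak=255.0):
    # Global-statistics SSIM; the standard metric averages this over local windows.
    x, y = ref.astype(np.float64), test.astype(np.float64)
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
gt = rng.integers(0, 255, (64, 64), dtype=np.uint8)          # ground-truth view
filled = np.clip(gt.astype(int) + rng.integers(-5, 6, gt.shape), 0, 255)
print(f"PSNR = {psnr(gt, filled):.3f} dB, SSIM = {ssim_global(gt, filled):.4f}")
```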
Conclusion
In this study, we apply deep learning to hole filling in virtual view rendering. The proposed progressive iteration network is validated experimentally: it avoids the tedious procedures of traditional algorithms and greatly reduces foreground texture bleeding into background holes, yielding highly competitive filling results. However, the model still has limitations. Although it can focus on effective texture features, its overall efficiency requires further improvement. Moreover, the depth maps that accompany 3D video sequences could be used as guidance, enabling the convolutional neural network to capture more intricate structural information and further improving performance. In future work, we may combine frame interpolation and inpainting techniques to exploit the motion information of objects over time.
虚拟视点绘制；空洞填充；注意力；特征提取；多视点视频加深度
virtual view rendering; hole-filling; attention; feature extraction; multi-view video plus depth