Virtual viewpoint image synthesis using neural radiance fields with depth information supervision
2024, Vol. 29, No. 7, Pages: 2035-2045
Print publication date: 2024-07-16
DOI: 10.11834/jig.221188
Liu Xiaonan, Chen Chunyi, Hu Xiaojuan, Yu Haiyang. 2024. Virtual viewpoint image synthesis using neural radiance fields with depth information supervision. Journal of Image and Graphics, 29(07): 2035-2045
目的 (Objective)
During virtual viewpoint image synthesis with neural radiance fields, too few input views or inconsistent view colors produce outlier sparse depth values. To address this problem, we propose supervising neural radiance field virtual viewpoint synthesis with dense depth values produced by a depth estimation network.
方法 (Method)
First, structure from motion is run on the input views to obtain sparse depth values. Next, the RGB views are fed into the New CRFs (neural window fully-connected CRFs for monocular depth estimation) network to obtain estimated depth values, and the standard deviation between the estimated and sparse depth values is computed. Finally, the estimated depth values and the computed standard deviations supervise the training of the neural radiance field.
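To make the sparse-depth step concrete, the following is a minimal NumPy sketch, not the authors' code: each 3D point recovered by structure from motion is moved into the camera frame with the extrinsics, projected with the intrinsics, and its z-value becomes the depth of the pixel it lands on. All function and variable names are illustrative assumptions.

```python
import numpy as np

def sparse_depth_from_points(points_world, R, t, K, height, width):
    """Project Nx3 world-space SfM points into a sparse depth map.

    R (3x3) and t (3,) are the world-to-camera extrinsics, K (3x3) the intrinsics.
    Pixels that receive no point keep depth 0, i.e. no supervision there.
    """
    # World coordinates -> camera coordinates.
    points_cam = points_world @ R.T + t            # (N, 3)
    z = points_cam[:, 2]
    valid = z > 1e-6                               # keep only points in front of the camera
    points_cam, z = points_cam[valid], z[valid]

    # Camera coordinates -> pixel coordinates via the intrinsic matrix.
    uv = points_cam @ K.T                          # (N, 3); third column equals z
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)

    depth = np.zeros((height, width), dtype=np.float32)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    # If several points fall on the same pixel, keep the nearest one.
    for ui, vi, zi in zip(u[inside], v[inside], z[inside]):
        if depth[vi, ui] == 0 or zi < depth[vi, ui]:
            depth[vi, ui] = zi
    return depth
```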
结果 (Result)
Experiments on the NeRF Real dataset compare the proposed method with other algorithms. In few-view synthesis experiments, the proposed method surpasses, in image quality and visual effect, both the NeRF (neural radiance fields) method that uses only RGB supervision and the method supervised with sparse depth information: peak signal-to-noise ratio improves by 24% over NeRF and by 19.8% over the sparse-depth-supervised method, and structural similarity improves by 36% over NeRF and by 16.6% over the sparse-depth-supervised method. To verify data efficiency, we also compare the peak signal-to-noise ratio reached after the same number of iterations; relative to NeRF, data efficiency also improves markedly.
结论 (Conclusion)
The experimental results show that the proposed method, which supervises neural radiance field virtual viewpoint synthesis with dense depth values from a depth estimation network, resolves the problem of outlier sparse depth values caused by too few views or inconsistent view colors.
Objective
Viewpoint synthesis techniques are widely applied in computer graphics and computer vision. According to whether they depend on geometric information, virtual viewpoint synthesis methods can be classified into two categories: image-based rendering and model-based rendering. 1) Image-based rendering typically uses input from camera arrays or light field cameras to achieve high-quality rendering without reconstructing the geometric information of the scene. Among image-based rendering methods, depth map-based rendering is currently a popular research topic for virtual viewpoint rendering. However, this technology is prone to depth errors, which lead to holes and artifacts in the generated virtual viewpoint image. In addition, obtaining precise depth information for real-world scenes is difficult in practice. 2) Model-based rendering builds 3D geometric models of real-world scenes and synthesizes virtual viewpoint images through projection transformation, clipping, hidden-surface removal, and texture mapping. However, the difficulty of quickly modeling real-world scenes is a significant drawback of this approach. With the emergence of neural rendering, the neural radiance fields technique represents the 3D scene with a neural network and combines it with volume rendering for viewpoint synthesis, producing photo-realistic results. However, this approach relies heavily on view appearance and requires a substantial number of input views for modeling. As a result, it may explain the training images perfectly yet generalize poorly to novel test views. Depth information is therefore introduced as supervision to reduce the dependence of the neural radiance fields on view appearance. However, with a limited number of input views, structure from motion produces sparse depth values that are inaccurate and contain outliers. This study therefore proposes a virtual viewpoint synthesis algorithm that supervises the neural radiance fields with dense depth values obtained from a depth estimation network and introduces an embedding vector into the fitting function of the neural radiance fields to improve virtual viewpoint image quality.
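The paper does not spell out how the embedding vector enters the fitting function; the PyTorch sketch below shows one plausible arrangement, assuming a learnable per-view code concatenated with the positionally encoded sample location before the MLP trunk. Layer sizes, names, and the concatenation point are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class NeRFWithEmbedding(nn.Module):
    """A NeRF-style MLP whose input is augmented with a per-view embedding vector."""

    def __init__(self, pos_dim=63, dir_dim=27, embed_dim=16, num_views=50, hidden=256):
        super().__init__()
        self.view_embedding = nn.Embedding(num_views, embed_dim)  # one learnable code per training view
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)                    # volume density
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),              # RGB in [0, 1]
        )

    def forward(self, encoded_pos, encoded_dir, view_idx):
        # encoded_pos: (B, pos_dim) positionally encoded sample locations
        # encoded_dir: (B, dir_dim) positionally encoded viewing directions
        # view_idx:    (B,) LongTensor, index of the source view of each ray
        emb = self.view_embedding(view_idx)
        h = self.trunk(torch.cat([encoded_pos, emb], dim=-1))
        sigma = torch.relu(self.sigma_head(h))
        rgb = self.color_head(torch.cat([h, encoded_dir], dim=-1))
        return rgb, sigma
```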
Method
First, the camera intrinsic and extrinsic matrices were calibrated for the input views. The 3D point cloud in the world coordinate system was transformed into the camera coordinate system using the extrinsic matrix, and the resulting points were projected onto the image plane using the intrinsic matrix to obtain the sparse depth values. Next, the RGB views were fed into the New CRFs (neural window fully-connected CRFs for monocular depth estimation) network to obtain estimated depth values, and the standard deviation between the estimated and sparse depth values was computed. The New CRFs network uses an FC-CRFs module built on a multi-head attention mechanism as its decoder and a vision Transformer as its encoder, forming a U-shaped encoder-decoder structure for depth estimation. Finally, the training of the neural radiance fields was supervised with the estimated depth values and the computed standard deviations. Training began by casting camera rays through the input views to determine the sampling locations and the sample-point parameterization scheme. The re-parameterized sample-point locations were then fed into the network for fitting, and the network output volume density and color values, from which rendered color and depth values were computed by volume rendering. Training was supervised with the color loss between the rendered and ground-truth color values and the depth loss between the predicted depth value and the rendered depth value.
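The following PyTorch sketch illustrates the rendering and supervision step under stated assumptions: standard NeRF-style volume rendering yields both a color and an expected depth per ray, and a depth term weighted by the per-pixel standard deviation pulls the rendered depth toward the estimated depth. The Gaussian-style weighting and the loss weight lambda_d are assumptions; the paper only states that the estimated depths and standard deviations supervise training.

```python
import torch

def render_ray(sigma, rgb, t_vals):
    """Composite per-sample density/color into a rendered color and depth.

    sigma: (B, S) densities, rgb: (B, S, 3) colors, t_vals: (B, S) sample distances.
    """
    deltas = t_vals[:, 1:] - t_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                       # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                                        # contribution of each sample
    color = (weights.unsqueeze(-1) * rgb).sum(dim=1)               # rendered color per ray
    depth = (weights * t_vals).sum(dim=1)                          # rendered (expected) depth per ray
    return color, depth

def total_loss(color, depth, gt_color, est_depth, est_std, lambda_d=0.1):
    """Color loss plus a depth loss weighted by the estimated-vs-sparse standard deviation."""
    color_loss = ((color - gt_color) ** 2).mean()
    # Rays whose estimated depth disagrees strongly with the sparse SfM depth get a
    # larger std and therefore a weaker pull; the exact weighting is an assumption.
    depth_loss = (((depth - est_depth) / (est_std + 1e-6)) ** 2).mean()
    return color_loss + lambda_d * depth_loss
```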
Result
Experiments were conducted on the NeRF Real dataset, which comprises eight real-world scenes captured by forward-facing cameras. The proposed method was compared with other algorithms, including the neural radiance fields (NeRF) method that uses only RGB supervision and the method that employs sparse depth information supervision. The assessment criteria were peak signal-to-noise ratio, structural similarity index, and learned perceptual image patch similarity. The results indicate that, in few-view synthesis experiments, the proposed method surpassed both the RGB-only NeRF method and the sparse-depth-supervised method in image quality and visual effect. Specifically, the proposed method achieved a 24% improvement in peak signal-to-noise ratio over the NeRF method and a 19.8% improvement over the sparse depth information supervision method, as well as a 36% improvement in structural similarity index over the NeRF method and a 16.6% improvement over the sparse depth information supervision method. Data efficiency was evaluated by comparing the peak signal-to-noise ratio reached after the same number of iterations; the proposed method showed a significant improvement over the NeRF method.
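For reference, the snippet below shows how the reported image-quality metrics could be computed for a single synthesized view with scikit-image and the third-party lpips package; this is an assumption about tooling, not the authors' evaluation code.

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_view(pred, gt):
    """pred, gt: HxWx3 float arrays in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    # channel_axis requires scikit-image >= 0.19; older versions use multichannel=True.
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1]; building the model here is
    # wasteful for many views but keeps the sketch self-contained.
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
    lp = lpips.LPIPS(net='alex')(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```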
Conclusions
In this study, we proposed a method for synthesizing virtual viewpoint images using neural radiance fields supervised by dense depth. The method uses the dense depth values output by the depth estimation network to supervise the training of the neural radiance fields and introduces an embedding vector into the fitting function during training. The experiments demonstrate that our approach effectively addresses the problem of outlier sparse depth values resulting from insufficient views or inconsistent view colors and achieves high-quality synthesized images, particularly when the number of input views is limited.
viewpoint synthesis; neural radiance field (NeRF); depth supervision; depth estimation; volume rendering
Barron J T, Mildenhall B, Tancik M, Hedman P, Martin-Brualla R and Srinivasan P P. 2021. Mip-NeRF: a multiscale representation for anti-aliasing neural radiance fields//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 5835-5844 [DOI: 10.1109/ICCV48922.2021.00580]
Chang Y and Gai M. 2021. A review on neural radiance fields based view synthesis. Journal of Graphics, 42(3): 376-384 [DOI: 10.11996/JG.j.2095-302X.2021030376]
Chen L Y, Chen S J, Cen K and Zhu W. 2020. High image quality virtual viewpoint rendering method and its GPU acceleration. Journal of Chinese Computer Systems, 41(10): 2212-2218 [DOI: 10.3969/j.issn.1000-1220.2020.10.032]
Deng K L, Liu A, Zhu J Y and Ramanan D. 2022. Depth-supervised NeRF: fewer views and faster training for free//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 12872-12881 [DOI: 10.1109/CVPR52688.2022.01254]
Geiger A, Lenz P and Urtasun R. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, USA: IEEE: 3354-3361 [DOI: 10.1109/CVPR.2012.6248074]
Hedman P, Philip J, Price T, Frahm J M, Drettakis G and Brostow G. 2018. Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics, 37(6): #257 [DOI: 10.1145/3272127.3275084]
Jensen R, Dahl A, Vogiatzis G, Tola E and Aanaes H. 2014. Large scale multi-view stereopsis evaluation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 406-413 [DOI: 10.1109/CVPR.2014.59]
Ji X P, Ren X F, Zhou Z Z, Huang Y Y, Zhen J N, Zhou L, Zhou X W and Zhang G F. 2022. Dynamic object removal and image inpainting for 3D panoramic tour. Journal of Computer-Aided Design and Computer Graphics, 34(8): 1147-1159 [DOI: 10.3724/SP.J.1089.2022.19142]
Liang H T, Chen X D, Xu H Y, Ren S Y, Wang Y and Cai H Y. 2019. Virtual view rendering based on depth map preprocessing and image inpainting. Journal of Computer-Aided Design and Computer Graphics, 31(8): 1278-1285 [DOI: 10.3724/SP.J.1089.2019.17541]
Mildenhall B, Srinivasan P P, Ortiz-Cayon R, Kalantari N K, Ramamoorthi R, Ng R and Kar A. 2019. Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics, 38(4): #29 [DOI: 10.1145/3306346.3322980]
Mildenhall B, Srinivasan P P, Tancik M, Barron J T, Ramamoorthi R and Ng R. 2020. NeRF: representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1): 99-106 [DOI: 10.1145/3503250]
Penner E and Zhang L. 2017. Soft 3D reconstruction for view synthesis. ACM Transactions on Graphics, 36(6): #235 [DOI: 10.1145/3130800.3130855]
Roessle B, Barron J T, Mildenhall B, Srinivasan P P and Nießner M. 2022. Dense depth priors for neural radiance fields from sparse input views//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 12892-12901 [DOI: 10.1109/CVPR52688.2022.01255]
Schönberger J L and Frahm J M. 2016. Structure-from-motion revisited//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 4104-4113 [DOI: 10.1109/CVPR.2016.445]
Silberman N, Hoiem D, Kohli P and Fergus R. 2012. Indoor segmentation and support inference from RGBD images//Proceedings of the 12th European Conference on Computer Vision. Florence, Italy: Springer: 746-760 [DOI: 10.1007/978-3-642-33715-4_54]
Wang Z, Bovik A C, Sheikh H R and Simoncelli E P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4): 600-612 [DOI: 10.1109/TIP.2003.819861]
Wei Y, Liu S H, Rao Y M, Zhao W, Lu J W and Zhou J. 2021. NerfingMVS: guided optimization of neural radiance fields for indoor multi-view stereo//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 5590-5599 [DOI: 10.1109/ICCV48922.2021.00556]
Yao L, Li X M and Han Y D. 2017. Virtual viewpoint synthesis of quick remove the distortion. Journal of Graphics, 38(4): 566-576 [DOI: 10.11996/JG.j.2095-302X.2017040566]
Yuan W H, Gu X D, Dai Z Z, Zhu S Y and Tan P. 2022. Neural window fully-connected CRFs for monocular depth estimation//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 3906-3915 [DOI: 10.1109/CVPR52688.2022.00389]
Zhang K, Riegler G, Snavely N and Koltun V. 2020. NeRF++: analyzing and improving neural radiance fields [EB/OL]. [2023-01-09]. https://arxiv.org/pdf/2010.07492.pdf
Zhang R, Isola P, Efros A A, Shechtman E and Wang O. 2018. The unreasonable effectiveness of deep features as a perceptual metric//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 586-595 [DOI: 10.1109/CVPR.2018.00068]