Incorporating variational auto-encoder networks for text-driven generation of 3D motion human body
2024, Vol. 29, No. 5, Pages 1434-1446
Print publication date: 2024-05-16
DOI: 10.11834/jig.230291
Li Jian, Yang Jun, Wang Liyan, Wang Yonggui. 2024. Incorporating variational auto-encoder networks for text-driven generation of 3D motion human body. Journal of Image and Graphics, 29(05):1434-1446
Objective
To address problems in existing dynamic 3D digital human model generation, such as the inability to vary body shape and fixed, monotonous motion, this paper proposes a method that fuses a variational auto-encoder (VAE) network, a contrastive language-image pretraining (CLIP) network, and a gate recurrent unit (GRU) network to generate 3D human models in motion. The method can generate a 3D human model whose body shape and actions correspond to a given textual description.
Method
First, a VAE encoding network generates latent codes, which are combined with the CLIP network to generate, in a zero-shot manner, a human model whose body shape matches the textual description; this addresses the problem that unreasonable skinned multi-person linear (SMPL) model parameters yield human models with abnormal body shapes. Second, a VAE network and a GRU network generate variable-length 3D human pose sequences that match the textual description, addressing the limitation that existing motion generation methods can only produce pose sequences of a pre-specified, fixed duration. Finally, the body shape features and motion features are combined to obtain a 3D human model in motion.
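To make the zero-shot body-shape step concrete, the following is a minimal sketch (not the authors' released code) of how a CLIP model can score VAE-sampled SMPL shape candidates against a textual description. Here `decode_shape` and `render_body` are hypothetical stand-ins for the paper's VAE shape decoder and an SMPL body renderer, and the 32-dimensional latent and candidate count are assumed sizes.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def select_body_shape(text, decode_shape, render_body, n_candidates=64):
    """Pick the VAE-sampled SMPL shape whose rendering best matches `text`."""
    tokens = clip.tokenize([text]).to(device)
    best_score, best_beta = -1.0, None
    with torch.no_grad():
        text_emb = model.encode_text(tokens)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        for _ in range(n_candidates):
            z = torch.randn(1, 32, device=device)    # sample from the VAE prior
            beta = decode_shape(z)                   # hypothetical: latent -> SMPL betas
            pil = render_body(beta)                  # hypothetical: betas -> PIL image
            image = preprocess(pil).unsqueeze(0).to(device)
            img_emb = model.encode_image(image)
            img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
            score = (text_emb @ img_emb.T).item()    # cosine similarity
            if score > best_score:
                best_score, best_beta = score, beta
    return best_beta, best_score
```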
Result
Human model generation experiments were conducted on the HumanML3D dataset, and the proposed method was compared with three other methods. Relative to the best existing method, the Top1, Top2, and Top3 values of R-precision improve by 0.031, 0.034, and 0.028, respectively; the Fréchet inception distance (FID) improves by 0.094; and the diversity improves by 0.065. Ablation experiments verify the effectiveness of the model, and the results show that the proposed method improves the quality of the generated human models.
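For reference, FID here is the standard Fréchet distance between Gaussians fitted to real and generated motion features. A minimal sketch of its computation (assuming features extracted by a pretrained motion encoder, not the paper's exact evaluation code) is:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """real_feats, gen_feats: (N, D) arrays from a pretrained motion encoder."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2.0 * covmean))
```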
Conclusion
The proposed method can generate a 3D human model in motion from a textual description, and the body shape and actions of the generated model better match the input text.
Objective
Artificial intelligence generated content (AIGC) technology can reduce the workload of three-dimensional (3D) modeling when natural language is used to generate virtual 3D scene models. For static 3D objects, methods now exist that generate high-precision 3D models matching a given textual description. By contrast, for dynamic digital human models, which are also in high demand in numerous applications, existing methods can only generate two-dimensional (2D) human images or sequences of human poses from a textual description; dynamic 3D human models cannot be generated from natural language in the same way. Moreover, existing methods suffer from problems such as immutable body shape and fixed motion when generating dynamic digital human models. To address these problems, a method fusing a variational auto-encoder (VAE), contrastive language-image pretraining (CLIP), and a gate recurrent unit (GRU) is proposed, which generates dynamic 3D human models whose body shapes and motions correspond to the textual description.
Method
A method based on the VAE network is proposed in this paper to generate dynamic 3D human models that correspond to the body shape and action information described in a text. Notably, a variety of pose sequences with variable duration can be generated with the proposed method. First, the body shape information is obtained through the body shape generation module based on the VAE network and the CLIP model, and the skinned multi-person linear (SMPL) parametric human model that matches the textual description is generated in a zero-shot manner. Specifically, the VAE network encodes the body shape of the SMPL model, the CLIP model matches textual descriptions against body shapes, and the 3D human model with the highest matching score is selected. Second, variable-length 3D human pose sequences that match the textual description are generated through the action generation module based on the VAE and GRU networks. In particular, the VAE auto-encoder encodes the dynamic human poses, the action length sampling network predicts a duration that matches the textual description of the action, and the GRU and VAE networks encode the input text and generate diverse dynamic 3D human pose sequences through the decoder. Finally, a dynamic 3D human model corresponding to the body shape and action description is generated by fusing the body shape and action information obtained above. The performance of the method is evaluated on the HumanML3D dataset, which comprises 14 616 motions and 44 970 linguistic annotations. Some of the motions in the dataset are mirrored before training, and some words in the motion descriptions are replaced (e.g., "left" is changed to "right") to expand the dataset. In the experiments, the HumanML3D dataset is divided into training, testing, and validation sets in the ratios of 80%, 15%, and 5%, respectively. The experiments are conducted in an Ubuntu 18.04 environment with a Tesla V100 GPU and 16 GB of video memory. The motion auto-encoder is trained for 300 epochs with the adaptive moment estimation (Adam) optimizer, a learning rate of 0.000 1, and a batch size of 128; the motion generator is trained for 320 epochs with a learning rate of 0.000 2 and a batch size of 32; and the motion length network is trained for 200 epochs with a learning rate of 0.000 1 and a batch size of 64.
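As a rough illustration of the variable-length decoding step, the following sketch (an assumed architecture, not the paper's exact network) shows how a GRU can unroll a text-conditioned latent code into a pose sequence whose length comes from the action length sampling network. The 263-dimensional pose vector matches the common HumanML3D feature layout; the latent and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

class GRUMotionDecoder(nn.Module):
    """Unroll a text-conditioned latent code into a pose sequence of a given length."""

    def __init__(self, latent_dim=256, pose_dim=263, hidden_dim=512):
        super().__init__()
        self.init_hidden = nn.Linear(latent_dim, hidden_dim)  # seed the GRU state
        self.gru = nn.GRUCell(pose_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, z, length):
        # z: (batch, latent_dim); length: number of frames, sampled per text by
        # the action length network, so duration varies across descriptions.
        h = torch.tanh(self.init_hidden(z))
        pose = torch.zeros(z.size(0), self.out.out_features, device=z.device)
        frames = []
        for _ in range(length):            # autoregressive frame-by-frame generation
            h = self.gru(pose, h)
            pose = self.out(h)
            frames.append(pose)
        return torch.stack(frames, dim=1)  # (batch, length, pose_dim)

z = torch.randn(4, 256)                     # latents for a batch of 4 texts
motion = GRUMotionDecoder()(z, length=120)  # e.g., a 120-frame sequence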
Result
Dynamic 3D human model generation experiments were conducted on the HumanML3D dataset. Compared with the best of three other state-of-the-art methods, the proposed method improves the Top1, Top2, and Top3 values of R-precision by 0.031, 0.034, and 0.028, respectively, the Fréchet inception distance (FID) by 0.094, and the diversity by 0.065. The qualitative evaluation was divided into three parts: body shape feature generation, action feature generation, and dynamic 3D human model generation with body shape features. The body shape generation part was tested with different textual descriptions (e.g., tall, short, fat, thin). For the action generation part, the same textual descriptions were fed to the proposed method and the compared methods for generation comparison. By combining the body shape and action features of the human body, the generation of dynamic 3D human models with body shape features is demonstrated. In addition, ablation experiments, including comparisons across different loss functions, further demonstrate the effectiveness of the method. The final experimental results show that the proposed method improves the quality of the generated models.
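For clarity on the quantitative protocol, below is a minimal sketch of Top-k R-precision as commonly computed for text-to-motion evaluation: each generated motion's ground-truth text embedding is ranked against 31 mismatched ones, and the metric counts how often the true text lands in the top k. The 32-candidate pool is an assumption about the standard protocol, not the paper's exact code.

```python
import torch

def r_precision(motion_emb, text_emb, k=3, pool=32):
    """motion_emb, text_emb: (N, D) aligned embeddings; row i of each describes pair i."""
    n = (motion_emb.size(0) // pool) * pool  # use complete pools of 32 only
    hits = 0
    for i in range(0, n, pool):
        m = motion_emb[i:i + pool]           # 32 motions
        t = text_emb[i:i + pool]             # their matched texts, same order
        dist = torch.cdist(m, t)             # (pool, pool) Euclidean distances
        rank = dist.argsort(dim=1)           # per motion, texts by ascending distance
        truth = torch.arange(pool, device=rank.device).unsqueeze(1)
        hits += (rank[:, :k] == truth).any(dim=1).sum().item()
    return hits / n
```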
Conclusion
This paper presents a method for generating dynamic 3D human models that conform to textual descriptions by fusing body shape and action information. The body shape generation module generates SMPL parameterized human models whose body shapes conform to the textual description, while the action generation module generates variable-length 3D human pose sequences that match the textual description. Experimental results show that the proposed method can effectively generate dynamic 3D human models that conform to textual descriptions, and the generated human models have diverse body shapes and motions. On the HumanML3D dataset, the method outperforms other state-of-the-art algorithms.
Keywords: human motion synthesis; natural language processing (NLP); deep learning; skinned multi-person linear model; variational auto-encoder network