Industrial box-packing action recognition based on multi-view adaptive 3D skeleton network
2024, Vol. 29, No. 5, pp. 1392-1407
Print publication date: 2024-05-16
DOI: 10.11834/jig.230084
Zhang Xueqi, Hu Haiyang, Pan Kailai, Li Zhongjin. 2024. Industrial box-packing action recognition based on multi-view adaptive 3D skeleton network. Journal of Image and Graphics, 29(05):1392-1407
Objective
Action recognition is becoming increasingly important in industrial manufacturing. In complex production workshops, however, action recognition is disturbed by environmental occlusion, viewpoint changes, and confusion between similar actions. To address these problems, this paper proposes a box-packing behavior recognition method based on a dual-view skeleton multi-stream network.
Method
Stacked difference images (residual frames, RF) are used as the model input, and a multi-view module is introduced to handle occlusion of the human body. In the view transformation module, the differential human skeleton is rotated to the optimal virtual observation angle, and the transformed skeleton data are fed into a three-layer stacked long short-term memory (LSTM) network; the classification scores obtained under the different views are fused to produce the recognition result. To recognize subtle actions, locally localized images obtained with an attention mechanism are passed to a convolutional neural network for recognition. The skeleton-based and local-image results are fused to predict the worker's action.
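As a rough illustration of the difference-image (residual frame) input described above, the following is a minimal sketch, assuming the clip is available as a (T, H, W, 3) uint8 array; the stacking window `k` is an illustrative parameter, not a value taken from the paper.

```python
import numpy as np

def residual_frames(frames: np.ndarray, k: int = 4) -> np.ndarray:
    """Turn an RGB clip (T, H, W, 3) into stacked difference images.

    Each difference image is the absolute difference of two consecutive
    frames, which suppresses the static background and keeps the moving
    worker; k consecutive differences are stacked along the channel axis.
    """
    clip = frames.astype(np.int16)
    diffs = np.abs(clip[1:] - clip[:-1]).astype(np.uint8)      # (T-1, H, W, 3)
    stacked = [
        np.concatenate(diffs[t:t + k], axis=-1)                # (H, W, 3*k)
        for t in range(diffs.shape[0] - k + 1)
    ]
    return np.stack(stacked)                                    # (T-k, H, W, 3*k)
```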
Result
Experiments on box-packing scenes in a real production environment show that the proposed method achieves a recognition accuracy of 92.31%, substantially ahead of existing mainstream action recognition methods. The method was also evaluated on the public NTU (Nanyang Technological University) RGB+D dataset, reaching 85.52% under the cross-subject (CS) protocol and 93.64% under the cross-view (CV) protocol, outperforming other networks and further validating its effectiveness and accuracy.
Conclusion
This paper presents a human action recognition method that fully exploits human action information from multiple views. By combining a skeleton network with a convolutional neural network model, it effectively improves the accuracy of action recognition.
Objective
Action recognition has become increasingly important in industrial manufacturing: recognizing worker actions and postures in complex production environments can improve production efficiency and quality. In recent years, action recognition based on skeletal data has received widespread attention, and methods built on graph convolutional networks (GCNs) or long short-term memory (LSTM) networks have shown excellent recognition performance. However, these methods do not account for occlusion, viewpoint changes, or similar subtle actions in the factory environment, all of which can severely degrade subsequent recognition. This study therefore proposes a box-packing behavior recognition method that combines a dual-view skeleton multi-stream network.
Method
The network model consists of a main network and a sub-network. The main network takes two RGB videos captured from different viewpoints as input, recording the same worker action at the same time. The image-difference method first converts the input videos into difference images. The 3D skeleton of the worker is then extracted from the depth map with a 3D pose estimation algorithm and passed to the subsequent view transformation module. In this module, the skeleton data are rotated to find the best observation angle, and the transformed skeleton data are fed into a three-layer stacked LSTM network; the classification scores of the two views are fused by weighted averaging to obtain the recognition result of the main network. In addition, for similar behaviors and non-compliant "fake actions", we use locally localized images obtained with an attention mechanism and pass them to a ResNeXt network for recognition. A spatio-temporal attention mechanism is also introduced to focus on the key frames of the skeleton sequence when analyzing the video action sequence. Finally, the recognition scores of the main network and the sub-network are fused in proportion to obtain the final recognition result and predict the worker's behavior.
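To make the view-adaptation and stacked-LSTM pipeline of the main network concrete, here is a minimal PyTorch sketch, not the authors' implementation; the joint count, hidden sizes, number of classes, and the small LSTM used to regress the rotation angles are all illustrative assumptions.

```python
import torch
import torch.nn as nn

def rotation_matrix(angles: torch.Tensor) -> torch.Tensor:
    """Batched rotation matrix from (yaw, pitch, roll) angles: (N, 3) -> (N, 3, 3)."""
    a, b, c = angles[:, 0], angles[:, 1], angles[:, 2]
    zeros, ones = torch.zeros_like(a), torch.ones_like(a)
    Rz = torch.stack([torch.cos(a), -torch.sin(a), zeros,
                      torch.sin(a),  torch.cos(a), zeros,
                      zeros, zeros, ones], dim=-1).view(-1, 3, 3)
    Ry = torch.stack([torch.cos(b), zeros, torch.sin(b),
                      zeros, ones, zeros,
                      -torch.sin(b), zeros, torch.cos(b)], dim=-1).view(-1, 3, 3)
    Rx = torch.stack([ones, zeros, zeros,
                      zeros, torch.cos(c), -torch.sin(c),
                      zeros, torch.sin(c), torch.cos(c)], dim=-1).view(-1, 3, 3)
    return Rz @ Ry @ Rx

class ViewAdaptiveLSTM(nn.Module):
    """View-adaptation subnet plus a three-layer stacked LSTM classifier (one camera stream)."""
    def __init__(self, num_joints: int = 17, num_classes: int = 10, hidden: int = 128):
        super().__init__()
        in_dim = num_joints * 3
        self.angle_net = nn.LSTM(in_dim, 64, batch_first=True)  # regresses per-frame view angles
        self.angle_out = nn.Linear(64, 3)
        self.backbone = nn.LSTM(in_dim, hidden, num_layers=3, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, skel: torch.Tensor) -> torch.Tensor:      # skel: (B, T, J, 3)
        B, T, J, _ = skel.shape
        flat = skel.view(B, T, J * 3)
        h, _ = self.angle_net(flat)
        angles = self.angle_out(h).view(B * T, 3)
        R = rotation_matrix(angles).view(B, T, 3, 3)
        rotated = torch.einsum('btij,btkj->btki', R, skel)       # rotate every joint per frame
        feat, _ = self.backbone(rotated.reshape(B, T, J * 3))
        return self.classifier(feat[:, -1])                      # per-class scores
```

In the full model, one such stream would run per camera view, and the class scores of the two streams would then be fused as described above.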
Result
First, among the comparison methods, convolutional neural network (CNN)-based approaches usually perform better than recurrent neural network (RNN)-based ones, while graph convolutional network (GCN)-based approaches perform moderately. Combining CNN and RNN structures further improves accuracy and recall by better exploiting the spatio-temporal information of skeletons. The method proposed in this study achieves a box-packing recognition accuracy of 92.31% and a recall of 89.72%, which are still 3.96% and 3.81% higher, respectively, than those of the best compared method, placing it significantly ahead of existing mainstream behavior recognition methods. Second, using difference images combined with the skeleton extraction algorithm achieves 87.6% accuracy, better than using the raw RGB images as input, although the frame rate drops to 55.3 frames per second, which is still acceptable. Third, regarding the adaptive transformation module and the multi-view module, the recognition rate of a single-stream network with the adaptive transformation module improves considerably, with only a slight drop in frame rate. The experiments show that the module tends to observe the action from the front, because a frontal view spreads the skeleton joints apart as much as possible, whereas the view with the most mutual occlusion among joints gives the worst observation. For the dual view, simply fusing the outputs of the two single streams already improves performance, and weighted averaging works best, exceeding the accuracy of single streams S1 and S2 by 3.83% and 3.03%, respectively. Some actions suffer from object occlusion or self-occlusion at a particular shooting angle; the two complementary views solve this problem because an action occluded in one view can still be recognized well in the other. In addition, evaluations on the public NTU RGB+D dataset show that the proposed method outperforms other networks, further validating its effectiveness and accuracy.
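The weighted-average fusion of the two single-stream outputs mentioned above could look roughly like the sketch below; the weight value is a placeholder, not the one tuned in the paper.

```python
import torch
import torch.nn.functional as F

def fuse_views(logits_s1: torch.Tensor, logits_s2: torch.Tensor, w1: float = 0.5) -> torch.Tensor:
    """Weighted average of the per-class scores of the two camera views.

    The paper reports that weighted averaging beats either single stream;
    the weight used here is only an illustrative placeholder.
    """
    p1 = F.softmax(logits_s1, dim=-1)
    p2 = F.softmax(logits_s2, dim=-1)
    return (w1 * p1 + (1.0 - w1) * p2).argmax(dim=-1)  # predicted class per sample
```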
Conclusion
The proposed method uses a two-stream network model. The main network is an adaptive multi-view RNN: two depth cameras with complementary viewpoints collect data from the same workstation, and the incoming RGB images are converted into difference images from which skeleton information is extracted. The skeleton data are passed through the adaptive view transformation module to obtain the best observation viewpoint, and a three-layer stacked LSTM network produces the recognition result; the features of the two views are then fused with learned weights. The main network thereby mitigates the effects of occlusion and background clutter. The sub-network adds skeleton-guided recognition of hand images: the cropped local images are sent to a ResNeXt network, compensating for the insufficient accuracy on "fake actions" and similar actions. Finally, the recognition results of the main network and the sub-network are fused. The proposed human behavior recognition method effectively exploits human behavior information from multiple views and combines a skeleton network with CNN models to significantly improve the accuracy of behavior recognition.
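The sub-network's skeleton-guided hand cropping and ResNeXt classification could be sketched as below; this is not the authors' code, and the crop size, class count, and use of torchvision's off-the-shelf `resnext50_32x4d` are assumptions.

```python
import numpy as np
import torch
import torch.nn.functional as F
import torchvision

def crop_hand_region(frame: np.ndarray, wrist_xy, size: int = 112) -> np.ndarray:
    """Crop a square patch around the wrist joint projected into the image.

    `frame` is an (H, W, 3) uint8 image and `wrist_xy` the wrist pixel
    coordinates taken from the estimated skeleton; the crop size is assumed.
    """
    x, y = int(wrist_xy[0]), int(wrist_xy[1])
    half = size // 2
    y0, x0 = max(0, y - half), max(0, x - half)
    return frame[y0:y0 + size, x0:x0 + size]

# Off-the-shelf ResNeXt classifier over the cropped hand patches (torchvision >= 0.13).
model = torchvision.models.resnext50_32x4d(weights=None, num_classes=10)  # class count is illustrative

def classify_patch(patch: np.ndarray) -> int:
    x = torch.from_numpy(patch).permute(2, 0, 1).float().unsqueeze(0) / 255.0  # (1, 3, h, w)
    x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    return model(x).argmax(dim=-1).item()
```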
action recognition; long short-term memory (LSTM); dual-view; adaptive view transformation; attention mechanism