Gesture recognition by combining spatio-temporal mask and spatial 2D position encoding
2024, Vol. 29, No. 5, Pages: 1421-1433
Print publication date: 2024-05-16
DOI: 10.11834/jig.230379
Deng Gansen, Ding Wenwen, Yang Chao, Ding Chongyang. 2024. Gesture recognition by combining spatio-temporal mask and spatial 2D position encoding. Journal of Image and Graphics, 29(05):1421-1433
Objective
Gesture recognition methods often neglect the correlations between fingers while paying excessive attention to individual joint features, which is a key cause of low recognition rates. For example, the index finger and thumb are not physically connected, but their interaction is essential for recognizing the "pinch" gesture. The low recognition rate also stems from the inability to encode the spatial positions of hand joints properly. To capture the correlations between fingers, this study divides the hand joints into blocks, and it addresses the position-encoding problem by encoding the two-dimensional position of each joint from its projection coordinates. To the authors' knowledge, this is the first study to apply spatial two-dimensional position encoding to hand joints.
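To make the block strategy concrete, the sketch below partitions a DHG-14/28-style hand skeleton (22 joints: wrist, palm center, and four joints per finger) into per-finger blocks. The joint indices are illustrative assumptions, not the paper's exact layout.

```python
# Hypothetical finger-block partition of a 22-joint hand skeleton
# (DHG-14/28-style). Indices are illustrative, not the paper's layout.
HAND_BLOCKS = {
    "palm":   [0, 1],            # wrist, palm center
    "thumb":  [2, 3, 4, 5],      # base -> tip
    "index":  [6, 7, 8, 9],
    "middle": [10, 11, 12, 13],
    "ring":   [14, 15, 16, 17],
    "little": [18, 19, 20, 21],
}
```

Each block then acts as one token in the spatial self-attention described in Method below.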
Method
First, a spatiotemporal graph is constructed from the gesture sequence; it contains the physical connections among joints as well as their temporal information, and the spatial and temporal characteristics are learned with mask operations. From the three-dimensional coordinates of each joint, two-dimensional projection coordinates are obtained and fed into a spatial two-dimensional position encoder composed of sine and cosine functions with different frequencies. The plane containing the projection coordinates is divided into several grid cells, the sine-cosine encoder is computed in each cell, and the encoders of all cells are combined to produce the final spatial two-dimensional position code. Embedding this code into the joint features not only strengthens the spatial structure among joints but also avoids joint disorder during movement.

A graph convolutional network then aggregates the features of each spatially encoded joint and its neighbors, and the resulting spatiotemporal graph features are fed into a spatial self-attention module to extract inter-finger correlations. Taking each finger as the unit of analysis, the joints of the spatiotemporal graph are divided into blocks according to the biological structure of the human hand. A learnable linear transformation generates the query (Q), key (K), and value (V) vectors for each finger, and the self-attention mechanism computes the correlations between fingers within each frame; combined with the spatial mask matrix, these correlations yield the inter-finger weight matrix used to update each finger's features. During this update, the spatial mask disconnects the temporal relationships between fingers across frames, preventing the time dimension from influencing the spatial correlation weights.

A temporal self-attention module likewise learns the temporal dynamics of the fingers in the spatiotemporal graph. Temporal one-dimensional position coding first embeds the sequence order of each frame so that the model retains frame ordering during learning. A time-dimension expansion strategy fuses the features of adjacent frames to capture long-range interframe correlations. A learnable linear transformation then generates the query (Q), key (K), and value (V) vectors for each frame, and the self-attention mechanism computes the correlations between frames; combined with the temporal mask matrix, these correlations update the features of each frame while preventing the spatial dimension from influencing the temporal correlation weights. A fully connected network, ReLU activation, and layer normalization follow each attention module to improve training efficiency, and the model finally outputs the learned feature vector for gesture recognition.
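A minimal sketch of how the spatial two-dimensional position code and its fusion with the temporal one-dimensional code might look. This is a plausible reading of the description above, not the paper's implementation: the grid resolution, channel split, and the base frequency of 10 000 are assumptions borrowed from standard sinusoidal encodings.

```python
import numpy as np

def sinusoid(pos, d):
    # Standard sine/cosine encoding of scalar positions into d channels
    # (d assumed even), with the usual 10000^(2i/d) frequency schedule.
    i = np.arange(d // 2)
    freq = 1.0 / (10000.0 ** (2.0 * i / d))                # (d/2,)
    ang = np.asarray(pos, dtype=float)[..., None] * freq   # (..., d/2)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

def spatiotemporal_pe(xy, d=64, grid=16):
    # xy: (T, J, 2) planar joint coordinates normalized to [0, 1].
    # Half the channels encode the x grid cell and half the y grid cell
    # (one sine-cosine encoder per cell index); a shared 1D temporal code
    # per frame is added so the frame order cannot be scrambled.
    T = xy.shape[0]
    cells = np.floor(np.clip(xy, 0.0, 1.0 - 1e-6) * grid)  # (T, J, 2) cell ids
    pe_xy = np.concatenate([sinusoid(cells[..., 0], d // 2),
                            sinusoid(cells[..., 1], d // 2)], axis=-1)
    pe_t = sinusoid(np.arange(T), d)                       # (T, d)
    return pe_xy + pe_t[:, None, :]                        # (T, J, d)

codes = spatiotemporal_pe(np.random.rand(32, 22, 2))       # 32 frames, 22 joints
print(codes.shape)                                         # (32, 22, 64)
```

The masks can likewise be sketched as additive attention masks: the spatial mask lets a finger token attend only to fingers within the same frame. The token flattening order and shapes below are assumptions for illustration.

```python
def masked_attention(q, k, v, mask):
    # Scaled dot-product attention with an additive mask:
    # mask[i, j] = 0 where attention is allowed, -inf where blocked.
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def spatial_mask(num_frames, num_fingers):
    # Tokens are (frame, finger) pairs flattened frame-major; blocking all
    # cross-frame pairs keeps the time dimension out of the spatial weights.
    frame_id = np.repeat(np.arange(num_frames), num_fingers)
    return np.where(frame_id[:, None] == frame_id[None, :], 0.0, -np.inf)
```

For example, with T frames and five finger blocks, spatial_mask(T, 5) yields a (5T, 5T) mask confining attention within each frame; a temporal mask would analogously allow only same-finger pairs across frames, keeping the spatial dimension out of the temporal weights.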
Result
The model is tested on two challenging datasets: DHG-14/28 and SHREC'17 track. Experimental results show that the model achieves the best recognition rate on DHG-14/28, exceeding the HPEV and MS-ISTGCN algorithms by 4.47% and 2.71% on average, respectively. On the SHREC'17 track dataset, the algorithm is 0.47% higher than the HPEV algorithm on average. An ablation experiment confirms the necessity of the spatial two-dimensional position coding, and further tests show that the model achieves its best recognition rate with 64-dimensional joint features and 8 self-attention heads.
Conclusion
Extensive experimental evaluation verifies that the network built from the block strategy and spatial two-dimensional position coding not only strengthens the spatial structure of the joints but also, by using self-attention to learn the correlations between fingers that are not physically connected, improves the gesture recognition rate.
gesture recognition; self-attention; spatial two-dimensional position coding; spatio-temporal mask; hand segmentation
Caputo F M, Prebianca P, Carcangiu A, Spano L D and Giachetti A. 2018. Comparing 3D trajectories for simple mid-air gesture recognition. Computers and Graphics, 73: 17-25 [DOI: 10.1016/J.CAG.2018.02.009]
Chen X H, Guo H K, Wang G J and Zhang L. 2017. Motion feature augmented recurrent neural network for skeleton-based dynamic hand gesture recognition//Proceedings of 2017 IEEE International Conference on Image Processing (ICIP). Beijing, China: IEEE: 2881-2885 [DOI: 10.1109/ICIP.2017.8296809]
Chen Y X, Zhao L, Peng X, Yuan J B and Metaxas D N. 2019. Construct dynamic graphs for hand gesture recognition via spatial-temporal attention [EB/OL]. [2023-06-05]. https://arxiv.org/pdf/1907.08871.pdf
Cheng H, Yang L and Liu Z C. 2016. Survey on 3D hand gesture recognition. IEEE Transactions on Circuits and Systems for Video Technology, 26(9): 1659-1673 [DOI: 10.1109/TCSVT.2015.2469551]
de Smedt Q, Wannous H and Vandeborre J P. 2016. Skeleton-based dynamic hand gesture recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Las Vegas, USA: IEEE: 1206-1214 [DOI: 10.1109/CVPRW.2016.153]
de Smedt Q, Wannous H, Vandeborre J P, Guerry J, Le Saux B and Filliat D. 2017. 3D hand gesture recognition using a depth and skeletal dataset: SHREC’17 track//Proceedings of the Workshop on 3D Object Retrieval. Lyon, France: Eurographics Association: 33-38 [DOI: 10.2312/3dor.20171049]
Devineau G, Moutarde F, Xi W and Yang J. 2018. Deep learning for hand gesture recognition on skeletal data//Proceedings of the 13th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2018). Xi’an, China: IEEE: 106-113 [DOI: 10.1109/FG.2018.00025]
Ding C Y, Wen S, Ding W W, Liu K and Belyaev E. 2022. Temporal segment graph convolutional networks for skeleton-based action recognition. Engineering Applications of Artificial Intelligence, 110: #104675 [DOI: 10.1016/j.engappai.2022.104675]
Fang W, Chen Y P and Xue Q Y. 2021. Survey on research of RNN-based spatio-temporal sequence prediction algorithms. Journal on Big Data, 3(3): 97-110 [DOI: 10.32604/JBD.2021.016993]
He W and Pan C. 2022. The salient object detection based on attention-guided network. Journal of Image and Graphics, 27(4): 1176-1190 [DOI: 10.11834/jig.200658]
Hou J X, Wang G J, Chen X H, Xue J H, Zhu R and Yang H Z. 2018. Spatial-temporal attention res-TCN for skeleton-based dynamic hand gesture recognition//Proceedings of the European Conference on Computer Vision. Munich, Germany: Springer: 273-286 [DOI: 10.1007/978-3-030-11024-6_18]
Jiang Q Y, Wu X J and Xu T Y. 2022. M2FA: multi-dimensional feature fusion attention mechanism for skeleton-based action recognition. Journal of Image and Graphics, 27(8): 2391-2403 [DOI: 10.11834/JIG.210091]
Li S C, Liu Z Y, Duan G F and Tan J R. 2023. MVHANet: multi-view hierarchical aggregation network for skeleton-based hand gesture recognition. Signal, Image and Video Processing, 17(5): 2521-2529 [DOI: 10.21203/RS3.RS-2285220]
Li Y, He Z H, Ye X, He Z G and Han K R. 2019. Spatial temporal graph convolutional networks for skeleton-based dynamic hand gesture recognition. EURASIP Journal on Image and Video Processing, 2019(1): 1-7 [DOI: 10.1186/S13640-019-0476-X]
Li Y K, Ma D Y, Yu Y H, Wei G S and Zhou Y F. 2021. Compact joints encoding for skeleton-based dynamic hand gesture recognition. Computers and Graphics, 97: 191-199 [DOI: 10.1016/J.CAG.2021.04.017]
Liu J B, Liu Y C, Wang Y, Prinet V, Xiang S M and Pan C H. 2020. Decoupled representation learning for skeleton-based gesture recognition//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 5750-5759 [DOI: 10.1109/CVPR42600.2020.00579]
Mai G C, Janowicz K, Yan B, Zhu R, Cai L and Lao N. 2020. Multi-scale representation learning for spatial feature distributions using grid cells [EB/OL]. [2023-06-05]. https://arxiv.org/pdf/2003.00824.pdf
Miah A S M, Hasan M A M and Shin J. 2023. Dynamic hand gesture recognition using multi-branch attention based graph and general deep learning model. IEEE Access, 11: 4703-4716 [DOI: 10.1109/ACCESS.2023.3235368]
Molchanov P, Gupta S, Kim K and Kautz J. 2015. Hand gesture recognition with 3D convolutional neural networks//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Boston, USA: IEEE: 1-7 [DOI: 10.1109/CVPRW.2015.7301342]
Núñez J C, Cabido R, Pantrigo J J, Montemayor A S and Vélez J F. 2018. Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recognition, 76: 80-94 [DOI: 10.1016/J.PATCOG.2017.10.033]
Ohn-Bar E and Trivedi M M. 2013. Joint angles similarities and HOG2 for action recognition//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Portland, USA: IEEE: 465-470 [DOI: 10.1109/CVPRW.2013.76]
Oreifej O and Liu Z C. 2013. HON4D: histogram of oriented 4D normals for activity recognition from depth sequences//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE: 716-723 [DOI: 10.1109/CVPR.2013.98]
Rautaray S S and Agrawal A. 2015. Vision based hand gesture recognition for human computer interaction: a survey. Artificial Intelligence Review, 43(1): 1-54 [DOI: 10.1007/s10462-012-9356-9]
Shi L, Zhang Y F, Cheng J and Lu H Q. 2021. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition//Proceedings of the 15th Asian Conference on Computer Vision. Kyoto, Japan: Springer: 38-53 [DOI: 10.1007/978-3-030-69541-5_3]
Shiri F M, Perumal T, Mustapha N and Mohamed R. 2023. A comprehensive overview and comparative analysis on deep learning models: CNN, RNN, LSTM, GRU [EB/OL]. [2023-06-05]. https://arxiv.org/pdf/2305.17473.pdf
Si C Y, Chen W T, Wang W, Wang L and Tan T N. 2019. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 1227-1236 [DOI: 10.1109/CVPR.2019.00132]
Song J H, Kong K and Kang S J. 2022. Dynamic hand gesture recognition using improved spatio-temporal graph convolutional network. IEEE Transactions on Circuits and Systems for Video Technology, 32(9): 6227-6239 [DOI: 10.1109/TCSVT.2022.3165069]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang C, Liu Z and Chan S C. 2015. Superpixel-based hand gesture recognition with kinect depth camera. IEEE Transactions on Multimedia, 17(1): 29-39 [DOI: 10.1109/TMM.2014.2374357]
Yan S J, Xiong Y J and Lin D H. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition//Proceedings of the 32nd AAAI Conference on Artificial Intelligence and the 13th Innovative Applications of Artificial Intelligence Conference and 8th AAAI Symposium on Educational Advances in Artificial Intelligence. New Orleans, USA: AAAI Press: 7444-7452