SAME-net: scene text recognition method based on soft attention mask embedding
2024, Vol. 29, No. 5: 1381-1391
Print publication date: 2024-05-16
DOI: 10.11834/jig.230081
Chen Weida, Wang Linfei, Tao Dapeng. 2024. SAME-net: scene text recognition method based on soft attention mask embedding. Journal of Image and Graphics, 29(05): 1381-1391
Objective
End-to-end scene text recognition based on deep learning has made great progress. However, constrained by multi-scale text, arbitrary shapes, background interference, and similar issues, most end-to-end text spotters still suffer from incomplete mask proposals, which in turn degrades their recognition results. To improve the accuracy of mask prediction, a soft attention mask embedding (SAME) module is proposed.
Method
Exploiting the superior global receptive field of the Transformer, the high-level features are encoded and soft attention is computed; the encoded features are then embedded with the predicted mask level by level, generating masks that adhere more closely to the text boundary and thus suppress background noise. Building on the strong mask refinement and fine-grained text feature extraction ability of SAME, a robust text spotting framework, SAME-Net, is further proposed to perform accurate end-to-end text recognition without character-level annotations. Specifically, because the soft attention is differentiable, the proposed SAME-Net can propagate the recognition loss back to the detection branch and guide text detection through the learned attention weights, so that the detection branch is jointly optimized by the detection and recognition objectives.
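To make the mechanism above concrete, the following minimal PyTorch-style sketch illustrates one plausible realization of the soft attention mask embedding: a Transformer encoder provides global context, a 1 × 1 head produces a soft attention map, and the map is multiplied with the coarse mask proposal to obtain a differentiable refined mask. All module names, shapes, and hyperparameters here are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class SoftAttentionMaskEmbedding(nn.Module):
    # Minimal sketch of a SAME-style module (names and sizes are assumptions).
    def __init__(self, channels=256, num_heads=8, num_layers=2):
        super().__init__()
        # Transformer encoder gives every spatial position a global receptive field.
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # 1x1 projection turns the encoded features into per-pixel attention logits.
        self.attn_head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats, coarse_mask):
        # feats: high-level feature map (B, C, H, W)
        # coarse_mask: mask proposal from the detection branch (B, 1, H, W), values in [0, 1]
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)        # (B, H*W, C)
        encoded = self.encoder(tokens)                   # global context per position
        encoded = encoded.transpose(1, 2).reshape(b, c, h, w)
        attn = torch.sigmoid(self.attn_head(encoded))    # soft attention map (B, 1, H, W)
        refined_mask = attn * coarse_mask                # embed attention with the mask proposal
        masked_feats = feats * refined_mask              # features passed on to the recognizer
        return refined_mask, masked_feats

Because refined_mask is produced entirely by differentiable operations, gradients from a downstream recognition loss can flow through it back into the detection branch that produced coarse_mask.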
Result
Experiments on several public text spotting benchmarks demonstrate the effectiveness of the proposed method. On the arbitrarily shaped text dataset Total-Text, SAME-Net achieves an H-mean of 84.02%; compared with GLASS (global to local attention for scene-text spotting) from 2022, the full-lexicon recognition accuracy is improved by 1.02% without additional training data. On the multi-oriented dataset ICDAR 2015 (International Conference on Document Analysis and Recognition), the proposed method also performs on par with contemporaneous work, achieving 83.4% for strong-lexicon recognition.
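For reference, the H-mean reported here is the standard harmonic mean of precision P and recall R used in scene text detection and spotting evaluation: H-mean = 2PR / (P + R).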
Conclusion
An end-to-end text spotting method based on SAME is proposed. The method exploits the global receptive field of the Transformer to generate masks close to the text boundary and suppress background noise. The proposed SAME module can propagate the recognition loss back to the detection module and requires no additional text rectification module. Through the joint optimization of the detection and recognition modules, excellent text localization performance is achieved without character-level annotations.
Objective
Text detection and recognition in natural scenes is a long-standing and challenging problem. Hence, this study aims to detect and recognize text information in natural scene images. Owing to its wide applications (e.g., traffic sign recognition and content-based image retrieval), text detection and recognition has attracted much attention in the field of computer vision. Traditional scene text detection and recognition methods regard detection and recognition as two independent tasks: the text regions of the input image are first predicted, the corresponding areas are cropped, and the cropped regions are then fed into a recognizer. However, this process has limitations: 1) inaccurate detection results may seriously affect the performance of text recognition owing to the accumulation of errors between the two tasks, and 2) optimizing the two tasks separately may not improve the results of text recognition. In recent years, end-to-end scene text recognition based on deep learning has made great progress. Many studies have found that detection and recognition are closely related. End-to-end recognition, which integrates the detection and recognition tasks so that they can promote each other, has gradually become an important research direction. In the end-to-end setting, natural scene images contain disturbing factors, such as illumination changes, deformation, and stains. In addition, scene text appears in different colors, fonts, sizes, orientations, and shapes, making text detection very difficult. Limited by multi-scale text, arbitrary shapes, background interference, and other issues, most end-to-end text recognizers still face the problem of incomplete mask proposals, which affects the text recognition results of the model. Hence, we propose a soft attention mask embedding (SAME) module to improve the accuracy of mask prediction. This module effectively improves the robustness and accuracy of the model.
Method
High-level features are encoded, and soft attention is computed using the global receptive field of the Transformer. The encoded features are then embedded with the predicted mask to generate a mask close to the text boundary, which suppresses background noise. Based on these designs, we propose a simple and robust end-to-end text recognition framework, SAME-Net. Because the soft attention is differentiable, the proposed SAME module can propagate the recognition loss back to the detection branch and guide text detection by learning the attention weights, so that the detection branch is jointly optimized by the detection and recognition objectives. SAME-Net needs neither an additional text rectification module nor character-level text annotations.
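As a rough illustration of this joint optimization, the sketch below shows how a recognition loss computed on soft-masked features can back-propagate into the detection branch; the interfaces of detector, same_module, and recognizer are hypothetical and only indicate the flow of gradients, not the paper's exact formulation.

import torch

def training_step(detector, same_module, recognizer, optimizer, images, det_targets, rec_targets):
    # Hypothetical interfaces: detector returns features, a coarse mask proposal, and its loss;
    # same_module is a soft attention mask embedding block; recognizer returns a recognition loss.
    feats, coarse_mask, det_loss = detector(images, det_targets)
    refined_mask, masked_feats = same_module(feats, coarse_mask)   # differentiable soft mask
    rec_loss = recognizer(masked_feats, rec_targets)
    # Because the soft mask is differentiable, rec_loss also yields gradients with respect
    # to coarse_mask, so the detection branch is optimized by both objectives jointly.
    loss = det_loss + rec_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()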
Result
This method can effectively detect multi-scale and arbitrarily shaped text. On the public arbitrarily shaped dataset Total-Text, the recall and H-mean reach 0.884 8 and 0.879 6, respectively. Compared with the best result among the compared methods, and without additional training data, the recognition accuracy without a lexicon is improved by 2.36%, and the full-lexicon recognition accuracy is improved by 5.62%. In terms of detection, the recall and H-mean of this method increase from 0.868 to 0.884 8 and from 0.861 to 0.879 6, respectively, greatly exceeding the previous methods in end-to-end recognition. The method also obtains an 83.4% strong-lexicon recognition result on the multi-oriented dataset ICDAR 2015 (International Conference on Document Analysis and Recognition). In short, the proposed method outperforms the compared approaches.
Conclusion
The performance of the proposed SAME-Net improves significantly on the two scene text datasets ICDAR 2015 and Total-Text, where it obtains the best results for this task. This study proposes an end-to-end text recognition method based on SAME. The proposed method has two advantages. First, it uses the global receptive field of the Transformer to embed high-level encoded features with the predicted mask, generating a mask close to the text boundary to suppress background noise. Second, the proposed SAME module can propagate the recognition loss back to the detection module, and no additional text rectification module is needed. Excellent text localization performance can be achieved without character-level annotations through the joint optimization of the detection and recognition modules.
Keywords: natural scene text detection; natural scene text recognition; soft attention embedding; deep learning; end-to-end natural scene text detection and recognition
Bissacco A, Cummins M, Netzer Y and Neven H. 2013. PhotoOCR: reading text in uncontrolled conditions//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 785-792 [DOI: 10.1109/ICCV.2013.102]
Ch’ng C K and Chan C S. 2017. Total-Text: a comprehensive dataset for scene text detection and recognition//Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Kyoto, Japan: IEEE: 935-942 [DOI: 10.1109/ICDAR.2017.157]
Feng W, He W H, Yin F, Zhang X Y and Liu C L. 2019. TextDragon: an end-to-end framework for arbitrary shaped text spotting//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 9075-9084 [DOI: 10.1109/ICCV.2019.00917]
Gao L C, Li Y B, Du L, Zhang X P, Zhu Z Y, Lu N, Jin L W, Huang Y S and Tang Z. 2022. A survey on table recognition technology. Journal of Image and Graphics, 27(6): 1898-1917 [DOI: 10.11834/jig.220152]
Gupta A, Vedaldi A and Zisserman A. 2016. Synthetic data for text localization in natural image [EB/OL]. [2023-02-25]. https://arxiv.org/pdf/1604.06646.pdf
He T, Tian Z, Huang W L, Shen C H, Qiao Y and Sun C M. 2018. An end-to-end textspotter with explicit alignment and attention//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5020-5029 [DOI: 10.1109/CVPR.2018.00527]
Hu J, Cao L J, Lu Y, Zhang S C, Wang Y, Li K, Huang F Y, Shao L and Ji R R. 2021. ISTR: end-to-end instance segmentation with Transformers [EB/OL]. [2023-03-12]. https://arxiv.org/pdf/2105.00637.pdf
Karatzas D, Gomez-Bigorda L, Nicolaou A, Ghosh S, Bagdanov A, Iwamura M, Matas J, Neumann L, Chandrasekhar V R, Lu S J, Shafait F, Uchida S and Valveny E. 2015. ICDAR 2015 competition on robust reading//Proceedings of the 13th International Conference on Document Analysis and Recognition. Tunis, Tunisia: IEEE: 1156-1160 [DOI: 10.1109/ICDAR.2015.7333942]
Li H, Wang P and Shen C H. 2017. Towards end-to-end text spotting with convolutional recurrent neural networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5248-5256 [DOI: 10.1109/ICCV.2017.560]
Liao M H, Pang G, Huang J, Hassner T and Bai X. 2020. Mask TextSpotter v3: segmentation proposal network for robust scene text spotting//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 706-722 [DOI: 10.1007/978-3-030-58621-8_41]
Liao M H, Shi B G, Bai X, Wang X G and Liu W Y. 2017. TextBoxes: a fast text detector with a single deep neural network//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI Press: 4161-4167 [DOI: 10.1609/aaai.v31i1.11196]
Lin T Y, Goyal P, Girshick R, He K M and Dollár P. 2017. Focal loss for dense object detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2999-3007 [DOI: 10.1109/ICCV.2017.324]
Liu C Y, Chen X X, Luo C J, Jin L W, Xue Y and Liu Y L. 2021. Deep learning methods for scene text detection and recognition. Journal of Image and Graphics, 26(6): 1330-1367 [DOI: 10.11834/jig.210044]
Liu X B, Liang D, Yan S, Chen D G, Qiao Y and Yan J J. 2018. FOTS: fast oriented text spotting with a unified network//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5676-5685 [DOI: 10.1109/CVPR.2018.00595]
Liu Y L, Chen H, Shen C H, He T, Jin L W and Wang L W. 2020. ABCNet: real-time scene text spotting with adaptive bezier-curve network//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 9806-9815 [DOI: 10.1109/CVPR42600.2020.00983]
Liu Y L, Shen C H, Jin L W, He T, Chen P, Liu C Y and Chen H. 2022. ABCNet v2: adaptive bezier-curve network for real-time end-to-end text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11): 8048-8064 [DOI: 10.1109/TPAMI.2021.3107437]
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin Transformer: hierarchical vision Transformer using shifted windows//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada: IEEE: 9992-10002 [DOI: 10.1109/ICCV48922.2021.00986]
Lyu P, Liao M H, Yao C, Wu W H and Bai X. 2018. Mask TextSpotter: an end-to-end trainable neural network for spotting text with arbitrary shapes//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 71-88 [DOI: 10.1007/978-3-030-01264-9_5]
Qiao L, Chen Y, Cheng Z Z, Xu Y L, Niu Y, Pu S L and Wu F. 2021. MANGO: a mask attention guided one-stage scene text spotter [EB/OL]. [2023-02-25]. https://arxiv.org/pdf/2012.04350v1.pdf
Qin S Y, Bissacco A, Raptis M, Fujii Y and Xiao Y. 2019. Towards unconstrained end-to-end text spotting//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 4703-4713 [DOI: 10.1109/ICCV.2019.00480]
Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I and Savarese S. 2019. Generalized intersection over union: a metric and a loss for bounding box regression//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 658-666 [DOI: 10.1109/CVPR.2019.00075]
Ronen R, Tsiper S, Anschel O, Lavi I, Markovitz A and Manmatha R. 2022. GLASS: global to local attention for scene-text spotting [EB/OL]. [2023-02-25]. https://arxiv.org/pdf/2208.03364.pdf
Wang H, Lu P, Zhang H, Yang M K, Bai X, Xu Y C, He M C, Wang Y P and Liu W Y. 2020. All you need is boundary: toward arbitrary-shaped text spotting//Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, USA: AAAI Press: 12160-12167 [DOI: 10.48550/AAAI.2020.v34i07.6896]
Wang K, Babenko B and Belongie S. 2011. End-to-end scene text recognition//Proceedings of 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE: 1457-1464 [DOI: 10.1109/ICCV.2011.6126402]
Wang P F, Zhang C Q, Qi F, Liu S S, Zhang X Q, Lyu P Y, Han J Y, Liu J T, Ding E R and Shi G M. 2021. PGNet: real-time arbitrarily-shaped text spotting with point gathering network [EB/OL]. [2023-03-12]. https://arxiv.org/pdf/2104.05458.pdf
Wang W H, Xie E Z, Li X, Liu X B, Liang D, Yang Z B, Lu T and Shen C H. 2022. PAN++: towards efficient and accurate end-to-end spotting of arbitrarily-shaped text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9): 5349-5367 [DOI: 10.1109/TPAMI.2021.3077555]
Zhang Q M, Xu Y F, Zhang J and Tao D C. 2023. ViTAEv2: vision Transformer advanced by exploring inductive bias for image recognition and beyond. International Journal of Computer Vision, 131(5): 1141-1162 [DOI: 10.1007/s11263-022-01739-w]
Zhong H M, Tang J, Wang W H, Yang Z B, Yao C and Lu T. 2021. ARTS: eliminating inconsistency between text detection and recognition with auto-rectification text spotter [EB/OL]. [2023-02-25]. https://arxiv.org/pdf/2110.10405.pdf