Universal detection method for mitigating adversarial text attacks through token loss information
2024, Vol. 29, No. 7, Pages 1875-1888
Print publication date: 2024-07-16
DOI: 10.11834/jig.230432
Chen Yuhan, Du Xia, Wang Dahan, Wu Yun, Zhu Shunzhi, Yan Yan. 2024. Universal detection method for mitigating adversarial text attacks through token loss information. Journal of Image and Graphics, 29(07):1875-1888
Objective
Adversarial text attacks fall into two main categories: instance-based attacks and universal non-instance attacks. Universal non-instance attacks, represented by the universal trigger (UniTrigger), severely disrupt text prediction tasks: by generating a specific attack sequence, UniTrigger reduces the prediction accuracy of the target model to nearly zero. To defend against universal text trigger attacks, this paper draws inspiration from adversarial example detectors in the image domain and proposes an adversarial text detection method based on token loss weight information (loss-based detect universal adversarial attack, LBD-UAA) to defend against UniTrigger attacks.
Method
LBD-UAA first splits the target sample into individual token sequences and then computes the token-loss value (TLV) of each sequence to build a full-sample sequence lookup table. Because the perturbation sequences introduced by a UniTrigger attack carry large values in this lookup table, the full-sequence lookup table is finally fed into a designed difference detector, which flags adversarial text through a threshold gate.
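As a rough formalization of the TLV metric (the precise definition is given in the full paper; the leave-one-out form below is only an illustrative assumption), the TLV of the $i$-th token $x_i$ can be written as

$$\mathrm{TLV}(x_i) = \mathcal{L}\bigl(f(x_{\setminus i}),\, y\bigr) - \mathcal{L}\bigl(f(x),\, y\bigr),$$

where $x$ is the token sequence, $x_{\setminus i}$ denotes $x$ with the $i$-th token removed, $f$ is the target classifier, $\mathcal{L}$ is the classification loss, and $y$ is the model's prediction on the unmodified input. Tokens whose removal changes the loss sharply (large $|\mathrm{TLV}(x_i)|$) are the ones the difference detector would flag as likely trigger tokens.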
Result
Performance experiments on four datasets verify the effectiveness of the proposed method. The results show that the method achieves an adversarial example detection accuracy of up to 97.17% and a maximum adversarial example recall of 100%. Compared with three other detection methods, LBD-UAA achieves best-case true positive and false positive rates of 99.6% and 6.8%, respectively, outperforming them by a large margin. In addition, introducing a prior judgment reduces the misclassification rate on short samples by about 50%.
Conclusion
We propose the LBD-UAA detection method for non-instance universal adversarial attacks represented by UniTrigger. It achieves the best detection results on multiple datasets and provides a more effective reference mechanism for adversarial text detection.
Objective
In recent years, adversarial text attacks have become a prominent research problem in natural language processing security. An adversarial text attack is a malicious attack that misleads a text classifier by modifying the original text to craft an adversarial example. Malicious goals such as SMS phishing (smishing) scams, advertisement spam, malicious comments, and opinion manipulation can be achieved by crafting attacks that mislead the text classifiers deployed for the corresponding tasks. A perfect adversarial text example must carry an imperceptible adversarial perturbation while leaving syntactic and semantic correctness unaffected, which significantly increases the difficulty of the attack. Adversarial attack methods from the image domain cannot be applied directly to text because text is discrete. Existing text attacks can be categorized into two dominant groups: instance-based attacks and learning-based universal non-instance attacks. Instance-based attacks generate a specific adversarial example for each input. Among learning-based universal non-instance attacks, the universal trigger (UniTrigger) is the most representative: it reduces the accuracy of the target model to near zero by generating a fixed attack sequence. Existing detection methods mainly address instance-based attacks, whereas the detection of UniTrigger attacks has seldom been studied. Inspired by logit-based adversarial detectors in computer vision, we propose a UniTrigger defense method based on token loss weight information.
Method
In our proposed loss-based detect universal adversarial attack (LBD-UAA) method, a pre-trained model is used to transform the token sequence into a word-vector sequence, giving a representation of each token in the semantic space. Then, for each token position under evaluation, we remove that token and feed the remaining token sequence into the model. We use the token-loss value (TLV) metric to obtain the weight proportion of each token and thereby build a full-sample sequence lookup table. Token sequences without UniTrigger perturbations fluctuate far less in the TLV metric than adversarial examples do, and prior knowledge suggests that strong fluctuations in a token sequence are the result of the adversarial perturbations generated by UniTrigger. Hence, the TLV full-sequence lookup table yields clearly distinct numerical differences between clean and adversarial samples, and these differences can serve as the data representation of each sample. Building on this observation, we set a difference threshold to bound the admissible magnitude of variation: if the magnitude exceeds the threshold, the input sample is identified as an adversarial instance.
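To make the pipeline concrete, the following minimal sketch implements a leave-one-out variant of the procedure described above. It is only an illustration of the idea, not the paper's implementation: the leave-one-out TLV definition, the victim model name, and the threshold value `tau` are assumptions to be adapted.

```python
# Minimal sketch of an LBD-UAA-style detector, under assumptions:
# - TLV is taken to be the leave-one-out change in classification loss
#   (the paper's exact definition may differ);
# - the victim model is an arbitrary HuggingFace sequence classifier;
# - the detection threshold `tau` is a hypothetical value to be tuned on clean data.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "textattack/bert-base-uncased-SST-2"  # assumed victim model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def classification_loss(text: str, label: int) -> float:
    """Cross-entropy loss of the victim model on a single (text, label) pair."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return F.cross_entropy(logits, torch.tensor([label])).item()

def tlv_lookup_table(text: str, label: int) -> list[float]:
    """Build the full-sample lookup table: one leave-one-out loss delta per token."""
    words = text.split()
    base_loss = classification_loss(text, label)
    table = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])  # sample with token i removed
        table.append(classification_loss(reduced, label) - base_loss)
    return table

def is_adversarial(text: str, label: int, tau: float = 2.0) -> bool:
    """Difference detector: flag the sample if any token's TLV exceeds the threshold.
    Trigger tokens dominate the prediction, so removing them changes the loss sharply."""
    table = tlv_lookup_table(text, label)
    return max(abs(v) for v in table) > tau

# Example usage (label 1 = positive for SST-2-style classifiers):
print(is_adversarial("zoning tapping fiennes this film was absolutely wonderful", label=1))
```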
Result
To demonstrate the efficacy of the proposed approach, we conducted performance evaluations on four widely used text classification datasets: SST-2, MR, AG, and Yelp. SST-2 and MR are short-text datasets, while AG and Yelp, which cover domain-specific news articles and website reviews, are long-text datasets. First, we generated the corresponding trigger sequences by attacking specific categories of the four datasets through the UniTrigger attack framework. We then mixed the adversarial samples evenly with clean samples and fed them into LBD-UAA for adversarial detection. Experimental results across the four datasets indicate that the method achieves a maximum detection rate of 97.17%, with a recall rate reaching 100%. Compared with four other detection methods, our approach performs best overall, with a true positive rate of 99.6% and a false positive rate of 6.8%. Even on the challenging MR dataset, it retains a 96.2% detection rate and outperforms state-of-the-art approaches. In the generalization experiments, we detected adversarial samples generated by three attack methods from TextBugger and by the PWWS attack. The results indicate that LBD-UAA achieves strong detection performance across these four word-level attack methods, with average true positive rates of 86.77%, 90.98%, 90.56%, and 93.89%, respectively. This finding demonstrates that LBD-UAA can also discriminate instance-specific adversarial samples, showing robust generalization. Moreover, the proposed difference threshold setting reduces the false positive rate of short-sample detection by about 50%.
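For reference, the reported true positive and false positive rates follow the standard definitions over the evenly mixed batch of adversarial and clean samples; a generic sketch (not the paper's evaluation code; the detector decisions below are made up) is:

```python
# Hypothetical evaluation sketch: compute TPR/FPR for a detector that labels
# samples as adversarial (1) or clean (0).
def tpr_fpr(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

# An evenly mixed batch of adversarial (1) and clean (0) samples, as in the experiments:
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]   # made-up detector decisions
print(tpr_fpr(y_true, y_pred))             # -> (1.0, 0.2)
```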
Conclusion
In this paper, we follow the design philosophy of adversarial detection in the image domain and, for the first time in universal text adversarial defense, introduce a detection method, LBD-UAA, that leverages token weight information measured by the token-loss value (TLV). We are the first to detect UniTrigger attacks by using token loss weights in the adversarial text domain. The method is tailored to learning-based universal adversarial attacks, and its defensive capability has been evaluated on sentiment analysis and text classification models across two short-text and two long-text datasets. During the experiments, we observed that the numerical feedback from TLV can identify the specific locations where perturbation sequences were added to some samples. Future work will focus on using the proposed detection method to eliminate high-risk samples, potentially allowing adversarial samples to be restored. We believe that LBD-UAA opens up additional possibilities for future defenses against UniTrigger-type and other text-based adversarial strategies and provides a more effective reference mechanism for adversarial text detection.
Keywords: adversarial text examples; universal triggers; text classification; deep learning; adversarial detection
Athalye A, Carlini N and Wagner D. 2018. Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples//Proceedings of the 35th International Conference on Machine Learning (ICML 2018). Stockholm, Sweden: PMLR: 274-283
Alzantot M, Sharma Y, Elgohary A, Ho B J, Srivastava M B and Chang K W. 2018. Generating natural language adversarial examples//Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018). Brussels, Belgium: Association for Computational Linguistics: 2890-2896 [DOI: 10.18653/v1/d18-1316]
Behjati M, Moosavi-Dezfooli S M, Baghshah M S and Frossard P. 2019. Universal adversarial attacks on text classifiers//Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE: 7345-7349 [DOI: 10.1109/ICASSP.2019.8682430]
Bajaj A and Vishwakarma D K. 2023. Evading text based emotion detection mechanism via adversarial attacks. Neurocomputing, 558: #126787 [DOI: 10.1016/j.neucom.2023.126787]
Cer D, Yang Y F, Kong S Y, Hua N, Limtiaco N, John R S, Constant N, Guajardo-Cespedes M, Yuan S, Tar C, Strope B and Kurzweil R. 2018. Universal sentence encoder for English//Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Brussels, Belgium: Association for Computational Linguistics: 169-174 [DOI: 10.18653/v1/d18-2029]
Dong Y P, Liao F Z, Pang T Y, Su H, Zhu J, Hu X L and Li J G. 2018. Boosting adversarial attacks with momentum//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018). Salt Lake City, USA: IEEE: 9185-9193 [DOI: 10.1109/cvpr.2018.00957]
Ebrahimi J, Rao A Y, Lowd D and Dou D J. 2018. HotFlip: white-box adversarial examples for text classification//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018). Melbourne, Australia: Association for Computational Linguistics: 31-36 [DOI: 10.18653/v1/P18-2006]
Fang X J and Wang W. 2023. Defending machine reading comprehension against question-targeted attacks//Proceedings of 2023 International Joint Conference on Neural Networks (IJCNN). Gold Coast, Australia: IEEE: 1-8 [DOI: 10.1109/IJCNN54540.2023.10191697]
Mosca E, Wich M and Groh G. 2021. Understanding and interpreting the impact of user context in hate speech detection//Proceedings of the 9th International Workshop on Natural Language Processing for Social Media. [s.l.]: Association for Computational Linguistics: 91-102 [DOI: 10.18653/v1/2021.socialnlp-1.8]
Goodfellow I, Shlens J and Szegedy C. 2015. Explaining and harnessing adversarial examples//Proceedings of 2015 International Conference on Learning Representations (ICLR 2015)
Gao J, Lanchantin J, Soffa M L and Qi Y J. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers//Proceedings of 2018 IEEE Security and Privacy Workshops. San Francisco, USA: IEEE: 50-56 [DOI: 10.1109/SPW.2018.00016]
Iyyer M, Wieting J, Gimpel K and Zettlemoyer L. 2018. Adversarial example generation with syntactically controlled paraphrase networks//Proceedings of 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New Orleans, USA: Association for Computational Linguistics: 1875-1885 [DOI: 10.18653/v1/n18-1170]
Jin D, Jin Z J, Zhou J T and Szolovits P. 2020. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment//Proceedings of 2020 Association for the Advancement of Artificial Intelligence (AAAI 2020). New York, USA: AAAI: 8018-8025 [DOI: 10.1609/aaai.v34i05.6311]
Yuan L F, Zhang Y C, Chen Y Y and Wei W. 2023. Bridge the gap between CV and NLP! A gradient-based textual adversarial attack framework//Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics: 7132-7146 [DOI: 10.18653/v1/2023.findings-acl.446]
Le T, Park N and Lee D. 2021. A sweet rabbit hole by DARCY: using honeypots to detect universal trigger's adversarial attacks//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. [s.l.]: Association for Computational Linguistics: 3831-3844 [DOI: 10.18653/v1/2021.acl-long.296]
Le T, Wang S H and Lee D. 2020. MALCOM: generating malicious comments to attack neural fake news detection models//Proceedings of 2020 International Conference on Data Mining (ICDM 2020). Sorrento, Italy: IEEE: 282-291 [DOI: 10.1109/ICDM50108.2020.00037]
Li J F, Ji S L, Du T Y, Li B and Wang T. 2019. TextBugger: generating adversarial text against real-world applications//Proceedings of 2019 Network and Distributed System Security Symposium (NDSS 2019). San Diego, USA: The Internet Society [DOI: 10.14722/ndss.2019.23138]
Li S, Zhao Z, Hu R F, Li W S, Liu T and Du X Y. 2018. Analogical reasoning on Chinese morphological and semantic relations//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018). Melbourne, Australia: Association for Computational Linguistics: 138-143 [DOI: 10.18653/v1/p18-2023]
McCann B, Bradbury J, Xiong C M and Socher R. 2017. Learned in translation: contextualized word vectors//Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS). Long Beach, USA: Curran Associates Inc.: 6297-6308 [DOI: 10.5555/3295222.3295377]
Ma X J, Li B, Wang Y S, Erfani S M, Wijewickrema S N R, Schoenebeck G, Song D, Houle M E and Bailey J. 2018. Characterizing adversarial subspaces using local intrinsic dimensionality//Proceedings of the 6th International Conference on Learning Representations (ICLR 2018). Vancouver, Canada: ICLR
Moosavi-Dezfooli S M, Fawzi A, Fawzi O and Frossard P. 2017. Universal adversarial perturbations//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). Honolulu, USA: IEEE: 1765-1773 [DOI: 10.1109/CVPR.2017.17]
Pang B and Lee L. 2005. Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales//Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Ann Arbor, Michigan, USA: Association for Computational Linguistics: 115-124 [DOI: 10.3115/1219840.1219855]
Gan W C and Ng H T. 2019. Improving the robustness of question answering systems to question paraphrasing//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Florence, Italy: Association for Computational Linguistics: 6065-6075 [DOI: 10.18653/v1/p19-1610]
Pruthi D, Dhingra B and Lipton Z C. 2019. Combating adversarial misspellings with robust word recognition//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Florence, Italy: Association for Computational Linguistics: 5582-5591 [DOI: 10.18653/v1/p19-1561]
Papernot N, McDaniel P, Swami A and Harang R. 2016. Crafting adversarial input sequences for recurrent neural networks//Proceedings of 2016 Military Communications Conference (MILCOM 2016). Baltimore, USA: IEEE: 49-54 [DOI: 10.1109/MILCOM.2016.7795300]
Qian S C, Wen Y H, Ma Y F and Mao X W. 2022. Adversarial sample attack and defense methods based on deep neural networks. Information Security and Technology, 13(5): 77-86 [DOI: 10.3969/j.issn.1674-9456.2022.05.014]
Rodriguez N and Rojas-Galeano S. 2018. Shielding Google's language toxicity model against adversarial attacks [EB/OL]. [2023-05-29]. https://arxiv.org/pdf/1801.01828.pdf
Ren S H, Deng Y H, He K and Che W X. 2019. Generating natural language adversarial examples through probability weighted word saliency//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Florence, Italy: Association for Computational Linguistics: 1085-1097 [DOI: 10.18653/v1/p19-1103]
Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I J and Fergus R. 2014. Intriguing properties of neural networks//Proceedings of the 2nd International Conference on Learning Representations (ICLR 2014). Banff, Canada: ICLR
Smith L and Gal Y. 2018. Understanding measures of uncertainty for adversarial example detection//Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI 2018). Monterey, USA: AUAI: 560-569
Wallace E, Feng S, Kandpal N, Gardner M and Singh S. 2019. Universal adversarial triggers for attacking and analyzing NLP//Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics: 2153-2163 [DOI: 10.18653/v1/D19-1221]
Wang A, Singh A, Michael J, Hill F, Levy O and Bowman S R. 2018. GLUE: a multi-task benchmark and analysis platform for natural language understanding//Proceedings of 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Brussels, Belgium: Association for Computational Linguistics: 353-355 [DOI: 10.18653/v1/w18-5446]
Wang X S and He K. 2021. Enhancing the transferability of adversarial attacks through variance tuning//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021). Nashville, USA: IEEE: 1924-1933 [DOI: 10.1109/CVPR46437.2021.00196]
Xu H, Ma Y, Liu H C, Deb D, Liu H, Tang J L and Jain A K. 2020. Adversarial attacks and defenses in images, graphs and text: a review. International Journal of Automation and Computing, 17(2): 151-178 [DOI: 10.1007/s11633-019-1211-x]
Yuan T H, Ji S H, Zhang P C, Cai H B, Dai Q Y, Ye S J and Ren B. 2022. Adversarial example generation method for black box intelligent speech software. Journal of Software, 33(5): 1569-1586 [DOI: 10.13328/j.cnki.jos.006549]
Zhang X, Zhao J B and LeCun Y. 2015. Character-level convolutional networks for text classification//Proceedings of 2015 Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (NeurIPS 2015). Montreal, Canada: NIPS: 649-657
Zhang W E, Sheng Q Z, Alhazmi A and Li C L. 2020. Adversarial attacks on deep-learning models in natural language processing: a survey. ACM Transactions on Intelligent Systems and Technology, 11(3): #24 [DOI: 10.1145/3374217]
Zhao J J, Wang J W and Wu J F. 2023. Adversarial attack method identification model based on multi-factor compression error. Journal of Image and Graphics, 28(3): 850-863 [DOI: 10.11834/jig.220516]
Zeng J H, Xu J H, Zheng X Q and Huang X J. 2023. Certified robustness to text adversarial attacks by randomized [MASK]. Computational Linguistics, 49(2): 395-427 [DOI: 10.1162/coli_a_00476]