Development status and trends of multimodal sentiment recognition and understanding
2024, Vol. 29, No. 6, pp. 1607-1627
Received: 2024-01-06
Revised: 2024-02-18
Published in print: 2024-06-16
DOI: 10.11834/jig.240017
Affective computing is an important branch of artificial intelligence, with broad applications in interaction, education, security, finance, and many other fields. Emotion recognition that relies solely on a single modality such as speech or video does not match the way humans perceive emotion, and its accuracy drops rapidly under interference. To fully exploit the complementarity of different modalities, research on multimodal-fusion emotion recognition is receiving growing attention. This paper surveys the state of multimodal affective computing research from three perspectives: an overview of multimodal emotion recognition, multimodal emotion recognition and understanding, and the detection of and intervention in affective disorders such as depression. We argue that scalable emotion feature design and recognition methods based on transfer learning from large models will be future development directions, with an increasingly prominent role in addressing affective disorders such as depression and anxiety.
Affective computing is an important branch in the field of artificial intelligence (AI). It aims to build computational systems that can automatically perceive, recognize, understand, and provide feedback on human emotions, and it sits at the intersection of multiple disciplines such as computer science, neuroscience, psychology, and social science. Deep emotional understanding and interaction enable computers to better understand and respond to human emotional needs, and to provide personalized interactions and feedback based on emotional states, which enhances the human-computer interaction experience. Affective computing has various applications in areas such as intelligent assistants, virtual reality, and smart healthcare. Relying solely on single-modal information, such as speech or video, does not align with the way humans perceive emotions, and recognition accuracy decreases rapidly in the presence of interference. Multimodal emotion understanding and interaction technologies aim to fully model multidimensional information from audio, video, and physiological signals to achieve more accurate emotion understanding. This technology is a fundamental and important prerequisite for achieving natural, human-like, and personalized human-computer interaction, and it holds significant value in the era of intelligence and digitalization. Multimodal fusion for sentiment recognition is therefore receiving increasing attention from researchers because it fully exploits the complementary nature of different modalities. This study introduces the current research status of multimodal sentiment computing from three dimensions: an overview of multimodal sentiment recognition, multimodal sentiment understanding, and the detection and assessment of emotional disorders such as depression. The overview of emotion recognition is elaborated from the aspects of academic definition, mainstream datasets, and international competitions.
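As a minimal illustration of how complementary modalities can be combined, the sketch below performs decision-level (late) fusion by averaging per-modality emotion posteriors. The class set, probability values, and equal modality weights are all invented for illustration and are not taken from any of the surveyed systems.

```python
import numpy as np

# Hypothetical per-modality posteriors for one utterance over four
# emotion classes (neutral, happy, sad, angry); values are invented.
p_audio = np.array([0.10, 0.60, 0.20, 0.10])
p_video = np.array([0.05, 0.70, 0.15, 0.10])
p_text = np.array([0.20, 0.50, 0.20, 0.10])

def late_fusion(posteriors, weights):
    """Weighted average of per-modality class posteriors."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize modality weights to sum to 1
    return np.average(posteriors, axis=0, weights=w)

fused = late_fusion(np.stack([p_audio, p_video, p_text]), [1.0, 1.0, 1.0])
pred = int(np.argmax(fused))  # fused emotion class index
```

In practice the weights would be tuned per modality (e.g., down-weighting a noisy audio channel), which is one simple way fusion stays robust when a single modality is corrupted by interference.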
In recent years, large language models (LLMs) have demonstrated excellent modeling capabilities and achieved great success in natural language processing owing to their outstanding language understanding and reasoning abilities. LLMs have garnered widespread attention for their ability to handle various complex tasks by understanding prompts with few-shot or zero-shot learning. Through methods such as self-supervised learning and contrastive learning, LLMs can learn more expressive multimodal representations that capture the correlations between different modalities and emotional information. Multimodal sentiment recognition and understanding are discussed in terms of emotion feature extraction, multimodal fusion, and the representations and models involved in sentiment recognition against the background of pretrained large models. With the rapid development of society, people face increasing pressure, which can lead to depression, anxiety, and other negative emotions, and those who remain in a prolonged state of depression and anxiety are more likely to develop mental illnesses. Depression is a common and serious condition whose symptoms include low mood, poor sleep quality, loss of appetite, fatigue, and difficulty concentrating; it not only harms individuals and families but also causes significant economic losses to society. Our discussion of emotional disorder detection starts from specific applications and takes depression, the most common emotional disorder, as an example; we analyze its latest developments and trends from the perspectives of assessment and intervention. In addition, this study provides a detailed comparison of the domestic research status of affective computing, and prospects for future development trends are offered. We believe that scalable emotion feature design and recognition methods based on transfer learning from large models will be the future directions of development.
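As a rough sketch of the contrastive-learning idea mentioned above, the toy code below computes a symmetric InfoNCE-style loss that pulls matched cross-modal embedding pairs together and pushes mismatched ones apart. The embeddings, batch size, and temperature are all hypothetical stand-ins; real systems would take the embeddings from pretrained modality encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch of 8 paired audio/text embeddings (16-dim); matched pairs
# are made similar on purpose so the contrastive objective has signal.
audio = rng.normal(size=(8, 16))
text = audio + 0.1 * rng.normal(size=(8, 16))

def info_nce(za, zt, temperature=0.1):
    """Symmetric InfoNCE: diagonal entries of the similarity matrix
    (matched pairs) are the positives; all others are negatives."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zt = zt / np.linalg.norm(zt, axis=1, keepdims=True)
    logits = za @ zt.T / temperature

    def xent(l):
        # cross-entropy with targets on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))

loss = info_nce(audio, text)
```

Training an encoder to minimize this loss aligns the modality embedding spaces, which is one way correlations between modalities and emotional content can be captured without emotion labels.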
The main challenge in multimodal emotion recognition lies in data scarcity: the data available for building and exploring complex models are insufficient, which makes it difficult to create robust models based on deep neural network methods. These issues can be addressed by constructing large-scale multimodal emotion databases and by exploring transfer learning methods based on large models. By transferring knowledge learned from unsupervised tasks or other tasks to emotion recognition, the problem of limited data resources can be alleviated. The use of explicit discrete and dimensional labels to represent ambiguous emotional states is limited by the inherent fuzziness of emotions. Enhancing the interpretability of predictions to improve the reliability of recognition results is also an important direction for future research. The role of multimodal emotion computing in addressing emotional disorders such as depression and anxiety is increasingly prominent. Future research can be conducted in the following three areas. First, the research and construction of multimodal emotional disorder datasets can provide a solid foundation for the automatic recognition of emotional disorders; however, this field still needs to address challenges such as data privacy and ethics, and considerations such as designing targeted interview questions, ensuring patient safety during data collection, and sample augmentation through algorithms remain worth exploring. Second, more effective algorithms should be developed. Emotional disorders fall within the psychological domain, yet they also affect physiological features of patients, such as voice and body movements; this psychological-physiological correlation deserves comprehensive exploration. Therefore, improving the accuracy of algorithms for multimodal emotional disorder recognition is a pressing research issue.
Finally, intelligent psychological intervention systems should be designed and implemented. The following issues can be further studied: effectively simulating the counseling process of a psychologist, promptly receiving user emotional feedback, and generating empathetic conversations.
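The transfer-learning route discussed above, reusing representations learned elsewhere when labeled emotion data are scarce, can be sketched as linear probing on a frozen pretrained encoder. In the toy code below the "encoder" is a stand-in fixed random projection and the labeled set is synthetic; only the small logistic head is trained, which is the part that would see the scarce emotion labels in practice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a frozen pretrained encoder (weights never updated);
# a real system would use a pretrained speech/text/vision model here.
W_enc = np.random.default_rng(0).normal(size=(4, 8))

def frozen_encoder(x):
    return np.tanh(x @ W_enc)

# Tiny synthetic "emotion" set: 40 samples, binary labels from a linear rule.
X = 0.5 * rng.normal(size=(40, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Linear probing: train only a logistic-regression head on frozen features.
H = frozen_encoder(X)
w, b = np.zeros(H.shape[1]), 0.0
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(H @ w + b)))  # predicted probabilities
    g = p - y                               # cross-entropy gradient
    w -= 0.5 * H.T @ g / len(y)
    b -= 0.5 * g.mean()

acc = float(np.mean(((H @ w + b) > 0) == (y > 0.5)))
```

Because only the small head is fit, far fewer labeled examples are needed than for training a deep network end to end, which is exactly the appeal of this route for low-resource emotion and emotional-disorder recognition.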
Ahmed A, Ali N, Aziz S, Abd-Alrazaq A A, Hassan A, Khalifa M, Elhusein B, Ahmed M, Ahmed M A S and Househ M. 2021. A review of mobile chatbot apps for anxiety and depression and their self-care features. Computer Methods and Programs in Biomedicine Update, 1: #100012 [DOI: 10.1016/j.cmpbup.2021.100012]
Alghowinem S, Goecke R, Wagner M, Epps J, Gedeon T, Breakspear M and Parker G. 2013. A comparative study of different classifiers for detecting depression from spontaneous speech//Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE: 8022-8026 [DOI: 10.1109/ICASSP.2013.6639227]
Alhanai T, Ghassemi M and Glass J. 2018. Detecting depression with audio/text sequence modeling of interviews//Interspeech 2018. Hyderabad, India: [s.n.]: 1716-1720 [DOI: 10.21437/Interspeech.2018-2522]
Amos B, Ludwiczuk B and Satyanarayanan M. 2016. OpenFace: a general-purpose face recognition library with mobile applications. CMU School of Computer Science, 6(2): #20
Andersson G and Cuijpers P. 2009. Internet-based and other computerized psychological treatments for adult depression: a meta-analysis. Cognitive Behaviour Therapy, 38(4): 196-205 [DOI: 10.1080/16506070903318960]
Ando A, Masumura R, Takashima A, Suzuki S, Makishima N, Suzuki K, Moriya K, Ashihara T and Sato H. 2022. On the use of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis//Proceedings of 2022 IEEE Spoken Language Technology Workshop (SLT). Doha, Qatar: IEEE: 739-746 [DOI: 10.1109/SLT54892.2023.10022548]
Arroll B, Smith F G, Kerse N, Fishman T and Gunn J. 2005. Effect of the addition of a 'help' question to two screening questions on specificity for diagnosis of depression in general practice: diagnostic validity study. BMJ, 331(7521): #884 [DOI: 10.1136/bmj.38607.464537.7C]
Bakker D, Kazantzis N, Rickwood D and Rickard N. 2016. Mental health smartphone apps: review and evidence-based recommendations for future developments. JMIR Mental Health, 3(1): #4984 [DOI: 10.2196/mental.4984]
Bao H B, Dong L, Wei F R, Wang W H, Yang N, Liu X D, Wang Y, Piao S H, Gao J F, Zhou M and Hon H W. 2020. UniLMv2: pseudo-masked language models for unified language model pre-training//Proceedings of the 37th International Conference on Machine Learning. [s.l.]: JMLR.org: 642-652
Barak A, Hen L, Boniel-Nissim M and Shapira N. 2008. A comprehensive review and a meta-analysis of the effectiveness of internet-based psychotherapeutic interventions. Journal of Technology in Human Services, 26(2/4): 109-160 [DOI: 10.1080/15228830802094429]
Bell C C. 1994. DSM-IV: diagnostic and statistical manual of mental disorders. JAMA, 272(10): 828-829 [DOI: 10.1001/jama.1994.03520100096046]
Bhakta R, Savin-Baden M and Tombs G. 2014. Sharing secrets with robots?//Proceedings of 2014 World Conference on Educational Multimedia, Hypermedia and Telecommunications. Chesapeake, VA, USA: Association for the Advancement of Computing in Education (AACE): 2295-2301
Bickmore T W, Mitchell S E, Jack B W, Paasche-Orlow M K, Pfeifer L M and Odonnell J. 2010. Response to a relational agent by hospital patients with depressive symptoms. Interacting with Computers, 22(4): 289-298 [DOI: 10.1016/j.intcom.2009.12.001]
Busso C, Bulut M, Lee C C, Kazemzadeh A, Mower E, Kim S, Chang J N, Lee S and Narayanan S N. 2008. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4): 335-359 [DOI: 10.1007/s10579-008-9076-6]
Cai H S, Yuan Z Q, Gao Y W, Sun S T, Li N, Tian F Z, Xiao H, Li J X, Yang Z W, Li X W, Zhao Q L, Liu Z Y, Yao Z J, Yang M Q, Peng H, Zhu J, Zhang X W, Gao G P, Zheng F, Li R, Guo Z H, Ma R, Yang J, Zhang L, Hu X P, Li Y M and Hu B. 2022. A multi-modal open dataset for mental-disorder analysis. Scientific Data, 9(1): #178 [DOI: 10.1038/s41597-022-01211-x]
Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, Barham P, Chung H W, Sutton C, Gehrmann S, Schuh P, Shi K S, Tsvyashchenko S, Maynez J, Rao A, Barnes P, Tay Y, Shazeer N, Prabhakaran V, Reif E, Du N, Hutchinson B, Pope R, Bradbury J, Austin J, Isard M, Gur-Ari G, Yin P C, Duke T, Levskaya A, Ghemawat S, Dev S, Michalewski H, Garcia X, Misra V, Robinson K, Fedus L, Zhou D, Ippolito D, Luan D, Lim H, Zoph B, Spiridonov A, Sepassi R, Dohan D, Agrawal S, Omernick M, Dai A M, Pillai T S, Pellat M, Lewkowycz A, Moreira E, Child R, Polozov O, Lee K, Zhou Z W, Wang X Z, Saeta B, Diaz M, Firat O, Catasta M, Wei J, Meier-Hellstern K, Eck D, Dean J, Petrov S and Fiedel N. 2022. PaLM: scaling language modeling with pathways [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/2204.02311.pdf
Cohn J F, Kruez T S, Matthews I, Yang Y, Nguyen M H, Padilla M T, Zhou F and De la Torre F. 2009. Detecting depression from facial actions and vocal prosody//Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops. Amsterdam, the Netherlands: IEEE: 1-7 [DOI: 10.1109/ACII.2009.5349358]
Cummins N, Scherer S, Krajewski J, Schnieder S, Epps J and Quatieri T F. 2015. A review of depression and suicide risk assessment using speech analysis. Speech Communication, 71: 10-49 [DOI: 10.1016/j.specom.2015.03.004]
Degottex G, Kane J, Drugman T, Raitio T and Scherer S. 2014. COVAREP: a collaborative voice analysis repository for speech technologies//Proceedings of 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence, Italy: IEEE: 960-964 [DOI: 10.1109/ICASSP.2014.6853739]
Devlin J, Chang M W, Lee K and Toutanova K. 2019. BERT: pre-training of deep bidirectional Transformers for language understanding [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/1810.04805.pdf
Dhall A, Goecke R, Ghosh S, Joshi J, Hoey J and Gedeon T. 2017. From individual to group-level emotion recognition: EmotiW 5.0//Proceedings of the 19th ACM International Conference on Multimodal Interaction. Glasgow, UK: ACM: 524-528 [DOI: 10.1145/3136755.3143004]
Dhall A, Goecke R, Joshi J, Hoey J and Gedeon T. 2016. EmotiW 2016: video and group-level emotion recognition challenges//Proceedings of the 18th ACM International Conference on Multimodal Interaction. Tokyo, Japan: ACM: 427-432 [DOI: 10.1145/2993148.2997638]
Dhall A, Goecke R, Joshi J, Wagner M and Gedeon T. 2013. Emotion recognition in the wild challenge 2013//Proceedings of the 15th ACM on International Conference on Multimodal Interaction. Sydney, Australia: ACM: 509-516 [DOI: 10.1145/2522848.2531739]
Dhall A, Murthy O V R, Goecke R, Joshi J and Gedeon T. 2015. Video and image based emotion recognition challenges in the wild: EmotiW 2015//Proceedings of 2015 ACM on International Conference on Multimodal Interaction. Seattle, USA: ACM: 423-426 [DOI: 10.1145/2818346.2829994]
Dinkel H, Wu M Y and Yu K. 2019. Text-based depression detection: what triggers an alert [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/1904.05154.pdf
Ekman P. 1999. Basic emotions//Dalgleish T and Power M J, eds. Handbook of Cognition and Emotion. New York, USA: John Wiley and Sons: 45-60 [DOI: 10.1002/0470013494.ch3]
Esuli A and Sebastiani F. 2006. SENTIWORDNET: a publicly available lexical resource for opinion mining//Proceedings of the 5th International Conference on Language Resources and Evaluation. Genoa, Italy: European Language Resources Association (ELRA): 417-422
Eyben F, Wöllmer M and Schuller B. 2009. OpenEAR: introducing the Munich open-source emotion and affect recognition toolkit//Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops. Amsterdam, the Netherlands: IEEE: 1-6 [DOI: 10.1109/ACII.2009.5349350]
Eyben F, Wöllmer M and Schuller B. 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor//Proceedings of the 18th ACM International Conference on Multimedia. Firenze, Italy: ACM: 1459-1462 [DOI: 10.1145/1873951.1874246]
Fang M, Peng S Y, Liang Y J, Hung C C and Liu S H. 2023. A multimodal fusion model with multi-level attention mechanism for depression detection. Biomedical Signal Processing and Control, 82: #104561 [DOI: 10.1016/j.bspc.2022.104561]
Fitzpatrick K K, Darcy A and Vierhile M. 2017. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): a randomized controlled trial. JMIR Mental Health, 4(2): #19 [DOI: 10.2196/mental.7785]
Fournier J C, DeRubeis R J, Hollon S D, Dimidjian S, Amsterdam J D, Shelton R C and Fawcett J. 2010. Antidepressant drug effects and depression severity: a patient-level meta-analysis. JAMA, 303(1): 47-53 [DOI: 10.1001/jama.2009.1943]
Gandhi A, Adhvaryu K, Poria S, Cambria E and Hussain A. 2023. Multimodal sentiment analysis: a systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion, 91: 424-444 [DOI: 10.1016/j.inffus.2022.09.025]
Gardiner P M, McCue K D, Negash L M, Cheng T, White L F, Yinusa-Nyahkoon L, Jack B W and Bickmore T W. 2017. Engaging women with an embodied conversational agent to deliver mindfulness and lifestyle recommendations: a feasibility randomized control trial. Patient Education and Counseling, 100(9): 1720-1729 [DOI: 10.1016/j.pec.2017.04.015]
Ghorbanali A, Sohrabi M K and Yaghmaee F. 2022. Ensemble transfer learning-based multimodal sentiment analysis using weighted convolutional neural networks. Information Processing and Management, 59(3): #102929 [DOI: 10.1016/j.ipm.2022.102929]
Gilbody S, Richards D, Brealey S and Hewitt C. 2007. Screening for depression in medical settings with the patient health questionnaire (PHQ): a diagnostic meta-analysis. Journal of General Internal Medicine, 22(11): 1596-1602 [DOI: 10.1007/s11606-007-0333-y]
Gong Y and Poellabauer C. 2017. Topic modeling based multi-modal depression detection//Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. Mountain View, USA: ACM: 69-76 [DOI: 10.1145/3133944.3133945]
Gratch J, Artstein R, Lucas G, Stratou G, Scherer S, Nazarian A, Wood R, Boberg J, DeVault D, Marsella S, Traum D, Rizzo S and Morency L P. 2014. The distress analysis interview corpus of human and computer interviews//Proceedings of the 9th International Conference on Language Resources and Evaluation. Reykjavik, Iceland: European Language Resources Association (ELRA): 3123-3128
Guo W T, Yang H W, Liu Z Y, Xu Y P and Hu B. 2021. Deep neural networks for depression recognition based on 2D and 3D facial expressions under emotional stimulus tasks. Frontiers in Neuroscience, 15: #609760 [DOI: 10.3389/fnins.2021.609760]
Guo Y R, Liu J L, Wang L, Qin W, Hao S J and Hong R C. 2024. A prompt-based topic-modeling method for depression detection on low-resource data. IEEE Transactions on Computational Social Systems, 11(1): 1430-1439 [DOI: 10.1109/TCSS.2023.3260080]
Han W, Chen H and Poria S. 2021. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis//Proceedings of 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics: 9180-9192 [DOI: 10.18653/v1/2021.emnlp-main.723]
Haque A, Guo M, Miner A S and Li F F. 2018. Measuring depression symptom severity from spoken language and 3D facial expressions [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/1811.08592.pdf
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
He R D, Lee W S, Ng H T and Dahlmeier D. 2018. Adaptive semi-supervised learning for cross-domain sentiment classification//Proceedings of 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics: 3467-3476 [DOI: 10.18653/v1/D18-1383]
Hochreiter S and Schmidhuber J. 1997. Long short-term memory. Neural Computation, 9(8): 1735-1780 [DOI: 10.1162/neco.1997.9.8.1735]
Hoffmann J, Borgeaud S, Mensch A, Buchatskaya E, Cai T, Rutherford E, de Las Casas D, Hendricks L A, Welbl J, Clark A, Hennigan T, Noland E, Millican K, van den Driessche G, Damoc B, Guy A, Osindero S, Simonyan K, Elsen E, Rae J W, Vinyals O and Sifre L. 2022. Training compute-optimal large language models [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/2203.15556.pdf
Hu G M, Lin T E, Zhao Y, Lu G M, Wu Y C and Li Y B. 2022. UniMSE: towards unified multimodal sentiment analysis and emotion recognition [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/2211.11256.pdf
Hu Y, Hou S J, Yang H M, Huang H and He L. 2023. A joint network based on interactive attention for speech emotion recognition//Proceedings of 2023 IEEE International Conference on Multimedia and Expo (ICME). Brisbane, Australia: IEEE: 1715-1720 [DOI: 10.1109/ICME55011.2023.00295]
Huang Z S, Hu Q, Gu J G, Yang J, Feng Y and Wang G. 2019. Web-based intelligent agents for suicide monitoring and early warning. China Digital Medicine, 14(3): 2-6 [DOI: 10.3969/j.issn.1673-7571.2019.03.001]
Imbir K K. 2020. Psychoevolutionary theory of emotion (Plutchik)//Zeigler-Hill V and Shackelford T K, eds. Encyclopedia of Personality and Individual Differences. Cham: Springer: 4137-4144 [DOI: 10.1007/978-3-319-24612-3_547]
Inkster B, Sarda S and Subramanian V. 2018. An empathy-driven, conversational artificial intelligence agent (Wysa) for digital mental well-being: real-world data evaluation mixed-methods study. JMIR mHealth and uHealth, 6(11): #12106 [DOI: 10.2196/12106]
Joshi J, Goecke R, Alghowinem S, Dhall A, Wagner M, Epps J, Parker G and Breakspear M. 2013. Multimodal assistive technologies for depression diagnosis and monitoring. Journal on Multimodal User Interfaces, 7(3): 217-228 [DOI: 10.1007/s12193-013-0123-2]
Joulin A, Grave E, Bojanowski P and Mikolov T. 2016. Bag of tricks for efficient text classification [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/1607.01759.pdf
Kroenke K, Spitzer R L and Williams J B. 2001. The PHQ-9: validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9): 606-613 [DOI: 10.1046/j.1525-1497.2001.016009606.x]
Ku L W and Chen H H. 2007. Mining opinions from the web: beyond relevance retrieval. Journal of the American Society for Information Science and Technology, 58(12): 1838-1850 [DOI: 10.1002/asi.20630]
Lai S N, Hu X F, Xu H X, Ren Z X and Liu Z. 2023. Multimodal sentiment analysis: a survey [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/2305.07611.pdf
Lam G, Huang D Y and Lin W S. 2019. Context-aware deep learning for multi-modal depression detection//Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE: 3946-3950 [DOI: 10.1109/ICASSP.2019.8683027]
Lei S L, Dong G T, Wang X P, Wang K H and Wang S R. 2023. InstructERC: reforming emotion recognition in conversation with a retrieval multi-task LLMs framework [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/2309.11911.pdf
Li Y, Tao J H, Schuller B, Shan S G, Jiang D M and Jia J. 2016. MEC 2016: the multimodal emotion recognition challenge of CCPR 2016//Proceedings of the 7th Chinese Conference on Pattern Recognition. Chengdu, China: Springer: 667-678 [DOI: 10.1007/978-981-10-3005-5_55]
Lian Z, Liu B and Tao J H. 2021. CTNet: conversational Transformer network for emotion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 985-1000 [DOI: 10.1109/TASLP.2021.3049898]
Lian Z, Liu B and Tao J H. 2023a. SMIN: semi-supervised multi-modal interaction network for conversational emotion recognition. IEEE Transactions on Affective Computing, 14(3): 2415-2429 [DOI: 10.1109/TAFFC.2022.3141237]
Lian Z, Sun H Y, Sun L C, Chen K, Xu M Y, Wang K X, Xu K, He Y, Li Y, Zhao J M, Liu Y, Liu B, Yi J Y, Wang M, Cambria E, Zhao G Y, Schuller B W and Tao J H. 2023b. MER 2023: multi-label learning, modality robustness, and semi-supervised learning [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/2304.08981.pdf
Lian Z, Sun L C, Xu M Y, Sun H Y, Xu K, Wen Z F, Chen S, Liu B and Tao J H. 2023c. Explainable multimodal emotion reasoning [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/2306.15401.pdf
Lin L, Chen X R, Shen Y and Zhang L. 2020. Towards automatic depression detection: a BiLSTM/1D CNN-based model. Applied Sciences, 10(23): #8701 [DOI: 10.3390/app10238701]
Littlewort G, Whitehill J, Wu T F, Fasel I, Frank M, Movellan J and Bartlett M. 2011. The computer expression recognition toolbox (CERT)//Proceedings of 2011 IEEE International Conference on Automatic Face and Gesture Recognition (FG). Santa Barbara, USA: IEEE: 298-305 [DOI: 10.1109/FG.2011.5771414]
Liu H T, Li C Y, Wu Q Y and Lee Y J. 2023. Visual instruction tuning [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/2304.08485.pdf
Liu P F, Qiu X P and Huang X J. 2016. Deep multi-task learning with shared memory [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/1609.07222.pdf
Liu T T, Liu Z, Chai Y J, Wang J and Wang Y Y. 2021. Agent affective computing in human-computer interaction. Journal of Image and Graphics, 26(12): 2767-2777 [DOI: 10.11834/jig.200498]
Ly K H, Ly A M and Andersson G. 2017. A fully automated conversational agent for promoting mental well-being: a pilot RCT using mixed methods. Internet Interventions, 10: 39-46 [DOI: 10.1016/j.invent.2017.10.002]
Ma X C, Yang H Y, Chen Q, Huang D and Wang Y H. 2016. DepAudioNet: an efficient deep model for audio based depression classification//Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. Amsterdam, the Netherlands: ACM: 35-42 [DOI: 10.1145/2988257.2988267]
McFee B, Raffel C, Liang D, Ellis D, McVicar M, Battenberg E and Nieto O. 2015. librosa: audio and music signal analysis in Python//Proceedings of the 14th Python in Science Conference: 18-25 [DOI: 10.25080/majora-7b98e3ed-003]
Mehrabian A. 1996. Pleasure-arousal-dominance: a general framework for describing and measuring individual differences in temperament. Current Psychology, 14(4): 261-292 [DOI: 10.1007/BF02686918]
Mendels G, Levitan S, Lee K Z and Hirschberg J. 2017. Hybrid acoustic-lexical deep learning approach for deception detection//Interspeech 2017. Stockholm, Sweden: ISCA: 1472-1476 [DOI: 10.21437/Interspeech.2017-1723]
Mikolov T, Chen K, Corrado G and Dean J. 2013. Efficient estimation of word representations in vector space [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/1301.3781.pdf
Minsky M. 1988. The Society of Mind. New York, USA: Simon and Schuster
Mohammad S M and Turney P D. 2013. NRC Emotion Lexicon. National Research Council of Canada [DOI: 10.4224/21270984]
Morales M R, Scherer S and Levitan R. 2017. OpenMM: an open-source multimodal feature extraction tool//Interspeech 2017. Stockholm, Sweden: ISCA: 3354-3358 [DOI: 10.21437/Interspeech.2017-1382]
Pasikowska A, Zaraki A and Lazzeri N. 2013. A dialogue with a virtual imaginary interlocutor as a form of a psychological support for well-being//Proceedings of the International Conference on Multimedia, Interaction, Design and Innovation. Warsaw, Poland: ACM: 1-15 [DOI: 10.1145/2500342.2500359]
Pennington J, Socher R and Manning C. 2014. GloVe: global vectors for word representation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics: 1532-1543 [DOI: 10.3115/v1/D14-1162]
Pham H, Liang P P, Manzini T, Morency L P and Póczos B. 2019. Found in translation: learning robust joint representations by cyclic translations between modalities//Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Honolulu, USA: AAAI: 6892-6899 [DOI: 10.1609/aaai.v33i01.33016892]
Poria S, Cambria E and Gelbukh A. 2015. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis//Proceedings of 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics: 2539-2544 [DOI: 10.18653/v1/D15-1303]
Poria S, Hazarika D, Majumder N, Naik G, Cambria E and Mihalcea R. 2019. MELD: a multimodal multi-party dataset for emotion recognition in conversations//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics: 527-536 [DOI: 10.18653/v1/P19-1050]
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G and Sutskever I. 2021. Learning transferable visual models from natural language supervision//Proceedings of the 38th International Conference on Machine Learning. PMLR, 139: 8748-8763
Ringeval F, Schuller B, Valstar M, Cowie R, Kaya H, Schmitt M, Amiriparian S, Cummins N, Lalanne D, Michaud A, Ciftçi E, Güleç H, Salah A A and Pantic M. 2018. AVEC 2018 workshop and challenge: bipolar disorder and cross-cultural affect recognition//Proceedings of 2018 on Audio/Visual Emotion Challenge and Workshop. Seoul, Korea (South): ACM: 3-13 [DOI: 10.1145/3266302.3266316]
Rizzo A A, Lange B, Buckwalter J G, Forbell E, Kim J, Sagae K, Williams J, Rothbaum B O, Difede J, Reger G, Parsons T and Kenny P. 2011. An intelligent virtual human system for providing healthcare information and support. Studies in Health Technology and Informatics, 163: 503-509
Ruggiero K J, Ben K D, Scotti J R and Rabalais A E. 2003. Psychometric properties of the PTSD checklist-civilian version. Journal of Traumatic Stress, 16(5): 495-502 [DOI: 10.1023/A:1025714729117]
Rush A J, Carmody T J, Ibrahim H M, Trivedi M H, Biggs M M, Shores-Wilson K, Crismon M L, Toprac M G and Kashner T M. 2006. Comparison of self-report and clinician ratings on two inventories of depressive symptomatology. Psychiatric Services, 57(6): 829-837 [DOI: 10.1176/ps.2006.57.6.829]
Scherer S, Stratou G, Gratch J and Morency L P. 2013. Investigating voice quality as a speaker-independent indicator of depression and PTSD//Interspeech 2013. Lyon, France: [s.n.]: 847-851 [DOI: 10.21437/Interspeech.2013-240]
Scherer S, Stratou G, Lucas G, Mahmoud M, Boberg J, Gratch J, Rizzo A and Morency L P. 2014. Automatic audiovisual behavior descriptors for psychological disorder analysis. Image and Vision Computing, 32(10): 648-658 [DOI: 10.1016/j.imavis.2014.06.001]
Schroff F, Kalenichenko D and Philbin J. 2015. FaceNet: a unified embedding for face recognition and clustering//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 815-823 [DOI: 10.1109/CVPR.2015.7298682]
Schuller B, Valstar M, Eyben F, McKeown G, Cowie R and Pantic M. 2011. AVEC 2011: the first international audio/visual emotion challenge//Proceedings of the 4th International Conference on Affective Computing and Intelligent Interaction. Memphis, USA: Springer: 415-424 [DOI: 10.1007/978-3-642-24571-8_53]
Sebe N , Cohen I , Gevers T and Huang T S . 2005 . Multimodal approaches for emotion recognition: a survey // Proceedings Volume 5670 , Internet Imaging VI. San Jose, USA : SPIE: 56 - 67 [ DOI: 10.1117/12.600746 http://dx.doi.org/10.1117/12.600746 ]
Shaver P , Schwartz J , Kirson D and O’Connor C . 1987 . Emotion knowledge: further exploration of a prototype approach . Journal of Personality and Social Psychology , 52 ( 6 ): 1061 - 1086 [ DOI: 10.1037//0022-3514.52.6.1061 http://dx.doi.org/10.1037//0022-3514.52.6.1061 ]
Shen Y , Yang H Y and Lin L . 2022 . Automatic depression detection: an emotional audio-textual corpus and a GRU/BiLSTM-based model // Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . Singapore, Singapore : IEEE: 6247 - 6251 [ DOI: 10.1109/ICASSP43922.2022.9746569 http://dx.doi.org/10.1109/ICASSP43922.2022.9746569 ]
Shott S . 1979 . Emotion and social life: a symbolic interactionist analysis . American Journal of Sociology , 84 ( 6 ): 1317 - 1334 [ DOI: 10.1086/226936 http://dx.doi.org/10.1086/226936 ]
Soleymani M , Garcia D , Jou B , Schuller B , Chang S F and Pantic M . 2017 . A survey of multimodal sentiment analysis . Image and Vision Computing , 65 : 3 - 14 [ DOI: 10.1016/j.imavis.2017.08.003 http://dx.doi.org/10.1016/j.imavis.2017.08.003 ]
Spek V , Cuijpers P , Nyklícek I , Riper H , Keyzer J and Pop V . 2007 . Internet-based cognitive behaviour therapy for symptoms of depression and anxiety: a meta-analysis . Psychological Medicine , 37 ( 3 ): 319 - 328 [ DOI: 10.1017/S0033291706008944 http://dx.doi.org/10.1017/S0033291706008944 ]
Su W J , Zhu X Z , Cao Y , Li B , Lu L W , Wei F R and Dai J F . 2020 . VL-BERT: pre-training of generic visual-linguistic representations [EB/OL]. [ 2023-12-23 ]. https://arxiv.org/pdf/1908.08530.pdf https://arxiv.org/pdf/1908.08530.pdf
Su Y X , Lan T , Li H Y , Xu J L , Wang Y and Cai D . 2023 . PandaGPT: one model to instruction-follow them all [EB/OL]. [ 2023-12-23 ]. https://arxiv.org/pdf/2305.16355.pdf https://arxiv.org/pdf/2305.16355.pdf
Sun B , Zhang Y H , He J , Yu L J , Xu Q H , Li D L and Wang Z Y . 2017 . A random forest regression method with selected-text feature for depression assessment // Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge . Mountain View, USA : ACM: 61 - 68 [ DOI: 10.1145/3133944.3133951 http://dx.doi.org/10.1145/3133944.3133951 ]
Sun S T , Chen H Y , Shao X X , Liu L L , Li X W and Hu B . 2020 . EEG based depression recognition by combining functional brain network and traditional biomarkers // Proceedings of 2020 IEEE International Conference on Bioinformatics and Biomedicine . Seoul, Korea (South) : IEEE: 2074 - 2081 [ DOI: 10.1109/BIBM49941.2020.9313270 http://dx.doi.org/10.1109/BIBM49941.2020.9313270 ]
Tomkins S S . 1962 . Affect Imagery Consciousness: Volume I: The Positive Affects . New York, USA : Springer
Torous J , Chan S R , Tan S Y M , Behrens J , Mathew I , Conrad E J , Hinton L , Yellowlees P and Keshavan M . 2014 . Patient smartphone ownership and interest in mobile apps to monitor symptoms of mental health conditions: a survey in four geographically distinct psychiatric clinics . JMIR Mental Health , 1 ( 1 ): # 5 [ DOI: 10.2196/mental.4004 http://dx.doi.org/10.2196/mental.4004 ]
Valstar M , Schuller B , Smith K , Eyben F , Jiang B H , Bilakhia S , Schnieder S , Cowie R and Pantic M . 2013 . AVEC 2013: the continuous audio/visual emotion and depression recognition challenge // Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge . Barcelona, Spain : ACM: 3 - 10 [ DOI: 10.1145/2512530.2512533 http://dx.doi.org/10.1145/2512530.2512533 ]
Wang D , Guo X T , Tian Y M , Liu J H , He L H and Luo X M . 2023 . TETFN: a text enhanced Transformer fusion network for multimodal sentiment analysis . Pattern Recognition , 136 : # 109259 [ DOI: 10.1016/j.patcog.2022.109259 http://dx.doi.org/10.1016/j.patcog.2022.109259 ]
Weizenbaum J . 1966 . ELIZA — a computer program for the study of natural language communication between man and machine . Communications of the ACM , 9 ( 1 ): 36 - 45 [ DOI: 10.1145/365153.365168 http://dx.doi.org/10.1145/365153.365168 ]
Williamson J R , Godoy E , Cha M , Schwarzentruber A , Khorrami P , Gwon Y , Kung H T , Dagli C and Quatieri T F . 2016 . Detecting depression using vocal, facial and semantic communication cues // Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge . Amsterdam, the Netherlands : ACM: 11 - 18 [ DOI: 10.1145/2988257.2988263 http://dx.doi.org/10.1145/2988257.2988263 ]
World Health Organization . 2020a . Depression 2020 a [EB/OL]. [ 2023-12-23 ]. https://www.who.int/health-topics/depression https://www.who.int/health-topics/depression
World Health Organization . 2020b . Mental health in China 2020 b [EB/OL]. [ 2023-12-23 ]. https://www.who.int/china/health-topics/mental-health https://www.who.int/china/health-topics/mental-health
Wu S X , Dai D M , Qin Z W , Liu T Y , Lin B H , Cao Y B and Sui Z F . 2023 . Denoising bottleneck with mutual information maximization for video multimodal fusion [EB/OL]. [ 2023-12-23 ]. https://arxiv.org/pdf/2305.14652.pdf https://arxiv.org/pdf/2305.14652.pdf
Wu Y , Zhao Y Y , Yang H , Chen S , Qin B , Cao X H and Zhao W T . 2022 . Sentiment word aware multimodal refinement for multimodal sentiment analysis with ASR errors [EB/OL]. [ 2023-12-23 ]. https://arxiv.org/pdf/2203.00257.pdf https://arxiv.org/pdf/2203.00257.pdf
Xiao J Q and Luo X X . 2022 . A survey of sentiment analysis based on multi-modal information // Proceedings of 2022 IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC) . Dalian, China : IEEE: 712 - 715 [ DOI: 10.1109/IPEC54454.2022.9777333 http://dx.doi.org/10.1109/IPEC54454.2022.9777333 ]
Xu L H, Lin H F, Pan Y, Ren H and Chen J M. 2008. Constructing the affective lexicon ontology. Journal of the China Society for Scientific and Technical Information, 27(2): 180-185 [DOI: 10.3969/j.issn.1000-0135.2008.02.004]
Yang B, Wu L J, Zhu J H, Shao B, Lin X L and Liu T Y. 2022. Multimodal sentiment analysis with two-phase multi-task learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30: 2015-2024 [DOI: 10.1109/TASLP.2022.3178204]
Yang L, Jiang D M, He L, Pei E C, Oveneke M C and Sahli H. 2016. Decision tree based depression classification from audio video and language information // Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. Amsterdam, the Netherlands: ACM: 89-96 [DOI: 10.1145/2988257.2988269]
Yang L, Jiang D M, Xia X H, Pei E C, Oveneke M C and Sahli H. 2017. Multimodal measurement of depression using deep learning models // Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. Mountain View, USA: ACM: 53-59 [DOI: 10.1145/3133944.3133948]
Yang Y, Fairbairn C and Cohn J F. 2013. Detecting depression severity from vocal prosody. IEEE Transactions on Affective Computing, 4(2): 142-150 [DOI: 10.1109/T-AFFC.2012.38]
Yap M H, See J, Hong X P and Wang S J. 2018. Facial micro-expressions grand challenge 2018 summary // Proceedings of the 13th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2018). Xi'an, China: IEEE: 675-678 [DOI: 10.1109/FG.2018.00106]
Ye J Y, Yu Y H, Wang Q X, Li W T, Liang H, Zheng Y S and Fu G. 2021. Multi-modal depression detection based on emotional audio and evaluation text. Journal of Affective Disorders, 295: 904-913 [DOI: 10.1016/j.jad.2021.08.090]
Yi G F, Yang Y G, Pan Y, Cao Y H, Yao J X, Lv X, Fan C H, Lv Z, Tao J H, Liang S and Lu H. 2023. Exploring the power of cross-contextual large language model in mimic emotion prediction // Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions, Humour and Personalisation. Ottawa, Canada: ACM: 19-26 [DOI: 10.1145/3606039.3613109]
Yin S, Liang C, Ding H Y and Wang S F. 2019. A multi-modal hierarchical recurrent neural network for depression detection // Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop. Nice, France: ACM: 65-71 [DOI: 10.1145/3347320.3357696]
Yu H L, Gui L K, Madaio M, Ogan A, Cassell J and Morency L P. 2017. Temporally selective attention model for social and affective state recognition in multimedia content // Proceedings of the 25th ACM International Conference on Multimedia. Mountain View, USA: ACM: 1743-1751 [DOI: 10.1145/3123266.3123413]
Yu W M, Xu H, Meng F Y, Zhu Y L, Ma Y X, Wu J L, Zou J Y and Yang K C. 2020. CH-SIMS: a Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality // Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics: 3718-3727 [DOI: 10.18653/v1/2020.acl-main.343]
Yu W M, Xu H, Yuan Z Q and Wu J L. 2021. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis // Proceedings of the 35th AAAI Conference on Artificial Intelligence. [s.l.]: AAAI: 10790-10797 [DOI: 10.1609/aaai.v35i12.17289]
Zadeh A, Chen M H, Poria S, Cambria E and Morency L P. 2017a. Tensor fusion network for multimodal sentiment analysis [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/1707.07250.pdf
Zadeh A, Chen M H, Poria S, Cambria E and Morency L P. 2017b. Tensor fusion network for multimodal sentiment analysis // Proceedings of 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics: 1103-1114 [DOI: 10.18653/v1/D17-1115]
Zadeh A A B, Liang P P, Poria S, Cambria E and Morency L P. 2018a. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph // Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics: 2236-2246 [DOI: 10.18653/v1/P18-1208]
Zhang F, Li X C, Lim C P, Hua Q, Dong C R and Zhai J H. 2022. Deep emotional arousal network for multimodal sentiment analysis and emotion recognition. Information Fusion, 88: 296-304 [DOI: 10.1016/j.inffus.2022.07.006]
Zhang J, Xue S Y, Wang X Y and Liu J. 2023. Survey of multimodal sentiment analysis based on deep learning // Proceedings of the 9th IEEE International Conference on Cloud Computing and Intelligent Systems (CCIS). Dali, China: IEEE: 446-450 [DOI: 10.1109/CCIS59572.2023.10263012]
Zhang P Y, Wu M Y, Dinkel H and Yu K. 2021. DEPA: self-supervised audio embedding for depression detection // Proceedings of the 29th ACM International Conference on Multimedia. Chengdu, China: ACM: 135-143 [DOI: 10.1145/3474085.3479236]
Zhao J M, Zhang T G, Hu J W, Liu Y C, Jin Q, Wang X C and Li H Z. 2022. M3ED: multi-modal multi-scene multi-label emotional dialogue database // Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics: 5699-5710 [DOI: 10.18653/v1/2022.acl-long.391]
Zhu D Y, Chen J, Shen X Q, Li X and Elhoseiny M. 2023a. MiniGPT-4: enhancing vision-language understanding with advanced large language models [EB/OL]. [2023-12-23]. https://arxiv.org/pdf/2304.10592.pdf
Zhu L N, Zhu Z C, Zhang C W, Xu Y F and Kong X J. 2023b. Multimodal sentiment analysis based on fusion methods: a survey. Information Fusion, 95: 306-325 [DOI: 10.1016/j.inffus.2023.02.028]
Zou B C, Han J L, Wang Y X, Liu R, Zhao S H, Feng L, Lyu X W and Ma H M. 2023. Semi-structural interview-based Chinese multimodal depression corpus towards automatic preliminary screening of depressive disorders. IEEE Transactions on Affective Computing, 14(4): 2823-2838 [DOI: 10.1109/TAFFC.2022.3181210]