多模态情感识别与理解发展现状及趋势
Development of multimodal sentiment recognition and understanding
2024, Vol. 29, No. 6: 1607-1627
Print publication date: 2024-06-16
DOI: 10.11834/jig.240017
陶建华, 范存航, 连政, 吕钊, 沈莹, 梁山. 2024. 多模态情感识别与理解发展现状及趋势. 中国图象图形学报, 29(06):1607-1627
Tao Jianhua, Fan Cunhang, Lian Zheng, Lyu Zhao, Shen Ying, Liang Shan. 2024. Development of multimodal sentiment recognition and understanding. Journal of Image and Graphics, 29(06):1607-1627
情感计算是人工智能领域的一个重要分支，在交互、教育、安全和金融等众多领域应用广泛。单纯依靠语音、视频等单一模态的情感识别并不符合人类对情感的感知模式，在受到干扰的情况下识别准确率会迅速下降。为了充分挖掘不同模态数据的互补性，多模态融合的情感识别研究正日益受到研究人员的广泛重视。本文分别从多模态情感识别概述、多模态情感识别与理解、抑郁症情感障碍检测及干预3个维度介绍多模态情感计算的研究现状。本文认为，具备可扩展性的情感特征设计和基于大模型迁移学习的识别方法将是未来的发展方向，多模态情感计算在解决抑郁、焦虑等情感障碍方面的作用也将日益凸显。
Affective computing is an important branch of artificial intelligence (AI). It aims to build computational systems that can automatically perceive, recognize, understand, and respond to human emotions, and it lies at the intersection of computer science, neuroscience, psychology, and social science. Deep emotional understanding and interaction enable computers to better understand and respond to human emotional needs and to provide personalized interaction and feedback according to the user's emotional state, which enhances the human-computer interaction experience. Affective computing has wide applications in areas such as intelligent assistants, virtual reality, and smart healthcare. Relying solely on single-modal information, such as the speech signal or video, does not match the way humans perceive emotion, and recognition accuracy drops rapidly in the presence of interference. Multimodal emotion understanding and interaction technologies therefore aim to fully model the multidimensional information carried by audio, video, and physiological signals to achieve more accurate emotion understanding. This technology is a fundamental and important prerequisite for natural, human-like, and personalized human-computer interaction and holds significant value for the era of intelligence and digitalization. Multimodal fusion for sentiment recognition is consequently receiving increasing attention from researchers as a way to fully exploit the complementarity of different modalities. This study reviews the current research status of multimodal affective computing from three dimensions: an overview of multimodal sentiment recognition, multimodal sentiment recognition and understanding, and the detection of and intervention for emotional disorders such as depression. The overview of emotion recognition covers academic definitions, mainstream datasets, and international competitions. In recent years, large language models (LLMs) have demonstrated excellent modeling capabilities and, with their outstanding language understanding and reasoning abilities, have achieved great success in natural language processing. LLMs have attracted widespread attention because they can handle a variety of complex tasks from prompts with few-shot or even zero-shot learning. Through self-supervised or contrastive learning, they can also learn more expressive multimodal representations that capture the correlations between different modalities and emotional information. Multimodal sentiment recognition and understanding are therefore discussed in terms of emotion feature extraction, multimodal fusion, and the representations and models used for sentiment recognition in the era of large pre-trained models.
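As a minimal, illustrative sketch of the feature-level fusion idea referred to above — not the architecture of any system surveyed in this paper — the following PyTorch snippet projects utterance-level audio, visual, and text feature vectors to a shared size, concatenates them, and classifies the result; all dimensions, module names, and the six-class label set are assumptions made purely for the example.

import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    # Project per-modality utterance features to a shared size, concatenate, and classify.
    def __init__(self, dim_audio=512, dim_visual=256, dim_text=768,
                 dim_shared=128, num_emotions=6):
        super().__init__()
        self.proj_a = nn.Linear(dim_audio, dim_shared)
        self.proj_v = nn.Linear(dim_visual, dim_shared)
        self.proj_t = nn.Linear(dim_text, dim_shared)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(3 * dim_shared, num_emotions),
        )

    def forward(self, audio, visual, text):
        # each argument is a (batch, dim_modality) utterance-level feature vector
        fused = torch.cat(
            [self.proj_a(audio), self.proj_v(visual), self.proj_t(text)], dim=-1)
        return self.classifier(fused)

# toy usage with random tensors standing in for real feature-extractor outputs
model = LateFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 6])

Concatenation after projection is only one of the fusion strategies discussed in the literature; tensor fusion, attention-based fusion, and hierarchical approaches replace the simple torch.cat step with richer cross-modal interaction models.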
With the rapid development of society, people face increasing pressure, which can lead to depression, anxiety, and other negative emotions, and those who remain in a prolonged state of depression or anxiety are more likely to develop mental illness. Depression is a common and serious condition whose symptoms include low mood, poor sleep quality, loss of appetite, fatigue, and difficulty concentrating; it harms individuals and families and also causes significant economic losses to society. The discussion of emotional disorder detection therefore starts from specific applications and takes depression as the most common emotional disorder, analyzing its latest developments and trends from the perspectives of assessment and intervention. In addition, this study provides a detailed comparison of the domestic and international research status of affective computing and offers prospects for future development trends. We believe that scalable emotion feature design and recognition methods based on transfer learning from large pre-trained models will be the main directions of future development. The principal challenge in multimodal emotion recognition is data scarcity: the data available for building and exploring complex models are insufficient, which makes it difficult to train robust models based on deep neural networks. This issue can be addressed by constructing large-scale multimodal emotion databases and by exploring transfer learning methods based on large models; transferring knowledge learned from unsupervised or related tasks to emotion recognition alleviates the shortage of labeled data. Because emotions are inherently fuzzy, representing ambiguous emotional states with explicit discrete or dimensional labels has limitations, and enhancing the interpretability of predictions to improve the reliability of recognition results is another important direction for future research. The role of multimodal affective computing in addressing emotional disorders such as depression and anxiety is becoming increasingly prominent, and future research can proceed along three lines. First, multimodal emotional disorder datasets should be studied and constructed to provide a solid foundation for the automatic recognition of emotional disorders; this work still has to address challenges such as data privacy and ethics, and issues such as designing targeted interview questions, ensuring patient safety during data collection, and augmenting samples algorithmically remain worth exploring. Second, more effective algorithms should be developed. Emotional disorders are psychological, yet they also affect physiological features of patients such as the voice and body movements, and this psychological-physiological correlation deserves comprehensive exploration; improving the accuracy of multimodal emotional disorder recognition algorithms is therefore a pressing research issue. Finally, intelligent psychological intervention systems should be designed and implemented, and questions such as how to effectively simulate a psychologist's counseling process, promptly capture the user's emotional feedback, and generate empathetic conversations deserve further study.
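The paragraph above argues that transfer learning from large pre-trained models can alleviate the scarcity of labeled emotion data. A minimal sketch of that idea, assuming a Hugging Face text encoder (the checkpoint name, the frozen-encoder strategy, and the six emotion classes are illustrative choices rather than the method proposed here), could look as follows:

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TransferEmotionClassifier(nn.Module):
    # Freeze a pretrained encoder and train only a lightweight emotion head.
    def __init__(self, encoder_name="bert-base-chinese", num_emotions=6):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        for p in self.encoder.parameters():      # keep the pretrained knowledge fixed
            p.requires_grad = False
        self.head = nn.Linear(self.encoder.config.hidden_size, num_emotions)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]        # [CLS] vector as the utterance summary
        return self.head(cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = TransferEmotionClassifier()
batch = tokenizer(["今天心情很好。", "I feel exhausted."], padding=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])  # shape (2, 6)

Because only the linear head is updated during training, a usable classifier can be fitted with far fewer labeled utterances than end-to-end training would require; the same pattern applies to pretrained audio or visual encoders.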
情感识别；多模态融合；人机交互；抑郁状态评估；情感障碍干预；认知行为疗法
sentiment recognition; multimodal fusion; human-computer interaction; depression detection; emotion disorder intervention; cognitive behavior therapy