Research progress on speech deepfake and its detection techniques

Xu Yuxiong; Li Bin; Tan Shunquan; Huang Jiwu

doi:10.11834/jig.230476

Review


2024, Vol. 29, No. 8, pp. 2236-2268
Received: 2023-07-10
Revised: 2023-11-10
Print publication date: 2024-08-16

Xu Yuxiong, Li Bin, Tan Shunquan, Huang Jiwu. 2024. Research progress on speech deepfake and its detection techniques. Journal of Image and Graphics, 29(08): 2236-2268. DOI: 10.11834/jig.230476.

Summary

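Speech deepfake technology uses deep learning methods to synthesize or generate speech. The rapid iteration and optimization of AI-generated content technologies have markedly improved the naturalness, fidelity, and diversity of forged speech, and have at the same time posed major challenges for speech deepfake detection. This paper comprehensively reviews research progress on speech deepfakes and their detection. First, it introduces forgery techniques represented by speech synthesis (SS) and voice conversion (VC). It then introduces the datasets and evaluation metrics commonly used in speech deepfake detection. On this basis, existing detection techniques are categorized and analyzed in depth along the processing pipeline: data augmentation, feature extraction and optimization, and learning mechanisms. Specifically, from the perspective of data augmentation, the paper analyzes how noise addition, mask augmentation, channel augmentation, and compression augmentation affect detection performance; from the perspective of feature extraction and optimization, it compares the strengths and weaknesses of handcrafted feature-based, hybrid feature-based, end-to-end, and feature fusion-based detection; and from the perspective of learning mechanisms, it discusses training approaches including self-supervised learning, adversarial training, and multi-task learning. Finally, the open challenges in speech deepfake detection are summarized and future research directions are outlined. The datasets and code compiled in this paper are available at

https://github.com/media-sec-lab/Audio-Deepfake-Detection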

Abstract

Speech deepfake technology, which uses deep learning methods to synthesize or generate speech, has become a critical research hotspot in multimedia information security. The rapid iteration and optimization of artificial intelligence-generated content technologies have substantially advanced speech deepfake techniques, improving the naturalness, fidelity, and diversity of synthesized speech. These advances also pose great challenges for speech deepfake detection. To address these challenges, this study comprehensively reviews recent research progress on speech deepfake generation and its detection. Based on an extensive literature survey, this study first introduces the research background of speech forgery and its detection and compares previously published reviews in this field. Second, it gives a concise overview of speech deepfake generation, in particular speech synthesis (SS) and voice conversion (VC). SS, commonly known as text-to-speech (TTS), analyzes input text and generates speech that matches it by applying linguistic rules to the text description. The deep models used in TTS include sequence-to-sequence models, flow models, generative adversarial network models, variational auto-encoder models, and diffusion models. VC modifies acoustic characteristics such as emotion, accent, pronunciation, and speaker identity to produce speech that sounds like natural human speech. Depending on the number of target speakers, VC algorithms can be categorized as single-target, multiple-target, and arbitrary-target conversion. Third, this study briefly introduces the datasets commonly used in speech deepfake detection, with access links to the open-source ones, and the two commonly used evaluation metrics: the equal error rate and the tandem detection cost function. It then categorizes and analyzes the existing speech deepfake detection techniques in detail, comparing their pros and cons in depth with a primary focus on data processing, feature extraction and optimization, and learning mechanisms.
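
As a rough illustration of the equal error rate mentioned above, the following Python sketch finds the operating point at which the false acceptance rate (spoofed speech accepted) and the false rejection rate (bona fide speech rejected) are closest. The function name `compute_eer`, the convention that a higher score means "more likely bona fide", and the simple threshold sweep are assumptions made for this example; they are not taken from the paper, and evaluation toolkits may interpolate between thresholds differently.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a detector whose scores are assumed to be
    higher for bona fide speech than for spoofed speech."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    # False rejection rate: bona fide utterances scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed utterances scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

if __name__ == "__main__":
    eer, thr = compute_eer([0.9, 0.8, 0.3, 0.7], [0.1, 0.2, 0.75, 0.4])
    print(f"EER = {eer:.2%} at threshold {thr}")
```

A lower EER indicates better separation between bona fide and spoofed speech. The tandem detection cost function additionally accounts for the speaker verification system that the countermeasure protects and is not sketched here.
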
Notably, this study summarizes the experimental results of existing detection techniques on the ASVspoof 2019 and 2021 datasets in tabular form. Within this context, its primary focus is the generality of current detection techniques rather than specific forgery attack methods. Data augmentation applies a series of transformations to the original speech data, including noise addition, mask augmentation, channel augmentation, and compression augmentation, each of which aims to simulate complex real-world acoustic environments more effectively. Noise addition, one of the most common data processing methods, perturbs the speech signal with added noise to approximate the acoustic conditions of real scenarios as closely as possible. Mask augmentation applies masking in the time or frequency domain of speech to achieve noise suppression and enhancement of the speech signal, which improves the accuracy and robustness of detection techniques. Channel augmentation addresses the signal attenuation, data loss, and noise interference caused by changes in the codec and transmission channel of speech data. Compression augmentation addresses the degradation of speech quality during data compression; the main compression formats are MP3, M4A, and OGG. From the perspective of feature extraction and optimization, speech deepfake detection methods can be divided into handcrafted feature-based, hybrid feature-based, deep feature-based (end-to-end), and feature fusion-based methods. Handcrafted features are speech features extracted with the help of prior knowledge, mainly the constant-Q transform, linear frequency cepstral coefficients, and the Mel-spectrogram. By contrast, hybrid feature-based forgery detection methods use the domain knowledge provided by handcrafted features and mine richer speech representations through deep learning networks.
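
Two of the augmentations described above, noise addition and time/frequency masking, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the procedure of any surveyed method: the function names `add_noise_at_snr` and `mask_time_frequency`, the target-SNR formulation, the zero-fill masking, and the width limits are all hypothetical choices for the example. The noise signal is assumed to be non-silent, and the maximum mask widths are assumed not to exceed the spectrogram dimensions.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech so the result has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=float)
    # Tile or crop the noise so it covers the whole utterance.
    noise = np.resize(np.asarray(noise, dtype=float), speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)  # assumed to be non-zero
    # Scale the noise so that speech_power / scaled_noise_power = 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def mask_time_frequency(spec, max_freq_width, max_time_width, rng):
    """Zero out one random frequency band and one random time span of a
    (frequency x time) spectrogram; the input array is left unchanged."""
    spec = np.array(spec, dtype=float, copy=True)
    n_freq, n_time = spec.shape
    f_width = int(rng.integers(0, max_freq_width + 1))
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    spec[f_start:f_start + f_width, :] = 0.0
    t_width = int(rng.integers(0, max_time_width + 1))
    t_start = int(rng.integers(0, n_time - t_width + 1))
    spec[:, t_start:t_start + t_width] = 0.0
    return spec

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # toy "speech"
    noisy = add_noise_at_snr(speech, rng.standard_normal(8000), snr_db=10)
    masked = mask_time_frequency(np.ones((40, 100)), 8, 20, rng)
    print(noisy.shape, masked.shape)
```

Channel and compression augmentation typically require codec or transmission simulation through external tools, so they are left out of this sketch.
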
End-to-end forgery detection methods learn the feature representation and the classification model directly from raw speech signals, which eliminates the need for handcrafted feature extraction and allows the model to discover discriminative features from the input data automatically. These detection techniques can be trained on a single feature. Alternatively, feature-level fusion can combine multiple features, whether of the same type or of different types, using techniques such as weighted aggregation and feature concatenation. By fusing these features, detection techniques can capture richer speech information, which improves performance. For the learning mechanism, this study explores the impact of different training methods on forgery detection, in particular self-supervised learning, adversarial training, and multi-task learning. Self-supervised learning plays an important role in forgery detection by automatically generating auxiliary targets or labels from speech data to train models; fine-tuning a self-supervised pretrained model can effectively distinguish real from forged speech. Adversarial training-based forgery detection improves the robustness and generalization of the model by adding adversarial samples to the training data. In contrast to plain binary classification, forgery detection based on multi-task learning captures more comprehensive and useful speech feature information from different speech-related tasks by sharing the underlying feature representations, which improves detection performance while making effective use of the speech training data. Although speech deepfake detection techniques have achieved excellent performance on some datasets, their performance is less satisfactory when they are tested on speech from real-world scenarios.
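
The two feature-level fusion strategies named above, feature concatenation and weighted aggregation, can be illustrated with the following NumPy sketch. It is only a schematic: the function names are hypothetical, the features are assumed to be frame-aligned arrays of shape (frames, dimensions), and the fusion weights are fixed numbers passed through a softmax, whereas in a trained detector such weights would normally be learned together with the network.

```python
import numpy as np

def fuse_by_concatenation(features):
    """Concatenate frame-aligned feature matrices along the feature axis."""
    return np.concatenate([np.asarray(f, dtype=float) for f in features], axis=-1)

def fuse_by_weighted_sum(features, weights):
    """Aggregate same-shaped feature matrices with softmax-normalized weights."""
    stacked = np.stack([np.asarray(f, dtype=float) for f in features], axis=0)
    w = np.asarray(weights, dtype=float)
    w = np.exp(w - w.max())  # numerically stable softmax
    w = w / w.sum()
    # Contract the weight vector with the leading (feature-index) axis.
    return np.tensordot(w, stacked, axes=1)

if __name__ == "__main__":
    # Stand-ins for two front-end outputs with 5 frames and 4 dimensions each.
    feat_a = np.ones((5, 4))
    feat_b = 3.0 * np.ones((5, 4))
    print(fuse_by_concatenation([feat_a, feat_b]).shape)
    print(fuse_by_weighted_sum([feat_a, feat_b], [0.0, 0.0])[0])
```

Concatenation keeps every feature intact at the cost of a wider input, while weighted aggregation keeps the dimensionality fixed but requires the features to share a shape.
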
Analysis of the existing research indicates that the main future research directions are to establish diversified speech deepfake datasets, to study adversarial samples or data augmentation methods that improve the robustness of speech deepfake detection, to develop generalized speech deepfake detection techniques, and to explore interpretable speech deepfake detection techniques. The relevant datasets and code mentioned can be accessed from

https://github.com/media-sec-lab/Audio-Deepfake-Detection


references

Aihara R ， Takiguchi T and Ariki Y . 2013 . Individuality-preserving voice conversion for articulation disorders using locality-constrained NMF // Proceedings of the 4th Workshop on Speech and Language Processing for Assistive Technologies . Grenoble， France ： Association for Computational Linguistics： 3 - 8

Almutairi Z and Elgibreen H . 2022 . A review of modern audio deepfake detection methods： challenges and future directions . Algorithms ， 15 （ 5 ）： # 155 ［ DOI： 10.3390/a15050155 http://dx.doi.org/10.3390/a15050155 ］

Arif T ， Javed A ， Alhameed M ， Jeribi F and Tahir A . 2021 . Voice spoofing countermeasure for logical access attacks detection . IEEE Access ， # 9 ： 162857 - 162868 ［ DOI： 10.1109/ACCESS.2021.3133134 http://dx.doi.org/10.1109/ACCESS.2021.3133134 ］

Arik S Ö ， Chrzanowski M ， Coates A ， Diamos G ， Gibiansky A ， Kang Y G ， Li X ， Miller J ， Ng A ， Raiman J ， Sengupta S and Shoeybi M . 2017a . Deep voice： real-time neural text-to-speech // Proceedings of the 34th International Conference on Machine Learning . Sydney， Australia ： JMLR.org： 195 - 204 ［ DOI： 10.5555/3305381.3305402 http://dx.doi.org/10.5555/3305381.3305402 ］

Arik S Ö ， Diamos G ， Gibiansky A ， Miller J ， Peng K N ， Ping W ， Raiman J and Zhou Y Q . 2017b . Deep voice 2： multi-speaker neural text-to-speech // Proceedings of the 31st International Conference on Neural Information Processing Systems . Long Beach， USA ： Curran Associates Inc.： 2966 - 2974 ［ DOI： 10.5555/3294996.3295056 http://dx.doi.org/10.5555/3294996.3295056 ］

Attorresi L ， Salvi D ， Borrelli C ， Bestagini P and Tubaro S . 2022 . Combining automatic speaker verification and prosody analysis for synthetic speech detection // Pattern Recognition， Computer Vision， and Image Processing. ICPR 2022 International Workshops and Challenges . Montréal， Canada ： Springer-Verlag： 247 - 263 ［ DOI： 10.1007/978-3-031-37742-6_21 http://dx.doi.org/10.1007/978-3-031-37742-6_21 ］

Ba Z J ， Wen Q ， Cheng P ， Wang Y W ， Lin F ， Lu L and Liu Z G . 2023 . Transferring audio deepfake detection capability across languages // Proceedings of 2023 ACM Web Conference . Austin， USA ： ACM： 2033 - 2044 ［ DOI： 10.1145/3543507.3583222 http://dx.doi.org/10.1145/3543507.3583222 ］

Bevinamarad P R and Shirldonkar M S . 2020 . Audio forgery detection techniques： present and past review // Proceedings of the 4th International Conference on Trends in Electronics and Informatics （ICOEI）（48184） . Tirunelveli， India ： IEEE： 613 - 618 ［ DOI： 10.1109/ICOEI48184.2020.9143014 http://dx.doi.org/10.1109/ICOEI48184.2020.9143014 ］

Bińkowski M ， Donahue J ， Dieleman S ， Clark A ， Elsen E ， Casagrande N ， Cobo L C and Simonyan K . 2019 . High fidelity speech synthesis with adversarial networks // Proceedings of the 8th International Conference on Learning Representations . Addis Ababa， Ethiopia ： ICLR

C􀅡ceres J ， Font R ， Grau T and Molina J . 2021 . The biometric vox system for the ASVspoof 2021 challenge //Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. ［s.l.］： ISCA： 68 - 74 ［ DOI： 10.21437/ASVSPOOF.2021-11 http://dx.doi.org/10.21437/ASVSPOOF.2021-11 ］

Cai Z X and Li M . 2022 . Invertible voice conversion ［EB/OL］. ［ 2023-06-30 ］. http://arxiv.org/pdf/2201.10687.pdf http://arxiv.org/pdf/2201.10687.pdf

Chen N X ， Zhang Y ， Zen H G ， Weiss R J ， Norouzi M and Chan W . 2020a . WaveGrad： estimating gradients for waveform generation // Proceedings of the 9th International Conference on Learning Representations . Virtual Event ： ICLR

Chen T X ， Khoury E ， Phatak K and Sivaraman G . 2021a . Pindrop Labs’ submission to the ASVspoof 2021 challenge //Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. ［s.l.］： ISCA： 89 - 93 ［ DOI： 10.21437/ASVSPOOF.2021-14 http://dx.doi.org/10.21437/ASVSPOOF.2021-14 ］

Chen T X ， Kumar A ， Nagarsheth P ， Sivaraman G and Khoury E . 2020b . Generalization of audio deepfake detection // The Speaker and Language Recognition Workshop （Odyssey 2020） . Tokyo， Japan ： ISCA： 132 - 137 ［ DOI： 10.21437/Odyssey.2020-19 http://dx.doi.org/10.21437/Odyssey.2020-19 ］

Chen X H ， Zhang Y ， Zhu G and Duan Z Y . 2021b . UR channel-robust synthetic speech detection system for ASVspoof 2021//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. ［s.l.］： ISCA ： 75 - 82 ［ DOI： 10.21437/ASVSPOOF.2021-12 http://dx.doi.org/10.21437/ASVSPOOF.2021-12 ］

Chen Y N ， Chu M ， Chang E ， Liu J and Liu R S . 2003 . Voice conversion with smoothed GMM and MAP adaptation // Proceedings of the 8th European Conference on Speech Communication and Technology （Eurospeech 2003） . Geneva， Switzerland ： ISCA： 2413 - 2416 ［ DOI： 10.21437/Eurospeech.2003-664 http://dx.doi.org/10.21437/Eurospeech.2003-664 ］

Choi S ， Kwak I Y and Oh S . 2022 . Overlapped frequency-distributed network： frequency-aware voice spoofing countermeasure // Proceedings of the 23rd Annual Conference of the International Speech Communication Association . Incheon， Korea （South）： ISCA： 3558 - 3562 ［ DOI： 10.21437/Interspeech.2022-657 http://dx.doi.org/10.21437/Interspeech.2022-657 ］

Chou J C and Lee H Y . 2019 . One-shot voice conversion by separating speaker and content representations with instance normalization // Proceedings of Interspeech 2019 ， the 20th Annual Conference of the International Speech Communication Association. Graz， Austria ： ISCA： 664 - 668 ［ DOI： 10.21437/Interspeech.2019-2663 http://dx.doi.org/10.21437/Interspeech.2019-2663 ］

Cong J ， Yang S ， Xie L and Su D . 2021 . Glow-WaveGAN： learning speech representations from GAN-based variational auto-encoder for high fidelity flow-based speech synthesis // Proceedings of Interspeech 2021 ， the 22nd Annual Conference of the International Speech Communication Association. Brno， Czechia： 2182 - 2186 ［ DOI： 10.21437/Interspeech.2021-414 http://dx.doi.org/10.21437/Interspeech.2021-414 ］

Cohen A ， Rimon I ， Aflalo E and Permuter H H . 2022 . A study on data augmentation in voice anti-spoofing . Speech Communication ， 141 ： 56 - 67 ［ DOI： 10.1016/j.specom.2022.04.005 http://dx.doi.org/10.1016/j.specom.2022.04.005 ］

Das R K . 2021 . Known-unknown data augmentation strategies for detection of logical access， physical access and speech deepfake attacks： ASVspoof 2021//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. ［s.l.］： ISCA ： 29 - 36 ［ DOI： 10.21437/ASVSPOOF.2021-5 http://dx.doi.org/10.21437/ASVSPOOF.2021-5 ］

Das R K ， Yang J C and Li H Z . 2021 . Data augmentation with signal companding for detection of logical access attacks // Proceedings of 2021 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Toronto， Canada ： IEEE： 6349 - 6353 ［ DOI： 10.1109/ICASSP39728.2021.9413501 http://dx.doi.org/10.1109/ICASSP39728.2021.9413501 ］

Delgado H ， Evans N ， Kinnunen T ， Lee K A ， Liu X C ， Nautsch A ， Patino J ， Sahidullah M ， Todisco M ， Wang X and Yamagishi J . 2021 . ASVspoof 2021： automatic speaker verification spoofing and countermeasures challenge evaluation plan ［EB/OL］. ［ 2023-06-30 ］. https://arxiv.org/pdf/2109.00535.pdf https://arxiv.org/pdf/2109.00535.pdf

Dhar S ， Jana N D and Das S . 2023 . An adaptive-learning-based generative adversarial network for one-to-one voice conversion . IEEE Transactions on Artificial Intelligence ， 4 （ 1 ）： 92 - 106 ［ DOI： 10.1109/TAI.2022.3149858 http://dx.doi.org/10.1109/TAI.2022.3149858 ］

Dixit A ， Kaur N and Kingra S . 2023 . Review of audio deepfake detection techniques： issues and prospects . Expert Systems ， 40 （ 8 ）： #e 13322 ［ DOI： 10.1111/exsy.13322 http://dx.doi.org/10.1111/exsy.13322 ］

Donahue C ， McAuley J and Puckette M . 2018 . Adversarial audio synthesis // Proceedings of the 7th International Conference on Learning Representations . OrleansNew， USA ： ICLR

Elias I ， Zen H G ， Shen J ， Zhang Y ， Jia Y ， W eiss R J and Wu Y H . 2021 . Parallel tacotron： non-autoregressive and controllable TTS // Proceedings of 2021 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Toronto， Canada ： IEEE： 5709 - 5713 ［ DOI： 10.1109/ICASSP39728.2021.9414718 http://dx.doi.org/10.1109/ICASSP39728.2021.9414718 ］

Ergünay S K ， Khoury E ， Lazaridis A and Marcel S . 2015 . On the vulnerability of speaker verification to realistic voice spoofing // Proceedings of the 7th International Conference on Biometrics Theory， Applications and Systems （BTAS） . Arlington， USA ： IEEE： 1 - 6 ［ DOI： 10.1109/BTAS.2015.7358783 http://dx.doi.org/10.1109/BTAS.2015.7358783 ］

Fathan A ， Alam J and Kang W H . 2022 . Mel-spectrogram image-based end-to-end audio deepfake detection under channel-mismatched conditions // Proceedings of 2022 IEEE International Conference on Multimedia and Expo （ICME） . Taipei， China ： IEEE： 1 - 6 ［ DOI： 10.1109/ICME52920.2022.9859621 http://dx.doi.org/10.1109/ICME52920.2022.9859621 ］

Frank J and Schönherr L . 2021 . WaveFake： a data set to facilitate audio deepfake detection //Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 （NeurIPS Datasets and Benchmarks 2021）. ［s.l.］：［s.n.］

Fu Q C ， Teng Z W ， White J ， Powell M E and Schmidt D C . 2022 . FastAudio： a learnable audio front-end for spoof speech detection // Proceedings of 2022 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Singapore， Singapore ： IEEE： 3693 - 3697 ［ DOI： 10.1109/ICASSP43922.2022.9746722 http://dx.doi.org/10.1109/ICASSP43922.2022.9746722 ］

Ge W Y ， Panariello M ， Patino J ， Todisco M and Evans N . 2021a . Partially-connected differentiable architecture search for deepfake and spoofing detection // Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association . Brno， Czechia ： ISCA： 4319 - 4323 ［ DOI： 10.21437/Interspeech.2021-1187 http://dx.doi.org/10.21437/Interspeech.2021-1187 ］

Ge W Y ， Patino J ， Todisco M and Evans N . 2021b . Raw differentiable architecture search for speech deepfake and spoofing detection ［EB/OL］. ［ 2023-06-30 ］. http://arxiv.org/pdf/2107.12212.pdf http://arxiv.org/pdf/2107.12212.pdf

Gomez-Alanis A ， Peinado A M ， Gonzalez J A and Gomez A M . 2019 . A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection // Proceedings of the 20th Annual Conference of the International Speech Communication Association . Graz， Austria ： ISCA： 1068 - 1072 ［ DOI： 10.21437/Interspeech.2019-2212 http://dx.doi.org/10.21437/Interspeech.2019-2212 ］

Gong Y ， Yang J ， Huber J ， MacKnight M and Poellabauer C . 2019 . ReMASC： realistic replay attack corpus for voice controlled systems // Proceedings of Interspeech 2019 ， the 20th Annual Conference of the International Speech Communication Association. Graz， Austria ： ISCA： 2355 - 2359 ［ DOI： 10.21437/Interspeech.2019-1541 http://dx.doi.org/10.21437/Interspeech.2019-1541 ］

Griffin D and Lim J . 1984 . Signal estimation from modified short-time Fourier transform . IEEE Transactions on Acoustics， Speech， and Signal Processing ， 32 （ 2 ）： 236 - 243 ［ DOI： 10.1109/TASSP.1984.1164317 http://dx.doi.org/10.1109/TASSP.1984.1164317 ］

Guo H J ， Liu C R ， Ishi C T and Ishiguro H . 2023 . QuickVC： any-to-many voice conversion using inverse short-time fourier transform for faster conversion ［EB/OL］. ［ 2023-06-30 ］. https://arxiv.org/pdf/2302.08296v4.pdf https://arxiv.org/pdf/2302.08296v4.pdf

Gupta P ， Chodingala P K and Patil H A . 2022 . Energy separation based instantaneous frequency estimation from quadrature and in-phase components for replay spoof detection // Proceedings of the 30th European Signal Processing Conference （EUSIPCO） . Belgrade， Serbia ： IEEE： 369 - 373 ［ DOI： 10.23919/EUSIPCO55093.2022.9909533 http://dx.doi.org/10.23919/EUSIPCO55093.2022.9909533 ］

Gupta P and Patil H A . 2022 . Linear frequency residual cepstral features for replay spoof detection on ASVspoof 2019//Proceedings of the 30th European Signal Processing Conference （EUSIPCO） . Belgrade， Serbia ： IEEE ： 349 - 353 ［ DOI： 10.23919/EUSIPCO55093.2022.9909913 http://dx.doi.org/10.23919/EUSIPCO55093.2022.9909913 ］

Hassan F and Javed A . 2021 . Voice spoofing countermeasure for synthetic speech detection // Proceedings of 2021 International Conference on Artificial Intelligence （ICAI） . Islamabad， Pakistan ： IEEE： 209 - 212 ［ DOI： 10.1109/ICAI52203.2021.9445238 http://dx.doi.org/10.1109/ICAI52203.2021.9445238 ］

Helander E ， Virtanen T ， Nurminen J and Gabbouj M . 2010 . Voice conversion using partial least squares regression . IEEE Transactions on Audio， Speech， and Language Processing ， 18 （ 5 ）： 912 - 921 ［ DOI： 10.1109/TASL.2010.2041699 http://dx.doi.org/10.1109/TASL.2010.2041699 ］

Hsu W N ， Zhang Y ， Weiss R J ， Zen H G ， Wu Y H ， Wang Y X ， Cao Y ， Jia Y ， Chen Z F ， Shen J ， Nguyen P and Pang R M . 2018 . Hierarchical generative modeling for controllable speech synthesis // Proceedings of the 7th International Conference on Learning Representations . New Orleans， USA ： ICLR

Hu C L ， Zhou R H and Yuan Q S . 2023 . Replay speech detection based on dual-input hierarchical fusion network . Applied Sciences ， 13 （ 9 ）： # 5350 ［ DOI： 10.3390/app13095350 http://dx.doi.org/10.3390/app13095350 ］

Hua G ， Teoh A B J and Zhang H J . 2021 . Towards end-to-end synthetic speech detection . IEEE Signal Processing Letters ， 28 ： 1265 - 1269 ［ DOI： 10.1109/LSP.2021.3089437 http://dx.doi.org/10.1109/LSP.2021.3089437 ］

Huang W C ， Hayashi T ， Watanabe S and Toda T . 2020 . The sequence-to-sequence baseline for the voice conversion challenge 2020： cascading ASR and TTS // Proceedings Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020 . Shanghai， China ： ISCA： 160 - 164 ［ DOI： 10.21437/VCCBC.2020-24 http://dx.doi.org/10.21437/VCCBC.2020-24 ］

Huang R J ， Lam M W Y ， Wang J ， Su D ， Yu D ， Ren Y and Zhao Z . 2022 . Fastdiff： a fast conditional diffusion model for high-quality speech synthesis // Proceedings of the 31st International Joint Conference on Artificial Intelligence Main Track . Vienna， Austria ： IJCAI： 4157 - 4163 ［ DOI： 10.24963/ijcai.2022/577 http://dx.doi.org/10.24963/ijcai.2022/577 ］

Hunt A J and Black A W . 1996 . Unit selection in a concatenative speech synthesis system using a large speech database // Proceedings of 1996 IEEE International Conference on Acoustics， Speech， and Signal Processing Conference Proceedings . Atlanta， USA ： IEEE： 373 - 376 ［ DOI： 10.1109/ICASSP.1996.541110 http://dx.doi.org/10.1109/ICASSP.1996.541110 ］

Ito A and Horiguchi S . 2023 . Spoofing attacker also benefits from self-supervised pretrained model // Proceedings of Interspeech 2023 . Dublin， Ireland ： ISCA： 5346 - 5350 ［ DOI： 10.21437/Interspeech.2023-270 http://dx.doi.org/10.21437/Interspeech.2023-270 ］

Javed A ， Malik K M ， Malik H and Irtaza A . 2022 . Voice spoofing detector： a unified anti-spoofing framework . Expert Systems with Applications ， 198 ： # 116770 ［ DOI： 10.1016/j.eswa.2022.116770 http://dx.doi.org/10.1016/j.eswa.2022.116770 ］

Jeong M ， Kim H ， Cheon S J ， Choi B J and Kim N S . 2021 . Diff-TTS： a denoising diffusion model for text-to-speech // Proceedings of Interspeech 2021 . Brno， Czechia ： ISCA： 3605 - 3609 ［ DOI： 10.21437/Interspeech.2021-469 http://dx.doi.org/10.21437/Interspeech.2021-469 ］

Jiang Z Y ， Zhu H C ， Peng L ， Ding W B and Ren Y Z . 2020 . Self-supervised spoofing audio detection scheme // Proceedings of the 21st Annual Conference of the International Speech Communication Association . Shanghai， China ： ISCA： 4223 - 4227 ［ DOI： 10.21437/Interspeech.2020-1760 http://dx.doi.org/10.21437/Interspeech.2020-1760 ］

Jung J W ， Heo H S ， Tak H ， Shim H J ， Chung J S ， Lee B J ， Yu H J and Evans N . 2022 . AASIST： audio anti-spoofing using integrated spectro-temporal graph attention networks // Proceedings of 2022 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Singapore， Singapore ： IEEE： 6367 - 6371 ［ DOI： 10.1109/ICASSP43922.2022.9747766 http://dx.doi.org/10.1109/ICASSP43922.2022.9747766 ］

Jung J W ， Kim S B ， Shim H J ， Kim J H and Yu H J . 2020 . Improved RawNet with feature map scaling for text-independent speaker verification using raw waveforms // Proceedings of the 21st Annual Conference of the International Speech Communication Association . Shanghai， China ： ISCA： 1496 - 1500 ［ DOI： 10.21437/Interspeech.2020-1011 http://dx.doi.org/10.21437/Interspeech.2020-1011 ］

Kalchbrenner N ， Elsen E ， Simonyan K ， Noury S ， Casagrande N ， Lockhart E ， Stimberg F ， van den Oord A ， Dieleman S and Kavukcuoglu K . 2018 . Efficient neural audio synthesis // Proceedings of the 35th International Conference on Machine Learning . Stockholm， Sweden ： PMLR： 2410 - 2419

Kamble M R ， Sailor H B ， Patil H A and Li H Z . 2020 . Advances in anti-spoofing： from the perspective of ASVspoof challenges . APSIPA Transactions on Signal and Information Processing ， 9 （ 1 ）： # 21 ［ DOI： 10.1017/ATSIP.2019.21 http://dx.doi.org/10.1017/ATSIP.2019.21 ］

Kameoka H ， Kaneko T ， Tanaka K and Hojo N . 2018 . StarGAN-VC： non-parallel many-to-many voice conversion using star generative adversarial networks // 2018 IEEE Spoken Language Technology Workshop （SLT） . Athens， Greece ： IEEE： 266 - 273 ［ DOI： 10.1109/SLT.2018.8639535 http://dx.doi.org/10.1109/SLT.2018.8639535 ］

Kameoka H ， Kaneko T ， Tanaka K and Hojo N . 2019 . ACVAE-VC： non-parallel voice conversion with auxiliary classifier variational autoencoder . IEEE/ACM Transactions on Audio， Speech， and Language Processing ， 27 （ 9 ）： 1432 - 1443 ［ DOI： 10.1109/TASLP.2019.2917232 http://dx.doi.org/10.1109/TASLP.2019.2917232 ］

Kameoka H ， Tanaka K ， Kwaśny D ， Kaneko T and Hojo N . 2020 . ConvS2S-VC： fully convolutional sequence-to-sequence voice conversion . IEEE/ACM Transactions on Audio， Speech， and Language Processing ， 28 ： 1849 - 1863 ［ DOI： 10.1109/TASLP.2020.3001456 http://dx.doi.org/10.1109/TASLP.2020.3001456 ］

Kaneko T and Kameoka H . 2018 . CycleGAN-VC： non-parallel voice conversion using Cycle-consistent adversarial networks // Proceedings of the 26th European Signal Processing Conference （EUSIPCO） . Roma， Italy ： IEEE： 2100 - 2104 ［ DOI： 10.23919/EUSIPCO.2018.8553236 http://dx.doi.org/10.23919/EUSIPCO.2018.8553236 ］

Kaneko T ， Kameoka H ， Tanaka K and Hojo N . 2019a . CycleGAN-VC2： improved CycleGan-based non-parallel voice conversion // Proceedings of 2019 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Brighton， UK ： IEEE： 6820 - 6824 ［ DOI： 10.1109/ICASSP.2019.8682897 http://dx.doi.org/10.1109/ICASSP.2019.8682897 ］

Kaneko T ， Kameoka H ， Tanaka K and Hojo N . 2019b . StarGAN-VC2： rethinking conditional methods for StarGAN-based voice conversion // Proceedings of the 20th Annual Conference of the International Speech Communication Association . Graz， Austria ： ISCA： 679 - 683 ［ DOI： 10.21437/Interspeech.2019-2236 http://dx.doi.org/10.21437/Interspeech.2019-2236 ］

Kaneko T ， Kameoka H ， Tanaka K and Hojo N . 2020 . CycleGAN-VC3： examining and improving CycleGan-VCs for mel-spectrogram conversion // Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association . Shanghai， China ： ISCA： 2017 - 2021

Kang W H ， Alam J and Fathan A . 2021 . CRIM’s system description for the ASVspoof2021 challenge //Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. ［s.l.］： ISCA： 100 - 106 ［ DOI： 10.21437/ASVSPOOF.2021-16 http://dx.doi.org/10.21437/ASVSPOOF.2021-16 ］

Kawahara H . 2006 . STRAIGHT， exploitation of the other aspect of VOCODER： perceptually isomorphic decomposition of speech sounds . Acoustical Science and Technology ， 27 （ 6 ）： 349 - 353 ［ DOI： 10.1250/ast.27.349 http://dx.doi.org/10.1250/ast.27.349 ］

Khanjani Z ， Watson G and Janeja V P . 2023 . Audio deepfakes： a survey . Frontiers in Big Data ， 5 ： # 1001063 ［ DOI： 10.3389/fdata.2022.1001063 http://dx.doi.org/10.3389/fdata.2022.1001063 ］

Kim J ， Kim S ， Kong J and Yoon S . 2020 . Glow-TTS： a generative flow for text-to-speech via monotonic alignment search // Proceedings of the 33rd Advances in Neural Information Processing Systems . 33 ： 8067 - 8077

Kim J ， Kong J and Son J . 2021 . Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech //Proceedings of the 38th International Conference on Machine Learning. ［s.l.］： PMLR： 5530 - 5540

Kingma D P and Dhariwal P . 2018 . Glow： generative flow with invertible 1 × 1 convolutions//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal， Canada： Curran Associates Inc.： 10236 - 10245 ［ DOI： 10.5555/3327546.3327685 http://dx.doi.org/10.5555/3327546.3327685 ］

Kinnunen T ， Lee K A ， Delgado H ， Evans N W D ， Todisco M ， Sahidullah M ， Yamagishi J and Reynolds D A . 2019 . t-DCF： a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification // 2018 Speaker and Language Recognition Workshop ， Odyssey 2018 . Les Sables d’Olonne， France： 312 - 319 ［ DOI： 10.21437/Odyssey.2018-44 http://dx.doi.org/10.21437/Odyssey.2018-44 ］

Kinnunen T ， Sahidullah M ， Delgado H ， Todisco M ， Evans N W D ， Yamagishi J and Lee K A . 2017 . The ASVspoof 2017 challenge： assessing the limits of replay spoofing attack detection // Proceedings of the 18th Interspeech Annual Conference of the International Speech Communication Association . Stockholm， Sweden ： ISCA： 2 - 6 ［ DOI： 10.21437/Interspeech.2017-1111 http://dx.doi.org/10.21437/Interspeech.2017-1111 ］

Kong J ， Kim J and Bae J . 2020 . Hifi-GAN： generative adversarial networks for efficient and high fidelity speech synthesis // Proceedings of the 34th International Conference on Neural Information Processing Systems . Vancouver， Canada ： Curran Associates Inc.： 17022 - 17033 ［ DOI： 10.5555/3495724.3497152 http://dx.doi.org/10.5555/3495724.3497152 ］

Kong Z F ， Ping W ， Huang J J ， Zhao K X and Catanzaro B . 2021 . DiffWave： a versatile diffusion model for audio synthesis // Proceedings of the 9th International Conference on Learning Representations . Virtual Event ： ICLR

Kwak I Y ， Kwag S ， Lee J ， Huh J H ， Lee C H ， Jeon Y ， Hwang J and Yoon J W . 2021 . ResMax： detecting voice spoofing attacks with residual network and max feature map // Proceedings of the 25th International Conference on Pattern Recognition （ICPR） . Milan， Italy ： IEEE： 4837 - 4844 ［ DOI： 10.1109/ICPR48806.2021.9412165 http://dx.doi.org/10.1109/ICPR48806.2021.9412165 ］

Le M ， Vyas A ， Shi B W ， Karrer B ， Sari L ， Moritz R ， Williamson M ， Manohar V ， Adi Y ， Mahadeokar J and Hsu W N . 2023 . Voicebox： text-guided multilingual universal speech generation at scale ［EB/OL］. ［ 2023-08-28 ］. http://arxiv.org/pdf/2306.15687.pdf http://arxiv.org/pdf/2306.15687.pdf

Lee S G ， Ping W ， Ginsburg B ， Catanzaro B and Yoon S ， 2023 . BigVGAN： a universal neural vocoder with large-scale training // Proceedings of the 11th International Conference on Learning Representations . Kigali， Rwanda ： ICLR

Lei Y ， Huo X ， Jiao Y Z and Li Y K ， 2021 . Deep metric learning for replay attack detection //Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. ［s.l.］： ISCA： 42 - 46 ［ DOI： 10.21437/ASVSPOOF.2021-7 http://dx.doi.org/10.21437/ASVSPOOF.2021-7 ］

Lei Y ， Yang S ， Cong J ， Xie L and Su D . 2022 . Glow-WaveGAN 2： high-quality zero-shot text-to-speech synthesis and any-to-any voice conversion // Proceedings of the 23rd Interspeech Annual Conference of the International Speech Communication Association . Incheon， Korea（South）： ISCA： 2563 - 2567 ［ DOI： 10.21437/Interspeech.2022-684 http://dx.doi.org/10.21437/Interspeech.2022-684 ］

Lei Z C ， Yang Y G ， Liu C H and Ye J H . 2020 . Siamese convolutional neural network using Gaussian probability feature for spoofing speech detection // Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association . Shanghai， China ： ISCA： 1116 - 1120 ［ DOI： 10.21437/Interspeech.2020-2723 http://dx.doi.org/10.21437/Interspeech.2020-2723 ］

Li J L ， Wang H X ， He P S ， Abdullahi S M and Li B . 2022 . Long-term variable Q transform： a novel time-frequency transform algorithm for synthetic speech detection . Digital Signal Processing ， 120 ： # 103256 ［ DOI： 10.1016/j.dsp.2021.103256 http://dx.doi.org/10.1016/j.dsp.2021.103256 ］

Li N H ， Liu S J ， Liu Y Q ， Zhao S and Liu M . 2019 . Neural speech synthesis with Transformer network // Proceedings of the 33rd AAAI Conference on Artificial Intelligence . Honolulu， USA ： AAAI Press： 6706 - 6713 ［ DOI： 10.1609/aaai.v33i01.33016706 http://dx.doi.org/10.1609/aaai.v33i01.33016706 ］

Li T L ， Liu Y C ， Hu C X and Zhao H . 2021a . CVC： contrastive learning for non-parallel voice conversion // Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association . Brno， Czechia ： ISCA： 1324 - 1328 ［ DOI： 10.21437/Interspeech.2021-137 http://dx.doi.org/10.21437/Interspeech.2021-137 ］

Li X ， Li N ， Weng C ， Liu X Y ， Su D ， Yu D and Meng H L . 2021b . Replay and synthetic speech detection with Res2Net architecture // Proceedings of 2021 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Toronto， Canada ： IEEE： 6354 - 6358 ［ DOI： 10.1109/ICASSP39728.2021.9413828 http://dx.doi.org/10.1109/ICASSP39728.2021.9413828 ］

Li X L ， Yu N H ， Zhang X P ， Zhang W M ， Li B ， Lu W ， Wang W and Liu X L . 2021 . Overview of digital media forensics technology . Journal of Image and Graphics ， 26 （ 6 ）： 1216 - 1226 （in Chinese） ［ DOI： 10.11834/jig.210081 http://dx.doi.org/10.11834/jig.210081 ］

Lian Z ， Wen Z Q ， Zhou X Y ， Pu S B ， Zhang S K and Tao J H . 2020 . ARVC： an auto-regressive voice conversion system without parallel training data // Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association . Shanghai， China ： ISCA： 4706 - 4710 ［ DOI： 10.21437/Interspeech.2020-1715 http://dx.doi.org/10.21437/Interspeech.2020-1715 ］

Lin J H ， Lin Y Y ， Chien C M and Lee H Y . 2021b . S2VC： a framework for any-to-any voice conversion with self-supervised pretrained representations // Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association . Brno， Czechia ： ISCA： 836 - 840 ［ DOI： 10.21437/Interspeech.2021-1356 http://dx.doi.org/10.21437/Interspeech.2021-1356 ］

Lin Y Y ， Chien C M ， Lin J H ， Lee H Y and Lee L S . 2021a . FragmentVC： any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention // Proceedings of 2021 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Toronto， Canada ： IEEE： 5939 - 5943 ［ DOI： 10.1109/ICASSP39728.2021.9413699 http://dx.doi.org/10.1109/ICASSP39728.2021.9413699 ］

Liu R ， Zhang J H ， Gao G L and Li H Z . 2023a . Betray oneself： a novel audio deepfake detection model via mono-to-stereo conversion ［EB/OL］. ［ 2023-06-30 ］. https://arxiv.org/pdf/2305.16353v1.pdf

Liu X C ， Sahidullah M ， Lee K A and Kinnunen T . 2023b . Speaker-aware anti-spoofing // Proceedings of Interspeech 2023 ， the Annual Conference of the International Speech Communication Association. Dublin， Ireland ： ISCA： 2498 - 2502 ［ DOI： 10.21437/Interspeech.2023-1323 http://dx.doi.org/10.21437/Interspeech.2023-1323 ］

Liu Z J ， Guo Y W and Yu K . 2023c . DiffVoice： text-to-speech with latent diffusion // Proceedings of 2023 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Rhodes Island， Greece ： IEEE： 1 - 5 ［ DOI： 10.1109/ICASSP49357.2023.10095100 http://dx.doi.org/10.1109/ICASSP49357.2023.10095100 ］

Luo R Q ， Tan X ， Wang R ， Qin T ， Li J Z ， Zhao S ， Chen E H and Liu T Y . 2021 . LightSpeech： lightweight and fast text to speech with neural architecture search // Proceedings of 2021 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Toronto， Canada ： IEEE： 5699 - 5703 ［ DOI： 10.1109/ICASSP39728.2021.9414403 http://dx.doi.org/10.1109/ICASSP39728.2021.9414403 ］

Ma H X ， Yi J Y ， Tao J H ， Bai Y ， Tian Z K and Wang C L . 2021a . Continual learning for fake audio detection // Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association . Brno， Czechia ： ISCA： 886 - 890 ［ DOI： 10.21437/Interspeech.2021-794 http://dx.doi.org/10.21437/Interspeech.2021-794 ］

Ma H X ， Yi J Y ， Wang C L ， Yan X R ， Tao J H ， Wang T ， Wang S M ， Xu L and Fu R B . 2022 . FAD ： a Chinese dataset for fake audio detection// Proceedings of the 36th Conference on Neural Information Processing Systems （NeurIPS 2022 ）. ［s.l.］： Zenodo： #6635521 ［ DOI： 10.5281/zenodo.6635521 http://dx.doi.org/10.5281/zenodo.6635521 ］

Ma K J ， Feng Y F ， Chen B J and Zhao G Y . 2023a . End-to-end dual-branch network towards synthetic speech detection . IEEE Signal Processing Letters ， 30 ： 359 - 363 ［ DOI： 10.1109/LSP.2023.3262419 http://dx.doi.org/10.1109/LSP.2023.3262419 ］

Ma Y X ， Ren Z Z and Xu S G . 2021b . RW-ResNet： a novel speech anti-spoofing model using raw waveform // Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association . Brno， Czechia ： ISCA： 4144 - 4148 ［ DOI： 10.21437/Interspeech.2021-438 http://dx.doi.org/10.21437/Interspeech.2021-438 ］

Ma X Y ， Zhang S S ， Huang S ， Gao J ， Hu Y and He L . 2023b . How to boost anti-spoofing with X-vectors // Proceedings of 2022 IEEE Spoken Language Technology Workshop （SLT） . Doha， Qatar ： IEEE： 593 - 598 ［ DOI： 10.1109/SLT54892.2023.10022504 http://dx.doi.org/10.1109/SLT54892.2023.10022504 ］

Mandalapu H ， Ramachandra R and Busch C . 2021 . Smartphone audio replay attacks dataset // Proceedings of 2021 IEEE International Workshop on Biometrics and Forensics （IWBF） . Rome， Italy ： IEEE： 1 - 6 ［ DOI： 10.1109/IWBF50991.2021.9465096 http://dx.doi.org/10.1109/IWBF50991.2021.9465096 ］

Martín-Doñas J M and Álvarez A . 2022 . The Vicomtech audio deepfake detection system based on Wav2vec2 for the 2022 ADD Challenge // Proceedings of 2022 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Singapore， Singapore ： IEEE： 9241 - 9245 ［ DOI： 10.1109/ICASSP43922.2022.9747768 http://dx.doi.org/10.1109/ICASSP43922.2022.9747768 ］

Mittal A and Dua M . 2022 . Automatic speaker verification systems and spoof detection techniques： review and analysis . International Journal of Speech Technology ， 25 （ 1 ）： 105 - 134 ［ DOI： 10.1007/s10772-021-09876-2 http://dx.doi.org/10.1007/s10772-021-09876-2 ］

Mohammadi S H . 2015 . Reducing one-to-many problem in voice conversion by equalizing the formant locations using dynamic frequency warping ［EB/OL］. ［ 2023-08-28 ］. http://arxiv.org/pdf/1510.04205.pdf

Morise M ， Yokomori F and Ozawa K . 2016 . WORLD： a vocoder-based high-quality speech synthesis system for real-time applications . IEICE Transactions on Information and Systems ， 99 （ 7 ）： 1877 - 1884 ［ DOI： 10.1587/transinf.2015EDP7457 http://dx.doi.org/10.1587/transinf.2015EDP7457 ］

Müller N ， Dieckmann F ， Czempin P ， Canals R ， Böttinger K and Williams J . 2021 . Speech is silver， silence is golden： what do ASVspoof-trained models really learn? // Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. ［s.l.］： ISCA： 55 - 60 ［ DOI： 10.21437/ASVSPOOF.2021-9 http://dx.doi.org/10.21437/ASVSPOOF.2021-9 ］

Müller N ， Czempin P ， Dieckmann F ， Froghyar A and Böttinger K . 2022 . Does audio deepfake detection generalize? // Proceedings of the 23rd Interspeech Annual Conference of the International Speech Communication Association . Incheon， Korea （South）： ISCA： 2783 - 2787 ［ DOI： 10.21437/Interspeech.2022-108 http://dx.doi.org/10.21437/Interspeech.2022-108 ］

Nguyen B and Cardinaux F . 2022 . NVC-Net： end-to-end adversarial voice conversion // Proceedings of 2022 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Singapore， Singapore ： IEEE： 7012 - 7016 ［ DOI： 10.1109/ICASSP43922.2022.9747020 http://dx.doi.org/10.1109/ICASSP43922.2022.9747020 ］

OpenAI . 2023 . GPT-4 technical report ［EB/OL］. ［ 2023-08-28 ］. http://arxiv.org/pdf/2303.08774.pdf

Park D S ， Chan W ， Zhang Y ， Chiu C C ， Zoph B ， Cubuk E D and Le Q V . 2019 . SpecAugment： a simple data augmentation method for automatic speech recognition // Proceedings of the 20th Annual Conference of the International Speech Communication Association . Graz， Austria ： ISCA： 2613 - 2617 ［ DOI： 10.21437/Interspeech.2019-2680 http://dx.doi.org/10.21437/Interspeech.2019-2680 ］

Park S W ， Kim D Y and Joe M C . 2020 . Cotatron： transcription-guided speech encoder for any-to-many voice conversion without parallel data // Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association . Shanghai， China ： ISCA： 4696 - 4700 ［ DOI： 10.21437/Interspeech.2020-1542 http://dx.doi.org/10.21437/Interspeech.2020-1542 ］

Peng K N ， Ping W ， Song Z and Zhao K X . 2020 . Non-autoregressive neural text-to-speech // Proceedings of the 37th International Conference on Machine Learning ， ICML 2020. ［s.l.］： PMLR ： 7586 - 7598

Ping W ， Peng K N ， Gibiansky A ， Arik S Ö ， Kannan A ， Narang S ， Raiman J and Miller J . 2017 . Deep Voice 3： scaling text-to-speech with convolutional sequence learning // Proceedings of the 6th International Conference on Learning Representations . Vancouver， Canada ： ICLR

Prenger R ， Valle R and Catanzaro B . 2019 . WaveGlow： a flow-based generative network for speech synthesis // Proceedings of 2019 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Brighton， UK ： IEEE： 3617 - 3621 ［ DOI： 10.1109/ICASSP.2019.8683143 http://dx.doi.org/10.1109/ICASSP.2019.8683143 ］

Qian K Z ， Zhang Y ， Chang S Y ， Yang X S and Hasegawa-Johnson M . 2019 . AutoVC： zero-shot voice style transfer with only autoencoder loss // Proceedings of the 36th International Conference on Machine Learning . Long Beach， USA ： PMLR： 5210 - 5219

Qian Y ， Fan Y C ， Hu W P and Soong F K . 2014 . On the training aspects of deep neural network （DNN） for parametric TTS synthesis // Proceedings of 2014 IEEE International Conference on Acoustics， Speech and Signal Processing . Florence， Italy ： IEEE： 3829 - 3833 ［ DOI： 10.1109/ICASSP.2014.6854318 http://dx.doi.org/10.1109/ICASSP.2014.6854318 ］

Ranjan R ， Vatsa M and Singh R . 2022 . STATNet： spectral and temporal features based multi-task network for audio spoofing detection // Proceedings of 2022 IEEE International Joint Conference on Biometrics （IJCB） . Abu Dhabi， United Arab Emirates ： IEEE： 1 - 9 ［ DOI： 10.1109/IJCB54206.2022.10007949 http://dx.doi.org/10.1109/IJCB54206.2022.10007949 ］

Ranjan R ， Vatsa M and Singh R . 2023 . Uncovering the deceptions： an analysis on audio spoofing detection and future prospects // Proceedings of the 32nd International Joint Conference on Artificial Intelligence ， IJCAI 2023. Macao， China ： IJCAI： 6750 - 6758 ［ DOI： 10.24963/ijcai.2023/756 http://dx.doi.org/10.24963/ijcai.2023/756 ］

Reimao R and Tzerpos V . 2019 . FoR： a dataset for synthetic speech detection // Proceedings of 2019 International Conference on Speech Technology and Human-Computer Dialogue （SpeD） . Timisoara， Romania ： IEEE： 1 - 10 ［ DOI： 10.1109/SPED.2019.8906599 http://dx.doi.org/10.1109/SPED.2019.8906599 ］

Ren Y ， Hu C X ， Tan X ， Qin T ， Zhao S ， Zhao Z and Liu T Y . 2022 . FastSpeech 2： fast and high-quality end-to-end text to speech // Proceedings of the 9th International Conference on Learning Representations . Virtual Event ： ICLR

Ren Y ， Ruan Y J ， Tan X ， Qin T ， Zhao S ， Zhao Z and Liu T Y . 2019 . FastSpeech ： fast， robust and controllable text to speech// Proceedings of the 33rd International Conference on Neural Information Processing Systems . Vancouver， Canada ： Curran Associates Inc.： 3171 - 3180 ［ DOI： 10.5555/3454287.3454572 http://dx.doi.org/10.5555/3454287.3454572 ］

Ren Y Z ， Liu C Y ， Liu W Y and Wang L N . 2021 . A survey on speech forgery and detection . Journal of Signal Processing ， 37 （ 12 ）： 2412 - 2439 （in Chinese） ［ DOI： 10.16798/j.issn.1003-0530.2021.12.011 http://dx.doi.org/10.16798/j.issn.1003-0530.2021.12.011 ］

Rostami A M ， Homayounpour M M and Nickabadi A . 2021 . Efficient attention branch network with combined loss function for automatic speaker verification spoof detection . Circuits， Systems， and Signal Processing ， 42 （ 7 ）： 4252 - 4270 ［ DOI： 10.1007/s00034-023-02314-5 http://dx.doi.org/10.1007/s00034-023-02314-5 ］

Sahidullah M ， Delgado H ， Todisco M ， Kinnunen T ， Evans N ， Yamagishi J and Lee K A . 2019 . Introduction to voice presentation attack detection and recent advances // Marcel S， Nixon M S， Fierrez J and Evans N， eds. Handbook of Biometric Anti-Spoofing . Cham， Switzerland ： Springer： 321 - 361 ［ DOI： 10.1007/978-3-319-92627-8_15 http://dx.doi.org/10.1007/978-3-319-92627-8_15 ］

Saito D ， Yamamoto K ， Minematsu N and Hirose K . 2011 . One-to-many voice conversion based on tensor representation of speaker space // Proceedings of the 12th Annual Conference of the International Speech Communication Association . Florence， Italy ： ISCA： 653 - 656 ［ DOI： 10.21437/Interspeech.2011-268 http://dx.doi.org/10.21437/Interspeech.2011-268 ］

Serrà J ， Pascual S and Segura C . 2019 . Blow： a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion // Proceedings of the 33rd Conference on Neural Information Processing Systems . Vancouver， Canada ： NIPS： 6790 - 6800

Shen J ， Pang R M ， Weiss R J ， Schuster M ， Jaitly N ， Yang Z H ， Chen Z F ， Zhang Y ， Wang Y X ， Skerry-Ryan R J ， Saurous R A ， Agiomyrgiannakis Y and Wu Y H . 2018 . Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions // Proceedings of 2018 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Calgary， Canada ： IEEE： 4779 - 4783 ［ DOI： 10.1109/ICASSP.2018.8461368 http://dx.doi.org/10.1109/ICASSP.2018.8461368 ］

Shim H J ， Heo H S ， Jung J W and Yu H J . 2019 . Self-supervised pre-training with acoustic configurations for replay spoofing detection // Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association . Shanghai， China ： ISCA： 1091 - 1095 ［ DOI： 10.21437/Interspeech.2020-1345 http://dx.doi.org/10.21437/Interspeech.2020-1345 ］

Song E ， Yamamoto R ， Hwang M J ， Kim J S ， Kwon O and Kim J M . 2021 . Improved parallel WaveGAN vocoder with perceptually weighted spectrogram loss // Proceedings of 2021 IEEE Spoken Language Technology Workshop （SLT） . Shenzhen， China ： IEEE： 470 - 476 ［ DOI： 10.1109/SLT48900.2021.9383549 http://dx.doi.org/10.1109/SLT48900.2021.9383549 ］

Stylianou Y . 2001 . Applying the harmonic plus noise model in concatenative speech synthesis . IEEE Transactions on Speech and Audio Processing ， 9 （ 1 ）： 21 - 29 ［ DOI： 10.1109/89.890068 http://dx.doi.org/10.1109/89.890068 ］

Su Z P ， Li M K ， Zhang G F ， Wu Q F ， Li M Q ， Zhang W M and Yao X . 2023 . Robust audio copy-move forgery detection using constant Q spectral sketches and GA-SVM . IEEE Transactions on Dependable and Secure Computing ， 20 （ 5 ）： 4016 - 4031 ［ DOI： 10.1109/TDSC.2022.3215280 http://dx.doi.org/10.1109/TDSC.2022.3215280 ］

Sun C Z ， Jia S ， Hou S W and Lü S W . 2023 . AI-synthesized voice detection using neural vocoder artifacts // Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops （CVPRW） . Vancouver， Canada ： IEEE： 904 - 912 ［ DOI： 10.1109/CVPRW59228.2023.00097 http://dx.doi.org/10.1109/CVPRW59228.2023.00097 ］

Tak H ， Jung J W ， Patino J ， Kamble M ， Todisco M and Evans N . 2021a . End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection //Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. ［s.l.］： ISCA： 1 - 8 ［ DOI： 10.21437/ASVSPOOF.2021-1 http://dx.doi.org/10.21437/ASVSPOOF.2021-1 ］

Tak H ， Kamble M ， Patino J ， Todisco M and Evans N . 2022a . RawBoost： a raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing // Proceedings of 2022 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Singapore， Singapore ： IEEE： 6382 - 6386 ［ DOI： 10.1109/ICASSP43922.2022.9746213 http://dx.doi.org/10.1109/ICASSP43922.2022.9746213 ］

Tak H ， Patino J ， Todisco M ， Nautsch A ， Evans N and Larcher A . 2021b . End-to-end anti-spoofing with RawNet2//Proceedings of 2021 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Toronto， Canada ： IEEE ： 6369 - 6373 ［ DOI： 10.1109/ICASSP39728.2021.9414234 http://dx.doi.org/10.1109/ICASSP39728.2021.9414234 ］

Tak H ， Todisco M ， Wang X ， Jung J W ， Yamagishi J and Evans N . 2022b . Automatic speaker verification spoofing and deepfake detection using Wav2vec 2.0 and data augmentation // Odyssey 2022 ， The Speaker and Language Recognition Workshop （Odyssey 2022）. Beijing， China ： ISCA： 112 - 119 ［ DOI： 10.21437/Odyssey.2022-16 http://dx.doi.org/10.21437/Odyssey.2022-16 ］

Tan C B ， Hijazi M H A ， Khamis N ， Nohuddin P N E B ， Zainol Z ， Coenen F and Gani A . 2021 . A survey on presentation attack detection for automatic speaker verification systems： state-of-the-art， taxonomy， issues and future direction . Multimedia Tools and Applications ， 80 （ 21 ）： 32725 - 32762 ［ DOI： 10.1007/s11042-021-11235-x http://dx.doi.org/10.1007/s11042-021-11235-x ］

Tang H Z ， Zhang X L ， Wang J Z ， Cheng N ， Zeng Z ， Xiao E and Xiao J . 2021 . TGAVC： improving autoencoder voice conversion with text-guided and adversarial training // Proceedings of 2021 IEEE Automatic Speech Recognition and Understanding Workshop （ASRU） . Cartagena， Colombia ： IEEE： 938 - 945 ［ DOI： 10.1109/ASRU51503.2021.9688088 http://dx.doi.org/10.1109/ASRU51503.2021.9688088 ］

Tao J H ， Fu R B ， Yi J Y ， Wang C L and Wang T . 2020 . Development and challenge of speech forgery and detection . Journal of Cyber Security ， 5 （ 2 ）： 28 - 38 （in Chinese） ［ DOI： 10.19363/J.cnki.cn10-1380/tn.2020.02.03 http://dx.doi.org/10.19363/J.cnki.cn10-1380/tn.2020.02.03 ］

Teng Z W ， Fu Q C ， White J ， Powell M E and Schmidt D C . 2022 . SA-SASV： an end-to-end spoof-aggregated spoofing-aware speaker verification system // Proceedings of Interspeech 2022 ， the 23rd Annual Conference of the International Speech Communication Association. Incheon， Korea （South）： ISCA： 4391 - 4395 ［ DOI： 10.21437/Interspeech.2022-11029 http://dx.doi.org/10.21437/Interspeech.2022-11029 ］

Todisco M ， Wang X ， Vestman V ， Sahidullah M ， Delgado H ， Nautsch A ， Yamagishi J ， Evans N ， Kinnunen T H and Lee K A . 2019 . ASVspoof 2019： future horizons in spoofed and fake audio detection // Proceedings of the 20th Interspeech Annual Conference of the International Speech Communication Association . Graz， Austria ： ISCA： 1008 - 1012 ［ DOI： 10.21437/Interspeech.2019-2249 http://dx.doi.org/10.21437/Interspeech.2019-2249 ］

Tokuda K ， Nankaku Y ， Toda T ， Zen H G ， Yamagishi J and Oura K . 2013 . Speech synthesis based on hidden Markov models . Proceedings of the IEEE ， 101 （ 5 ）： 1234 - 1252 ［ DOI： 10.1109/JPROC.2013.2251852 http://dx.doi.org/10.1109/JPROC.2013.2251852 ］

Tokuda K ， Yoshimura T ， Masuko T ， Kobayashi T and Kitamura T . 2000 . Speech parameter generation algorithms for HMM-based speech synthesis // Proceedings of 2000 IEEE International Conference on Acoustics， Speech， and Signal Processing （ICASSP） . Istanbul， Turkey ： IEEE： 1315 - 1318 ［ DOI： 10.1109/ICASSP.2000.861820 http://dx.doi.org/10.1109/ICASSP.2000.861820 ］

Tomilov A ， Svishchev A ， Volkova M ， Chirkovskiy A ， Kondratev A and Lavrentyeva G . 2021 . STC antispoofing systems for the ASVspoof2021 challenge //Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. ［s.l.］： ISCA： 61 - 67 ［ DOI： 10.21437/ASVSPOOF.2021-10 http://dx.doi.org/10.21437/ASVSPOOF.2021-10 ］

van den Oord A ， Dieleman S ， Zen H G ， Simonyan K ， Vinyals O ， Graves A ， Kalchbrenner N ， Senior A W and Kavukcuoglu K . 2016 . WaveNet： a generative model for raw audio // The 9th ISCA Speech Synthesis Workshop . Sunnyvale， USA ： ISCA： #125

van den Oord A ， Li Y Z ， Babuschkin I ， Simonyan K ， Vinyals O ， Kavukcuoglu K ， van den Driessche G ， Lockhart E ， Cobo L C ， Stimberg F ， Casagrande N ， Grewe D ， Noury S ， Dieleman S ， Elsen E ， Kalchbrenner N ， Zen H G ， Graves A ， King H L ， Walters T ， Belov D and Hassabis D . 2018 . Parallel WaveNet： fast high-fidelity speech synthesis // Proceedings of the 35th International Conference on Machine Learning . Stockholm， Sweden ： PMLR： 3918 - 3926

Wang C L ， Yi J Y ， Tao J H ， Sun H Y ， Chen X ， Tian Z K ， Ma H X ， Fan C H and Fu R B . 2022a . Fully automated end-to-end fake audio detection // Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia . Lisboa， Portugal ： Association for Computing Machinery： 27 - 33 ［ DOI： 10.1145/3552466.3556530 http://dx.doi.org/10.1145/3552466.3556530 ］

Wang C L ， Yi J Y ， Tao J H ， Zhang C Y ， Zhang S and Chen X . 2023a . Detection of cross-dataset fake audio based on prosodic and pronunciation features ［EB/OL］. ［ 2023-06-30 ］. http://arxiv.org/pdf/2305.13700.pdf

Wang C L ， Yi J Y ， Tao J H ， Zhang C Y ， Zhang S ， Fu R B and Chen X . 2023b . TO-RawNet： improving RawNet with TCN and orthogonal regularization for fake audio detection ［EB/OL］. ［ 2023-06-30 ］. http://arxiv.org/pdf/2305.13701.pdf

Wang L ， Yeoh B and Ng J W . 2022b . Synthetic voice detection and audio splicing detection using SE-Res2Net-Conformer architecture // The 13th International Symposium on Chinese Spoken Language Processing （ISCSLP） . Singapore， Singapore ： IEEE： 115 - 119 ［ DOI： 10.1109/ISCSLP57327.2022.10037999 http://dx.doi.org/10.1109/ISCSLP57327.2022.10037999 ］

Wang Q Q ， Zhang X L ， Wang J Z ， Cheng N and Xiao J . 2022c . DRVC： a framework of any-to-any voice conversion with self-supervised learning // Proceedings of 2022 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Singapore， Singapore ： IEEE： 3184 - 3188 ［ DOI： 10.1109/ICASSP43922.2022.9747434 http://dx.doi.org/10.1109/ICASSP43922.2022.9747434 ］

Wang X M ， Qin X Y ， Zhu T L ， Wang C ， Zhang S L and Li M . 2021 . The DKU-CMRI system for the ASVspoof 2021 challenge： vocoder based replay channel response estimation // Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. ［s.l.］： ISCA： 16 - 21 ［ DOI： 10.21437/ASVSPOOF.2021-3 http://dx.doi.org/10.21437/ASVSPOOF.2021-3 ］

Wang Y X ， Skerry-Ryan R J ， Stanton D ， Wu Y H ， Weiss R J ， Jaitly N ， Yang Z H ， Xiao Y ， Chen Z F ， Bengio S ， Le Q V ， Agiomyrgiannakis Y ， Clark R and Saurous R A . 2017 . Tacotron： towards end-to-end speech synthesis // Proceedings of the 18th Interspeech Annual Conference of the International Speech Communication Association . Stockholm， Sweden ： ISCA： 4006 - 4010 ［ DOI： 10.21437/Interspeech.2017-1452 http://dx.doi.org/10.21437/Interspeech.2017-1452 ］

Wang Z Y and Hansen J H L . 2022 . Audio anti-spoofing using simple attention module and joint optimization based on additive angular margin loss and meta-learning // Proceedings of the 23rd Interspeech Annual Conference of the International Speech Communication Association . Incheon， Korea（South）： ISCA： 376 - 380 ［ DOI： 10.21437/Interspeech.2022-904 http://dx.doi.org/10.21437/Interspeech.2022-904 ］

Weiss R J ， Skerry-Ryan R J ， Battenberg E ， Mariooryad S and Kingma D P . 2021 . Wave-Tacotron： spectrogram-free end-to-end text-to-speech synthesis // Proceedings of 2021 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Toronto， Canada ： IEEE： 5679 - 5683 ［ DOI： 10.1109/ICASSP39728.2021.9413851 http://dx.doi.org/10.1109/ICASSP39728.2021.9413851 ］

Wu H B ， Liu A T and Lee H Y . 2020a . Defense for black-box attacks on anti-spoofing models by self-supervised learning // Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association . Shanghai， China ： ISCA： 3780 - 3784 ［ DOI： 10.21437/Interspeech.2020-2026 http://dx.doi.org/10.21437/Interspeech.2020-2026 ］

Wu H B ， Liu S X ， Meng H L and Lee H Y . 2020b . Defense against adversarial attacks on spoofing countermeasures of ASV // Proceedings of 2020 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Barcelona， Spain ： IEEE： 6564 - 6568 ［ DOI： 10.1109/ICASSP40776.2020.9053643 http://dx.doi.org/10.1109/ICASSP40776.2020.9053643 ］

Wu Z Z ， Das R K ， Yang J C and Li H Z . 2020c . Light convolutional neural network with feature genuinization for detection of synthetic speech attacks // Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association . Shanghai， China ： ISCA： 1101 - 1105 ［ DOI： 10.21437/Interspeech.2020-1810 http://dx.doi.org/10.21437/Interspeech.2020-1810 ］

Wu Z Z ， Kinnunen T ， Evans N ， Yamagishi J ， Hanilçi C ， Sahidullah M and Sizov A . 2015 . ASVspoof 2015： the first automatic speaker verification spoofing and countermeasures challenge // Proceedings of the 16th Annual Conference of the International Speech Communication Association . Dresden， Germany ： ISCA： 2037 - 2041 ［ DOI： 10.21437/Interspeech.2015-462 http://dx.doi.org/10.21437/Interspeech.2015-462 ］

Xu X X ， Shi L ， Chen X Q ， Lin P Y ， Lian J ， Chen J H ， Zhang Z H and Hancock E R . 2023 . Any-to-any voice conversion with multi-layer speaker adaptation and content supervision . IEEE/ACM Transactions on Audio， Speech， and Language Processing ， 31 ： 3431 - 3445 ［ DOI： 10.1109/TASLP.2023.3306716 http://dx.doi.org/10.1109/TASLP.2023.3306716 ］

Xue J ， Fan C H ， Yi J Y ， Wang C L ， Wen Z Q ， Zhang D and Lü Z . 2023 . Learning from yourself： a self-distillation method for fake speech detection // Proceedings of 2023 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Rhodes Island， Greece ： IEEE： 1 - 5 ［ DOI： 10.1109/ICASSP49357.2023.10096837 http://dx.doi.org/10.1109/ICASSP49357.2023.10096837 ］

Yadav A K S ， Bhagtani K ， Xiang Z Y ， Bestagini P ， Tubaro S and Delp E J . 2023 . DSVAE： interpretable disentangled representation for synthetic speech detection ［EB/OL］. ［ 2023-06-30 ］. http://arxiv.org/pdf/2304.03323.pdf

Yamagishi J ， Todisco M ， Sahidullah M ， Delgado H ， Wang X ， Evans N ， Kinnunen T ， Lee K A ， Vestman V and Nautsch A . 2019 . ASVspoof 2019： automatic speaker verification spoofing and countermeasures challenge evaluation plan ［EB/OL］. ［ 2023-10-20 ］. https://www.asvspoof.org/asvspoof2019/asvspoof2019_evaluation_plan.pdf

Yamamoto R ， Song E ， Hwang M J and Kim J M . 2021 . Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators // Proceedings of 2021 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Toronto， Canada ： IEEE： 6039 - 6043 ［ DOI： 10.1109/ICASSP39728.2021.9413369 http://dx.doi.org/10.1109/ICASSP39728.2021.9413369 ］

Yamamoto R ， Song E and Kim J M . 2020 . Parallel WaveGAN： a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram // Proceedings of 2020 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Barcelona， Spain ： IEEE： 6199 - 6203 ［ DOI： 10.1109/ICASSP40776.2020.9053795 http://dx.doi.org/10.1109/ICASSP40776.2020.9053795 ］

Yan X R ， Yi J Y ， Tao J H ， Wang C L ， Zhang C Y and Fu R B . 2023 . System fingerprint recognition for deepfake audio： an initial dataset and investigation ［EB/OL］. ［ 2023-08-28 ］. http://arxiv.org/pdf/2208.10489.pdf

Yang J ， Lee J ， Kim Y I ， Cho H Y and Kim I . 2020 . VocGAN： a high-fidelity real-time vocoder with a hierarchically-nested adversarial network // Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association . Shanghai， China ： ISCA： 200 - 204 ［ DOI： 10.21437/Interspeech.2020-1238 http://dx.doi.org/10.21437/Interspeech.2020-1238 ］

Yang J C and Das R K . 2020 . Long-term high frequency features for synthetic speech detection . Digital Signal Processing ， 97 ： # 102622 ［ DOI： 10.1016/j.dsp.2019.102622 http://dx.doi.org/10.1016/j.dsp.2019.102622 ］

Yang S ， Qiao K ， Chen J ， Wang L Y and Yan B . 2022 . Overview on speech synthesis， forgery and detection technology . Computer Systems and Applications ， 31 （ 7 ）： 12 - 22 （in Chinese） ［ DOI： 10.15888/j.cnki.csa.008641 http://dx.doi.org/10.15888/j.cnki.csa.008641 ］

Yi J Y ， Bai Y ， Tao J H ， Ma H X ， Tian Z K ， Wang C L ， Wang T and Fu R B . 2021 . Half-Truth： a partially fake audio detection dataset // Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association . Brno， Czechia ： ISCA： 1654 - 1658 ［ DOI： 10.21437/Interspeech.2021-930 http://dx.doi.org/10.21437/Interspeech.2021-930 ］

Yi J Y ， Fu R B ， Tao J H ， Nie S ， Ma H X ， Wang C L ， Wang T ， Tian Z K ， Bai Y ， Fan C H ， Liang S ， Wang S M ， Zhang S ， Yan X R ， Xu L ， Wen Z Q and Li H Z . 2022a . ADD 2022： the first audio deep synthesis detection challenge // Proceedings of 2022 IEEE International Conference on Acoustics， Speech and Signal Processing （ICASSP） . Singapore， Singapore ： IEEE： 9216 - 9220 ［ DOI： 10.1109/ICASSP43922.2022.9746939 http://dx.doi.org/10.1109/ICASSP43922.2022.9746939 ］

Yi J Y ， Tao J H ， Fu R B ， Yan X R ， Wang C L ， Wang T ， Zhang C Y ， Zhang X H ， Zhao Y ， Ren Y ， Xu L ， Zhou J Z ， Gu H ， Wen Z Q ， Liang S ， Lian Z ， Nie S and Li H Z . 2023 . ADD 2023： the second audio deepfake detection challenge ［EB/OL］. ［ 2023-08-28 ］. http://arxiv.org/pdf/2305.13774.pdf

Yi J Y ， Wang C L ， Tao J H ， Tian Z K ， Fan C H ， Ma H X and Fu R B . 2022b . SceneFake： an initial dataset and benchmarks for scene fake audio detection ［EB/OL］. ［ 2023-06-30 ］. https://arxiv.org/pdf/2211.06073v1.pdf

Yoon S and Yu H J . 2021 . Multiple-point input and time-inverted speech signal for the ASVspoof 2021 Challenge //Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. ［s.l.］： ISCA： 37 - 41 ［ DOI： 10.21437/ASVSPOOF.2021-6 http://dx.doi.org/10.21437/ASVSPOOF.2021-6 ］

Zeinali H ， Stafylakis T ， Athanasopoulou G ， Rohdin J ， Gkinis I ， Burget L and Černocký J H . 2019 . Detecting spoofing attacks using VGG and SincNet： BUT-Omilia submission to ASVspoof 2019 challenge // Proceedings of the 20th Interspeech Annual Conference of the International Speech Communication Association . Graz， Austria ： ISCA： 1073 - 1077 ［ DOI： 10.21437/Interspeech.2019-2892 http://dx.doi.org/10.21437/Interspeech.2019-2892 ］

Zen H G ， Senior A and Schuster M . 2013 . Statistical parametric speech synthesis using deep neural networks // Proceedings of 2013 IEEE International Conference on Acoustics， Speech and Signal Processing . Vancouver， Canada ： IEEE： 7962 - 7966 ［ DOI： 10.1109/ICASSP.2013.6639215 http://dx.doi.org/10.1109/ICASSP.2013.6639215 ］

Zen H G ， Tokuda K and Black A W . 2009 . Statistical parametric speech synthesis . Speech Communication ， 51 （ 11 ）： 1039 - 1064 ［ DOI： 10.1016/j.specom.2009.04.004 http://dx.doi.org/10.1016/j.specom.2009.04.004 ］

Zhang D ， Li S M ， Zhang X ， Zhan J ， Wang P Y ， Zhou Y Q and Qiu X P . 2023a . SpeechGPT： empowering large language models with intrinsic cross-modal conversational abilities // Findings of the Association for Computational Linguistics： EMNLP 2023 . Singapore， Singapore ： Association for Computational Linguistics： 15757 - 15773 ［ DOI： 10.18653/v1/2023.findings-emnlp.1055 http://dx.doi.org/10.18653/v1/2023.findings-emnlp.1055 ］

Zhang H, Yuan T, Chen J K, Li X T, Zheng R J, Huang Y X, Chen X J, Gong E L, Chen Z Y, Hu X G, Yu D H, Ma Y J and Huang L. 2022a. PaddleSpeech: an easy-to-use all-in-one speech toolkit // Proceedings of 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations. Washington, USA: Association for Computational Linguistics: 114-123 [DOI: 10.18653/v1/2022.naacl-demo.12]

Zhang L, Wang X, Cooper E, Evans N and Yamagishi J. 2023b. The PartialSpoof database and countermeasures for the detection of short fake speech segments embedded in an utterance. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31: 813-825 [DOI: 10.1109/TASLP.2022.3233236]

Zhang L, Wang X, Cooper E and Yamagishi J. 2021a. Multi-task learning in utterance-level and segmental-level spoof detection // 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: NII Yamagishi Laboratory

Zhang X W, Li J K, Sun M and Zheng L L. 2020. Speech anti-spoofing: the state of the art and prospects. Journal of Data Acquisition and Processing, 35(5): 807-823 (in Chinese) [DOI: 10.16337/j.1004-9037.2020.05.002]

Zhang Y, Jiang F and Duan Z Y. 2021b. One-class learning towards synthetic voice spoofing detection. IEEE Signal Processing Letters, 28: 937-941 [DOI: 10.1109/LSP.2021.3076358]

Zhang Y, Jiang F, Zhu G, Chen X H and Duan Z Y. 2023c. Generalizing voice presentation attack detection to unseen synthetic attacks and channel variation // Marcel S, Fierrez J and Evans N, eds. Handbook of Biometric Anti-Spoofing: Presentation Attack Detection and Vulnerability Assessment. Singapore, Singapore: Springer Nature: 421-443 [DOI: 10.1007/978-981-19-5288-3_15]

Zhang Y, Zhu G and Duan Z Y. 2022b. A probabilistic fusion framework for spoofing aware speaker verification // Odyssey 2022: The Speaker and Language Recognition Workshop. Beijing, China: ISCA: 77-84 [DOI: 10.21437/Odyssey.2022-11]

Zhang Y, Zhu G, Jiang F and Duan Z Y. 2021c. An empirical study on channel effects for synthetic voice spoofing countermeasure systems // Proceedings of Interspeech 2021, the 22nd Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 4309-4313 [DOI: 10.21437/Interspeech.2021-1820]

Zhang Y J, Pan S F, He L and Ling Z H. 2019. Learning latent representations for style control and transfer in end-to-end speech synthesis // Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE: 6945-6949 [DOI: 10.1109/ICASSP.2019.8683623]

Zhang Z Y, Gu Y W, Yi X W and Zhao X F. 2021d. FMFCC-A: a challenging Mandarin dataset for synthetic speech detection // Digital Forensics and Watermarking: 20th International Workshop, IWDW 2021. Beijing, China: Springer-Verlag: 117-131 [DOI: 10.1007/978-3-030-95398-0_9]

Zhao Y, Yi J Y, Tao J H, Wang C L, Zhang C Y, Wang T and Dong Y F. 2022. EmoFake: an initial dataset for emotion fake audio detection [EB/OL]. [2023-06-30]. http://arxiv.org/pdf/2211.05363.pdf
