Research progress on speech deepfake and its detection techniques
2024, Vol. 29, No. 8, Pages 2236-2268
Print publication date: 2024-08-16
DOI: 10.11834/jig.230476
Xu Yuxiong, Li Bin, Tan Shunquan, Huang Jiwu. 2024. Research progress on speech deepfake and its detection techniques. Journal of Image and Graphics, 29(08):2236-2268
Speech deepfake technology uses deep learning methods to synthesize or generate speech. The rapid iteration and optimization of artificial intelligence-generated content technologies have driven notable gains in the naturalness, fidelity, and diversity of forged speech, while also posing great challenges to speech deepfake detection. This paper comprehensively reviews research progress on speech deepfakes and their detection. First, it introduces forgery techniques represented by speech synthesis (SS) and voice conversion (VC). It then describes the datasets and evaluation metrics commonly used in speech deepfake detection. On this basis, existing detection techniques are categorized and analyzed along the processing pipeline: data augmentation, feature extraction and optimization, and learning mechanisms. Specifically, the paper analyzes how augmentation strategies such as noise addition, mask enhancement, channel enhancement, and compression enhancement affect detection performance; compares the strengths and weaknesses of handcrafted feature-based, hybrid feature-based, end-to-end, and feature fusion-based detection methods from the perspective of feature extraction and optimization; and discusses training paradigms from the perspective of learning mechanisms such as self-supervised learning, adversarial training, and multi-task learning. Finally, it summarizes the open challenges facing speech deepfake detection and outlines future research directions. The datasets and code compiled in this paper are available at https://github.com/media-sec-lab/Audio-Deepfake-Detection.
Speech deepfake technology, which employs deep learning methods to synthesize or generate speech, has emerged as a critical research hotspot in multimedia information security. The rapid iteration and optimization of artificial intelligence-generated content technologies have significantly advanced speech deepfake techniques, enhancing the naturalness, fidelity, and diversity of synthesized speech. However, these advances have also presented great challenges for speech deepfake detection technology. To address these challenges, this study comprehensively reviews recent research progress on speech deepfake generation and its detection techniques. Based on an extensive literature survey, this study first introduces the research background of speech forgery and its detection and compares and analyzes previously published reviews in this field. Second, it provides a concise overview of speech deepfake generation, especially speech synthesis (SS) and voice conversion (VC). SS, commonly known as text-to-speech (TTS), analyzes input text and generates corresponding speech by applying linguistic rules. Various deep models are employed in TTS, including sequence-to-sequence models, flow models, generative adversarial network models, variational auto-encoder models, and diffusion models. VC modifies acoustic features, such as emotion, accent, pronunciation, and speaker identity, to produce natural, human-like speech. Depending on the number of target speakers, VC algorithms can be categorized into single-, multiple-, and arbitrary-target speech conversion. Third, this study introduces commonly used datasets in speech deepfake detection, provides access links to the open-source ones, and describes the two standard evaluation metrics: the equal error rate and the tandem detection cost function. It then categorizes and analyzes existing speech deepfake detection techniques in detail, studying and comparing the pros and cons of different approaches with a primary focus on data processing, feature extraction and optimization, and learning mechanisms.
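The equal error rate mentioned above can be made concrete with a short sketch. This is an illustrative implementation only, not the evaluation code used by the ASVspoof challenges: it sweeps a decision threshold over countermeasure scores and reports the operating point where the false-acceptance and false-rejection rates meet.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate (EER): the operating point where the rate of
    spoofed speech accepted as bona fide (FAR) equals the rate of
    bona fide speech rejected (FRR). Higher score = more bona fide."""
    n_bona, n_spoof = len(bonafide_scores), len(spoof_scores)
    best_gap, eer = float("inf"), 1.0
    # Sweep the decision threshold over every observed score value.
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        frr = sum(s <= t for s in bonafide_scores) / n_bona
        far = sum(s > t for s in spoof_scores) / n_spoof
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# A well-separated system scores 0% EER; heavy overlap pushes it toward 50%.
print(compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # → 0.0
```

The tandem detection cost function (t-DCF) extends this idea by weighting the countermeasure's errors by their downstream cost to an automatic speaker verification system.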
Notably, this study summarizes the experimental results of existing detection techniques on the ASVspoof 2019 and 2021 datasets in tabular form. Within this context, the primary focus of this study is the generality of current detection techniques rather than countermeasures against specific forgery attacks. Data augmentation applies a series of transformations to the original speech data, including noise addition, mask enhancement, channel enhancement, and compression enhancement, each aiming to simulate complex real-world acoustic environments more faithfully. Noise addition, one of the most common data processing methods, perturbs the speech signal with additive noise to approximate the acoustics of real scenarios as closely as possible. Mask enhancement applies masking operations in the time or frequency domain of speech to suppress noise and enhance the signal, improving the accuracy and robustness of detection techniques. Transmission channel enhancement addresses the signal attenuation, data loss, and noise interference caused by changes in the codec and transmission channel of speech data. Compression enhancement addresses the degradation of speech quality during data compression; the main compression formats are MP3, M4A, and OGG. From the perspective of feature extraction and optimization, speech deepfake detection methods can be divided into handcrafted feature-, hybrid feature-, deep feature-, and feature fusion-based methods. Handcrafted features are speech features extracted with the help of prior knowledge, mainly including the constant-Q transform, linear frequency cepstral coefficients, and the Mel-spectrogram. By contrast, hybrid feature-based forgery detection methods exploit the domain knowledge embodied in handcrafted features while mining richer speech representations through deep learning networks.
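The data augmentation operations described above, noise addition and mask enhancement in particular, can be sketched minimally in plain Python (the helper names are hypothetical; real pipelines typically use library implementations such as those in torchaudio or audiomentations):

```python
import math
import random

def add_noise(signal, noise, snr_db):
    """Additive-noise augmentation: scale the noise so the mixture has
    the requested signal-to-noise ratio in dB, simulating a noisy
    real-world recording environment."""
    p_sig = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]

def freq_mask(spec, max_width, rng=random):
    """Mask enhancement: zero out one random band of frequency bins in
    a (frames x bins) spectrogram, SpecAugment-style; a time mask works
    analogously across frames."""
    n_bins = len(spec[0])
    width = rng.randrange(0, max_width + 1)
    start = rng.randrange(0, n_bins - width + 1)
    return [[0.0 if start <= j < start + width else v
             for j, v in enumerate(frame)] for frame in spec]
```

At 0 dB SNR the noise is scaled to match the signal power exactly; higher SNR values leave the signal progressively cleaner.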
End-to-end forgery detection methods learn feature representations and classification models directly from raw speech signals, eliminating the need for handcrafted feature extraction and allowing the model to discover discriminative features from the input data automatically. Such detection techniques can be trained on a single feature. Alternatively, feature-level fusion can combine multiple features, whether identical or different, using techniques such as weighted aggregation and feature concatenation; by fusing these features, detection techniques capture richer speech information, which improves performance. Regarding learning mechanisms, this study explores the impact of different training methods on forgery detection, especially self-supervised learning, adversarial training, and multi-task learning. Self-supervised learning plays an important role in forgery detection by automatically generating auxiliary targets or labels from the speech data itself; fine-tuning a self-supervised pretrained model can effectively distinguish real from forged speech. Adversarial training-based forgery detection enhances the robustness and generalization of the model by adding adversarial samples to the training data. In contrast to plain binary classification, forgery detection based on multi-task learning captures more comprehensive and useful speech feature information from different speech-related tasks by sharing the underlying feature representations, improving detection performance while making effective use of the speech training data. Although speech deepfake detection techniques have achieved excellent performance on some datasets, their performance remains less satisfactory on speech data from real-world scenarios.
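The two feature-level fusion strategies named above can be sketched as follows. This is a toy illustration on plain lists; in a real detector the inputs would be learned embeddings and the aggregation weights would themselves be trained:

```python
def concat_fusion(embeddings):
    """Feature concatenation: stack per-branch embeddings (e.g. an
    LFCC branch and a CQT branch) into one longer fused vector."""
    fused = []
    for emb in embeddings:
        fused.extend(emb)
    return fused

def weighted_fusion(embeddings, weights):
    """Weighted aggregation: element-wise weighted sum of equal-length
    embeddings, preserving the original dimensionality."""
    dim = len(embeddings[0])
    return [sum(w * emb[i] for emb, w in zip(embeddings, weights))
            for i in range(dim)]
```

Concatenation preserves all branch information at the cost of a larger classifier input; weighted aggregation keeps the dimensionality fixed but forces the branches into a shared space.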
Analysis of the existing research points to several main directions for future work: establishing diversified speech deepfake datasets, studying adversarial samples and data augmentation methods for enhancing the robustness of speech deepfake detection techniques, building generalized detection techniques, and exploring interpretable detection techniques. The relevant datasets and code can be accessed from
https://github.com/media-sec-lab/Audio-Deepfake-Detection.
speech deepfake; speech deepfake detection; speech synthesis (SS); voice conversion (VC); artificial intelligence-generated content (AIGC); self-supervised learning; adversarial training
Aihara R, Takiguchi T and Ariki Y. 2013. Individuality-preserving voice conversion for articulation disorders using locality-constrained NMF//Proceedings of the 4th Workshop on Speech and Language Processing for Assistive Technologies. Grenoble, France: Association for Computational Linguistics: 3-8
Almutairi Z and Elgibreen H. 2022. A review of modern audio deepfake detection methods: challenges and future directions. Algorithms, 15(5): #155 [DOI: 10.3390/a15050155http://dx.doi.org/10.3390/a15050155]
Arif T, Javed A, Alhameed M, Jeribi F and Tahir A. 2021. Voice spoofing countermeasure for logical access attacks detection. IEEE Access, #9: 162857-162868 [DOI: 10.1109/ACCESS.2021.3133134http://dx.doi.org/10.1109/ACCESS.2021.3133134]
Arik S Ö, Chrzanowski M, Coates A, Diamos G, Gibiansky A, Kang Y G, Li X, Miller J, Ng A, Raiman J, Sengupta S and Shoeybi M. 2017a. Deep voice: real-time neural text-to-speech//Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: JMLR.org: 195-204 [DOI: 10.5555/3305381.3305402http://dx.doi.org/10.5555/3305381.3305402]
Arik S Ö, Diamos G, Gibiansky A, Miller J, Peng K N, Ping W, Raiman J and Zhou Y Q. 2017b. Deep voice 2: multi-speaker neural text-to-speech//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 2966-2974 [DOI: 10.5555/3294996.3295056http://dx.doi.org/10.5555/3294996.3295056]
Attorresi L, Salvi D, Borrelli C, Bestagini P and Tubaro S. 2022. Combining automatic speaker verification and prosody analysis for synthetic speech detection//Pattern Recognition, Computer Vision, and Image Processing. ICPR 2022 International Workshops and Challenges. Montréal, Canada: Springer-Verlag: 247-263 [DOI: 10.1007/978-3-031-37742-6_21http://dx.doi.org/10.1007/978-3-031-37742-6_21]
Ba Z J, Wen Q, Cheng P, Wang Y W, Lin F, Lu L and Liu Z G. 2023. Transferring audio deepfake detection capability across languages//Proceedings of 2023 ACM Web Conference. Austin, USA: ACM: 2033-2044 [DOI: 10.1145/3543507.3583222http://dx.doi.org/10.1145/3543507.3583222]
Bevinamarad P R and Shirldonkar M S. 2020. Audio forgery detection techniques: present and past review//Proceedings of the 4th International Conference on Trends in Electronics and Informatics (ICOEI) (48184). Tirunelveli, India: IEEE: 613-618 [DOI: 10.1109/ICOEI48184.2020.9143014http://dx.doi.org/10.1109/ICOEI48184.2020.9143014]
Bińkowski M, Donahue J, Dieleman S, Clark A, Elsen E, Casagrande N, Cobo L C and Simonyan K. 2019. High fidelity speech synthesis with adversarial networks//Proceedings of the 8th International Conference on Learning Representations. Addis Ababa, Ethiopia: ICLR
Cceres J, Font R, Grau T and Molina J. 2021. The biometric vox system for the ASVspoof 2021 challenge//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: ISCA: 68-74 [DOI: 10.21437/ASVSPOOF.2021-11http://dx.doi.org/10.21437/ASVSPOOF.2021-11]
Cai Z X and Li M. 2022. Invertible voice conversion [EB/OL]. [2023-06-30]. http://arxiv.org/pdf/2201.10687.pdfhttp://arxiv.org/pdf/2201.10687.pdf
Chen N X, Zhang Y, Zen H G, Weiss R J, Norouzi M and Chan W. 2020a. WaveGrad: estimating gradients for waveform generation//Proceedings of the 9th International Conference on Learning Representations. Virtual Event: ICLR
Chen T X, Khoury E, Phatak K and Sivaraman G. 2021a. Pindrop Labs’ submission to the ASVspoof 2021 challenge//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: ISCA: 89-93 [DOI: 10.21437/ASVSPOOF.2021-14http://dx.doi.org/10.21437/ASVSPOOF.2021-14]
Chen T X, Kumar A, Nagarsheth P, Sivaraman G and Khoury E. 2020b. Generalization of audio deepfake detection//The Speaker and Language Recognition Workshop (Odyssey 2020). Tokyo, Japan: ISCA: 132-137 [DOI: 10.21437/Odyssey.2020-19http://dx.doi.org/10.21437/Odyssey.2020-19]
Chen X H, Zhang Y, Zhu G and Duan Z Y. 2021b. UR channel-robust synthetic speech detection system for ASVspoof 2021//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: ISCA: 75-82 [DOI: 10.21437/ASVSPOOF.2021-12http://dx.doi.org/10.21437/ASVSPOOF.2021-12]
Chen Y N, Chu M, Chang E, Liu J and Liu R S. 2003. Voice conversion with smoothed GMM and MAP adaptation//Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech 2003). Geneva, Switzerland: ISCA: 2413-2416 [DOI: 10.21437/Eurospeech.2003-664http://dx.doi.org/10.21437/Eurospeech.2003-664]
Choi S, Kwak I Y and Oh S. 2022. Overlapped frequency-distributed network: frequency-aware voice spoofing countermeasure//Proceedings of the 23rd Annual Conference of the International Speech Communication Association. Incheon, Korea (South): ISCA: 3558-3562 [DOI: 10.21437/Interspeech.2022-657http://dx.doi.org/10.21437/Interspeech.2022-657]
Chou J C and Lee H Y. 2019. One-shot voice conversion by separating speaker and content representations with instance normalization//Proceedings of Interspeech 2019, the 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA: 664-668 [DOI: 10.21437/Interspeech.2019-2663http://dx.doi.org/10.21437/Interspeech.2019-2663]
Cong J, Yang S, Xie L and Su D. 2021. Glow-WaveGAN: learning speech representations from GAN-based variational auto-encoder for high fidelity flow-based speech synthesis//Proceedings of Interspeech 2021, the 22nd Annual Conference of the International Speech Communication Association. Brno, Czechia: 2182-2186 [DOI: 10.21437/Interspeech.2021-414http://dx.doi.org/10.21437/Interspeech.2021-414]
Cohen A, Rimon I, Aflalo E and Permuter H H. 2022. A study on data augmentation in voice anti-spoofing. Speech Communication, 141: 56-67 [DOI: 10.1016/j.specom.2022.04.005http://dx.doi.org/10.1016/j.specom.2022.04.005]
Das R K. 2021. Known-unknown data augmentation strategies for detection of logical access, physical access and speech deepfake attacks: ASVspoof 2021//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: ISCA: 29-36 [DOI: 10.21437/ASVSPOOF.2021-5http://dx.doi.org/10.21437/ASVSPOOF.2021-5]
Das R K, Yang J C and Li H Z. 2021. Data augmentation with signal companding for detection of logical access attacks//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 6349-6353 [DOI: 10.1109/ICASSP39728.2021.9413501http://dx.doi.org/10.1109/ICASSP39728.2021.9413501]
Delgado H, Evans N, Kinnunen T, Lee K A, Liu X C, Nautsch A, Patino J, Sahidullah M, Todisco M, Wang X and Yamagishi J. 2021. ASVspoof 2021: automatic speaker verification spoofing and countermeasures challenge evaluation plan [EB/OL]. [2023-06-30]. https://arxiv.org/pdf/2109.00535.pdfhttps://arxiv.org/pdf/2109.00535.pdf
Dhar S, Jana N D and Das S. 2023. An adaptive-learning-based generative adversarial network for one-to-one voice conversion. IEEE Transactions on Artificial Intelligence, 4(1): 92-106 [DOI: 10.1109/TAI.2022.3149858http://dx.doi.org/10.1109/TAI.2022.3149858]
Dixit A, Kaur N and Kingra S. 2023. Review of audio deepfake detection techniques: issues and prospects. Expert Systems, 40(8): #e13322 [DOI: 10.1111/exsy.13322http://dx.doi.org/10.1111/exsy.13322]
Donahue C, McAuley J and Puckette M. 2018. Adversarial audio synthesis//Proceedings of the 7th International Conference on Learning Representations. OrleansNew, USA: ICLR
Elias I, Zen H G, Shen J, Zhang Y, Jia Y, W eiss R J and Wu Y H. 2021. Parallel tacotron: non-autoregressive and controllable TTS//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 5709-5713 [DOI: 10.1109/ICASSP39728.2021.9414718http://dx.doi.org/10.1109/ICASSP39728.2021.9414718]
Ergünay S K, Khoury E, Lazaridis A and Marcel S. 2015. On the vulnerability of speaker verification to realistic voice spoofing//Proceedings of the 7th International Conference on Biometrics Theory, Applications and Systems (BTAS). Arlington, USA: IEEE: 1-6 [DOI: 10.1109/BTAS.2015.7358783http://dx.doi.org/10.1109/BTAS.2015.7358783]
Fathan A, Alam J and Kang W H. 2022. Mel-spectrogram image-based end-to-end audio deepfake detection under channel-mismatched conditions//Proceedings of 2022 IEEE International Conference on Multimedia and Expo (ICME). Taipei, China: IEEE: 1-6 [DOI: 10.1109/ICME52920.2022.9859621http://dx.doi.org/10.1109/ICME52920.2022.9859621]
Frank J and Schönherr L. 2021. WaveFake: a data set to facilitate audio deepfake detection//Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021). [s.l.]: [s.n.]
Fu Q C, Teng Z W, White J, Powell M E and Schmidt D C. 2022. FastAudio: a learnable audio front-end for spoof speech detection//Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore: IEEE: 3693-3697 [DOI: 10.1109/ICASSP43922.2022.9746722http://dx.doi.org/10.1109/ICASSP43922.2022.9746722]
Ge W Y, Panariello M, Patino J, Todisco M and Evans N. 2021a. Partially-connected differentiable architecture search for deepfake and spoofing detection//Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 4319-4323 [DOI: 10.21437/Interspeech.2021-1187http://dx.doi.org/10.21437/Interspeech.2021-1187]
Ge W Y, Patino J, Todisco M and Evans N. 2021b. Raw differentiable architecture search for speech deepfake and spoofing detection [EB/OL]. [2023-06-30]. http://arxiv.org/pdf/2107.12212.pdfhttp://arxiv.org/pdf/2107.12212.pdf
Gomez-Alanis A, Peinado A M, Gonzalez J A and Gomez A M. 2019. A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection//Proceedings of the 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA: 1068-1072 [DOI: 10.21437/Interspeech.2019-2212http://dx.doi.org/10.21437/Interspeech.2019-2212]
Gong Y, Yang J, Huber J, MacKnight M and Poellabauer C. 2019. ReMASC: realistic replay attack corpus for voice controlled systems//Proceedings of Interspeech 2019, the 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA: 2355-2359 [DOI: 10.21437/Interspeech.2019-1541http://dx.doi.org/10.21437/Interspeech.2019-1541]
Griffin D and Lim J. 1984. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2): 236-243 [DOI: 10.1109/TASSP.1984.1164317http://dx.doi.org/10.1109/TASSP.1984.1164317]
Guo H J, Liu C R, Ishi C T and Ishiguro H. 2023. QuickVC: any-to-many voice conversion using inverse short-time fourier transform for faster conversion [EB/OL]. [2023-06-30]. https://arxiv.org/pdf/2302.08296v4.pdfhttps://arxiv.org/pdf/2302.08296v4.pdf
Gupta P, Chodingala P K and Patil H A. 2022. Energy separation based instantaneous frequency estimation from quadrature and in-phase components for replay spoof detection//Proceedings of the 30th European Signal Processing Conference (EUSIPCO). Belgrade, Serbia: IEEE: 369-373 [DOI: 10.23919/EUSIPCO55093.2022.9909533http://dx.doi.org/10.23919/EUSIPCO55093.2022.9909533]
Gupta P and Patil H A. 2022. Linear frequency residual cepstral features for replay spoof detection on ASVspoof 2019//Proceedings of the 30th European Signal Processing Conference (EUSIPCO). Belgrade, Serbia: IEEE: 349-353 [DOI: 10.23919/EUSIPCO55093.2022.9909913http://dx.doi.org/10.23919/EUSIPCO55093.2022.9909913]
Hassan F and Javed A. 2021. Voice spoofing countermeasure for synthetic speech detection//Proceedings of 2021 International Conference on Artificial Intelligence (ICAI). Islamabad, Pakistan: IEEE: 209-212 [DOI: 10.1109/ICAI52203.2021.9445238http://dx.doi.org/10.1109/ICAI52203.2021.9445238]
Helander E, Virtanen T, Nurminen J and Gabbouj M. 2010. Voice conversion using partial least squares regression. IEEE Transactions on Audio, Speech, and Language Processing, 18(5): 912-921 [DOI: 10.1109/TASL.2010.2041699http://dx.doi.org/10.1109/TASL.2010.2041699]
Hsu W N, Zhang Y, Weiss R J, Zen H G, Wu Y H, Wang Y X, Cao Y, Jia Y, Chen Z F, Shen J, Nguyen P and Pang R M. 2018. Hierarchical generative modeling for controllable speech synthesis//Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA: ICLR
Hu C L, Zhou R H and Yuan Q S. 2023. Replay speech detection based on dual-input hierarchical fusion network. Applied Sciences, 13(9): #5350 [DOI: 10.3390/app13095350http://dx.doi.org/10.3390/app13095350]
Hua G, Teoh A B J and Zhang H J. 2021. Towards end-to-end synthetic speech detection. IEEE Signal Processing Letters, 28: 1265-1269 [DOI: 10.1109/LSP.2021.3089437http://dx.doi.org/10.1109/LSP.2021.3089437]
Huang W C, Hayashi T, Watanabe S and Toda T. 2020. The sequence-to-sequence baseline for the voice conversion challenge 2020: cascading ASR and TTS//Proceedings Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020. Shanghai, China: ISCA: 160-164 [DOI: 10.21437/VCCBC.2020-24http://dx.doi.org/10.21437/VCCBC.2020-24]
Huang R J, Lam M W Y, Wang J, Su D, Yu D, Ren Y and Zhao Z. 2022. Fastdiff: a fast conditional diffusion model for high-quality speech synthesis//Proceedings of the 31st International Joint Conference on Artificial Intelligence Main Track. Vienna, Austria: IJCAI: 4157-4163 [DOI: 10.24963/ijcai.2022/577http://dx.doi.org/10.24963/ijcai.2022/577]
Hunt A J and Black A W. 1996. Unit selection in a concatenative speech synthesis system using a large speech database//Proceedings of 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings. Atlanta, USA: IEEE: 373-376 [DOI: 10.1109/ICASSP.1996.541110http://dx.doi.org/10.1109/ICASSP.1996.541110]
Ito A and Horiguchi S. 2023. Spoofing attacker also benefits from self-supervised pretrained model//Proceedings of Interspeech 2023. Dublin, Ireland: ISCA: 5346-5350 [DOI: 10.21437/Interspeech.2023-270http://dx.doi.org/10.21437/Interspeech.2023-270]
Javed A, Malik K M, Malik H and Irtaza A. 2022. Voice spoofing detector: a unified anti-spoofing framework. Expert Systems with Applications, 198: #116770 [DOI: 10.1016/j.eswa.2022.116770http://dx.doi.org/10.1016/j.eswa.2022.116770]
Jeong M, Kim H, Cheon S J, Choi B J and Kim N S. 2021. Diff-TTS: a denoising diffusion model for text-to-speech//Proceedings of Interspeech 2021. Brno, Czechia: ISCA: 3605-3609 [DOI: 10.21437/Interspeech.2021-469http://dx.doi.org/10.21437/Interspeech.2021-469]
Jiang Z Y, Zhu H C, Peng L, Ding W B and Ren Y Z. 2020. Self-supervised spoofing audio detection scheme//Proceedings of the 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 4223-4227 [DOI: 10.21437/Interspeech.2020-1760http://dx.doi.org/10.21437/Interspeech.2020-1760]
Jung J W, Heo H S, Tak H, Shim H J, Chung J S, Lee B J, Yu H J and Evans N. 2022. AASIST: audio anti-spoofing using integrated spectro-temporal graph attention networks//Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore: IEEE: 6367-6371 [DOI: 10.1109/ICASSP43922.2022.9747766http://dx.doi.org/10.1109/ICASSP43922.2022.9747766]
Jung J W, Kim S B, Shim H J, Kim J H and Yu H J. 2020. Improved RawNet with feature map scaling for text-independent speaker verification using raw waveforms//Proceedings of the 21st Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 1496-1500 [DOI: 10.21437/Interspeech.2020-1011http://dx.doi.org/10.21437/Interspeech.2020-1011]
Kalchbrenner N, Elsen E, Simonyan K, Noury S, Casagrande N, Lockhart E, Stimberg F, van den Oord A, Dieleman S and Kavukcuoglu K. 2018. Efficient neural audio synthesis//Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: PMLR: 2410-2419
Kamble M R, Sailor H B, Patil H A and Li H Z. 2020. Advances in anti-spoofing: from the perspective of ASVspoof challenges. APSIPA Transactions on Signal and Information Processing, 9(1): #21 [DOI: 10.1017/ATSIP.2019.21http://dx.doi.org/10.1017/ATSIP.2019.21]
Kameoka H, Kaneko T, Tanaka K and Hojo N. 2018. StarGAN-VC: non-parallel many-to-many voice conversion using star generative adversarial networks//2018 IEEE Spoken Language Technology Workshop (SLT). Athens, Greece: IEEE: 266-273 [DOI: 10.1109/SLT.2018.8639535http://dx.doi.org/10.1109/SLT.2018.8639535]
Kameoka H, Kaneko T, Tanaka K and Hojo N. 2019. ACVAE-VC: non-parallel voice conversion with auxiliary classifier variational autoencoder. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(9): 1432-1443 [DOI: 10.1109/TASLP.2019.2917232http://dx.doi.org/10.1109/TASLP.2019.2917232]
Kameoka H, Tanaka K, Kwaśny D, Kaneko T and Hojo N. 2020. ConvS2S-VC: fully convolutional sequence-to-sequence voice conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28: 1849-1863 [DOI: 10.1109/TASLP.2020.3001456http://dx.doi.org/10.1109/TASLP.2020.3001456]
Kaneko T and Kameoka H. 2018. CycleGAN-VC: non-parallel voice conversion using Cycle-consistent adversarial networks//Proceedings of the 26th European Signal Processing Conference (EUSIPCO). Roma, Italy: IEEE: 2100-2104 [DOI: 10.23919/EUSIPCO.2018.8553236http://dx.doi.org/10.23919/EUSIPCO.2018.8553236]
Kaneko T, Kameoka H, Tanaka K and Hojo N. 2019a. CycleGAN-VC2: improved CycleGan-based non-parallel voice conversion//Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE: 6820-6824 [DOI: 10.1109/ICASSP.2019.8682897http://dx.doi.org/10.1109/ICASSP.2019.8682897]
Kaneko T, Kameoka H, Tanaka K and Hojo N. 2019b. StarGAN-VC2: rethinking conditional methods for StarGAN-based voice conversion//Proceedings of the 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA: 679-683 [DOI: 10.21437/Interspeech.2019-2236http://dx.doi.org/10.21437/Interspeech.2019-2236]
Kaneko T, Kameoka H, Tanaka K and Hojo N. 2020. CycleGAN-VC3: examining and improving CycleGan-VCs for mel-spectrogram conversion//Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 2017-2021
Kang W H, Alam J and Fathan A. 2021. CRIM’s system description for the ASVspoof2021 challenge//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: ISCA: 100-106 [DOI: 10.21437/ASVSPOOF.2021-16http://dx.doi.org/10.21437/ASVSPOOF.2021-16]
Kawahara H. 2006. STRAIGHT, exploitation of the other aspect of VOCODER: perceptually isomorphic decomposition of speech sounds. Acoustical Science and Technology, 27(6): 349-353 [DOI: 10.1250/ast.27.349http://dx.doi.org/10.1250/ast.27.349]
Khanjani Z, Watson G and Janeja V P. 2023. Audio deepfakes: a survey. Frontiers in Big Data, 5: #1001063 [DOI: 10.3389/fdata.2022.1001063http://dx.doi.org/10.3389/fdata.2022.1001063]
Kim J, Kim S, Kong J and Yoon S. 2020. Glow-TTS: a generative flow for text-to-speech via monotonic alignment search//Proceedings of the 33rd Advances in Neural Information Processing Systems. 33: 8067-8077
Kim J, Kong J and Son J. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech//Proceedings of the 38th International Conference on Machine Learning. [s.l.]: PMLR: 5530-5540
Kingma D P and Dhariwal P. 2018. Glow: generative flow with invertible 1 × 1 convolutions//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada: Curran Associates Inc.: 10236-10245 [DOI: 10.5555/3327546.3327685http://dx.doi.org/10.5555/3327546.3327685]
Kinnunen T, Lee K A, Delgado H, Evans N W D, Todisco M, Sahidullah M, Yamagishi J and Reynolds D A. 2019. t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification//2018 Speaker and Language Recognition Workshop, Odyssey 2018. Les Sables d’Olonne, France: 312-319 [DOI: 10.21437/Odyssey.2018-44http://dx.doi.org/10.21437/Odyssey.2018-44]
Kinnunen T, Sahidullah M, Delgado H, Todisco M, Evans N W D, Yamagishi J and Lee K A. 2017. The ASVspoof 2017 challenge: assessing the limits of replay spoofing attack detection//Proceedings of the 18th Interspeech Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA: 2-6 [DOI: 10.21437/Interspeech.2017-1111http://dx.doi.org/10.21437/Interspeech.2017-1111]
Kong J, Kim J and Bae J. 2020. Hifi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis//Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 17022-17033 [DOI: 10.5555/3495724.3497152http://dx.doi.org/10.5555/3495724.3497152]
Kong Z F, Ping W, Huang J J, Zhao K X and Catanzaro B. 2021. DiffWave: a versatile diffusion model for audio synthesis//Proceedings of the 9th International Conference on Learning Representations. Virtual Event: ICLR
Kwak I Y, Kwag S, Lee J, Huh J H, Lee C H, Jeon Y, Hwang J and Yoon J W. 2021. ResMax: detecting voice spoofing attacks with residual network and max feature map//Proceedings of the 25th International Conference on Pattern Recognition (ICPR). Milan, Italy: IEEE: 4837-4844 [DOI: 10.1109/ICPR48806.2021.9412165http://dx.doi.org/10.1109/ICPR48806.2021.9412165]
Le M, Vyas A, Shi B W, Karrer B, Sari L, Moritz R, Williamson M, Manohar V, Adi Y, Mahadeokar J and Hsu W N. 2023. Voicebox: text-guided multilingual universal speech generation at scale [EB/OL]. [2023-08-28]. http://arxiv.org/pdf/2306.15687.pdfhttp://arxiv.org/pdf/2306.15687.pdf
Lee S G, Ping W, Ginsburg B, Catanzaro B and Yoon S. 2023. BigVGAN: a universal neural vocoder with large-scale training//Proceedings of the 11th International Conference on Learning Representations. Kigali, Rwanda: ICLR
Lei Y, Huo X, Jiao Y Z and Li Y K. 2021. Deep metric learning for replay attack detection//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: ISCA: 42-46 [DOI: 10.21437/ASVSPOOF.2021-7]
Lei Y, Yang S, Cong J, Xie L and Su D. 2022. Glow-WaveGAN 2: high-quality zero-shot text-to-speech synthesis and any-to-any voice conversion//Proceedings of the 23rd Interspeech Annual Conference of the International Speech Communication Association. Incheon, Korea (South): ISCA: 2563-2567 [DOI: 10.21437/Interspeech.2022-684]
Lei Z C, Yang Y G, Liu C H and Ye J H. 2020. Siamese convolutional neural network using Gaussian probability feature for spoofing speech detection//Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 1116-1120 [DOI: 10.21437/Interspeech.2020-2723]
Li J L, Wang H X, He P S, Abdullahi S M and Li B. 2022. Long-term variable Q transform: a novel time-frequency transform algorithm for synthetic speech detection. Digital Signal Processing, 120: #103256 [DOI: 10.1016/j.dsp.2021.103256]
Li N H, Liu S J, Liu Y Q, Zhao S and Liu M. 2019. Neural speech synthesis with Transformer network//Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Honolulu, USA: AAAI Press: 6706-6713 [DOI: 10.1609/aaai.v33i01.33016706]
Li T L, Liu Y C, Hu C X and Zhao H. 2021a. CVC: contrastive learning for non-parallel voice conversion//Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 1324-1328 [DOI: 10.21437/Interspeech.2021-137]
Li X, Li N, Weng C, Liu X Y, Su D, Yu D and Meng H L. 2021b. Replay and synthetic speech detection with Res2Net architecture//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 6354-6358 [DOI: 10.1109/ICASSP39728.2021.9413828]
Li X L, Yu N H, Zhang X P, Zhang W M, Li B, Lu W, Wang W and Liu X L. 2021. Overview of digital media forensics technology. Journal of Image and Graphics, 26(6): 1216-1226 [DOI: 10.11834/jig.210081]
Lian Z, Wen Z Q, Zhou X Y, Pu S B, Zhang S K and Tao J H. 2020. ARVC: an auto-regressive voice conversion system without parallel training data//Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 4706-4710 [DOI: 10.21437/Interspeech.2020-1715]
Lin J H, Lin Y Y, Chien C M and Lee H Y. 2021b. S2VC: a framework for any-to-any voice conversion with self-supervised pretrained representations//Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 836-840 [DOI: 10.21437/Interspeech.2021-1356]
Lin Y Y, Chien C M, Lin J H, Lee H Y and Lee L S. 2021a. FragmentVC: any-to-any voice conversion by end-to-end extracting and fusing fine-grained voice fragments with attention//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 5939-5943 [DOI: 10.1109/ICASSP39728.2021.9413699]
Liu R, Zhang J H, Gao G L and Li H Z. 2023a. Betray oneself: a novel audio deepfake detection model via mono-to-stereo conversion [EB/OL]. [2023-06-30]. https://arxiv.org/pdf/2305.16353v1.pdf
Liu X C, Sahidullah M, Lee K A and Kinnunen T. 2023b. Speaker-aware anti-spoofing//Proceedings of Interspeech 2023, the Annual Conference of the International Speech Communication Association. Dublin, Ireland: ISCA: 2498-2502 [DOI: 10.21437/Interspeech.2023-1323]
Liu Z J, Guo Y W and Yu K. 2023c. DiffVoice: text-to-speech with latent diffusion//Proceedings of 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Rhodes Island, Greece: IEEE: 1-5 [DOI: 10.1109/ICASSP49357.2023.10095100]
Luo R Q, Tan X, Wang R, Qin T, Li J Z, Zhao S, Chen E H and Liu T Y. 2021. LightSpeech: lightweight and fast text to speech with neural architecture search//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 5699-5703 [DOI: 10.1109/ICASSP39728.2021.9414403]
Ma H X, Yi J Y, Tao J H, Bai Y, Tian Z K and Wang C L. 2021a. Continual learning for fake audio detection//Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 886-890 [DOI: 10.21437/Interspeech.2021-794]
Ma H X, Yi J Y, Wang C L, Yan X R, Tao J H, Wang T, Wang S M, Xu L and Fu R B. 2022. FAD: a Chinese dataset for fake audio detection//Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022). [s.l.]: Zenodo: #6635521 [DOI: 10.5281/zenodo.6635521]
Ma K J, Feng Y F, Chen B J and Zhao G Y. 2023a. End-to-end dual-branch network towards synthetic speech detection. IEEE Signal Processing Letters, 30: 359-363 [DOI: 10.1109/LSP.2023.3262419]
Ma Y X, Ren Z Z and Xu S G. 2021b. RW-ResNet: a novel speech anti-spoofing model using raw waveform//Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 4144-4148 [DOI: 10.21437/Interspeech.2021-438]
Ma X Y, Zhang S S, Huang S, Gao J, Hu Y and He L. 2023b. How to boost anti-spoofing with X-vectors//Proceedings of 2022 IEEE Spoken Language Technology Workshop (SLT). Doha, Qatar: IEEE: 593-598 [DOI: 10.1109/SLT54892.2023.10022504]
Mandalapu H, Ramachandra R and Busch C. 2021. Smartphone audio replay attacks dataset//Proceedings of 2021 IEEE International Workshop on Biometrics and Forensics (IWBF). Rome, Italy: IEEE: 1-6 [DOI: 10.1109/IWBF50991.2021.9465096]
Martín-Doñas J M and Álvarez A. 2022. The Vicomtech audio deepfake detection system based on Wav2vec2 for the 2022 ADD Challenge//Proceedings of ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore: IEEE: 9241-9245 [DOI: 10.1109/ICASSP43922.2022.9747768]
Mittal A and Dua M. 2022. Automatic speaker verification systems and spoof detection techniques: review and analysis. International Journal of Speech Technology, 25(1): 105-134 [DOI: 10.1007/s10772-021-09876-2]
Mohammadi S H. 2015. Reducing one-to-many problem in voice conversion by equalizing the formant locations using dynamic frequency warping [EB/OL]. [2023-08-28]. http://arxiv.org/pdf/1510.04205.pdf
Morise M, Yokomori F and Ozawa K. 2016. WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7): 1877-1884 [DOI: 10.1587/transinf.2015EDP7457]
Müller N, Dieckmann F, Czempin P, Canals R, Böttinger K and Williams J. 2021. Speech is silver, silence is golden: what do ASVspoof-trained models really learn?//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: ISCA: 55-60 [DOI: 10.21437/ASVSPOOF.2021-9]
Müller N, Czempin P, Dieckmann F, Froghyar A and Böttinger K. 2022. Does audio deepfake detection generalize?//Proceedings of the 23rd Interspeech Annual Conference of the International Speech Communication Association. Incheon, Korea (South): ISCA: 2783-2787 [DOI: 10.21437/Interspeech.2022-108]
Nguyen B and Cardinaux F. 2022. NVC-Net: end-to-end adversarial voice conversion//Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore: IEEE: 7012-7016 [DOI: 10.1109/ICASSP43922.2022.9747020]
OpenAI. 2023. GPT-4 technical report [EB/OL]. [2023-08-28]. http://arxiv.org/pdf/2303.08774.pdf
Park D S, Chan W, Zhang Y, Chiu C C, Zoph B, Cubuk E D and Le Q V. 2019. SpecAugment: a simple data augmentation method for automatic speech recognition//Proceedings of the 20th Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA: 2613-2617 [DOI: 10.21437/Interspeech.2019-2680]
Park S W, Kim D Y and Joe M C. 2020. Cotatron: transcription-guided speech encoder for any-to-many voice conversion without parallel data//Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 4696-4700 [DOI: 10.21437/Interspeech.2020-1542]
Peng K N, Ping W, Song Z and Zhao K X. 2020. Non-autoregressive neural text-to-speech//Proceedings of the 37th International Conference on Machine Learning, ICML 2020. [s.l.]: PMLR: 7586-7598
Ping W, Peng K N, Gibiansky A, Arik S Ö, Kannan A, Narang S, Raiman J and Miller J. 2017. Deep Voice 3: scaling text-to-speech with convolutional sequence learning//Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: ICLR
Prenger R, Valle R and Catanzaro B. 2019. WaveGlow: a flow-based generative network for speech synthesis//Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE: 3617-3621 [DOI: 10.1109/ICASSP.2019.8683143]
Qian K Z, Zhang Y, Chang S Y, Yang X S and Hasegawa-Johnson M. 2019. AutoVC: zero-shot voice style transfer with only autoencoder loss//Proceedings of the 36th International Conference on Machine Learning. Long Beach, USA: PMLR: 5210-5219
Qian Y, Fan Y C, Hu W P and Soong F K. 2014. On the training aspects of deep neural network (DNN) for parametric TTS synthesis//Proceedings of 2014 IEEE International Conference on Acoustics, Speech and Signal Processing. Florence, Italy: IEEE: 3829-3833 [DOI: 10.1109/ICASSP.2014.6854318]
Ranjan R, Vatsa M and Singh R. 2022. STATNet: spectral and temporal features based multi-task network for audio spoofing detection//Proceedings of 2022 IEEE International Joint Conference on Biometrics (IJCB). Abu Dhabi, United Arab Emirates: IEEE: 1-9 [DOI: 10.1109/IJCB54206.2022.10007949]
Ranjan R, Vatsa M and Singh R. 2023. Uncovering the deceptions: an analysis on audio spoofing detection and future prospects//Proceedings of the 32nd International Joint Conference on Artificial Intelligence, IJCAI 2023. Macao, China: IJCAI: 6750-6758 [DOI: 10.24963/ijcai.2023/756]
Reimao R and Tzerpos V. 2019. FoR: a dataset for synthetic speech detection//Proceedings of 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD). Timisoara, Romania: IEEE: 1-10 [DOI: 10.1109/SPED.2019.8906599]
Ren Y, Hu C X, Tan X, Qin T, Zhao S, Zhao Z and Liu T Y. 2022. FastSpeech 2: fast and high-quality end-to-end text to speech//Proceedings of the 9th International Conference on Learning Representations. Virtual Event: ICLR
Ren Y, Ruan Y J, Tan X, Qin T, Zhao S, Zhao Z and Liu T Y. 2019. FastSpeech: fast, robust and controllable text to speech//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 3171-3180 [DOI: 10.5555/3454287.3454572]
Ren Y Z, Liu C Y, Liu W Y and Wang L N. 2021. A survey on speech forgery and detection. Journal of Signal Processing, 37(12): 2412-2439 [DOI: 10.16798/j.issn.1003-0530.2021.12.011]
Rostami A M, Homayounpour M M and Nickabadi A. 2021. Efficient attention branch network with combined loss function for automatic speaker verification spoof detection. Circuits, Systems, and Signal Processing, 42(7): 4252-4270 [DOI: 10.1007/s00034-023-02314-5]
Sahidullah M, Delgado H, Todisco M, Kinnunen T, Evans N, Yamagishi J and Lee K A. 2019. Introduction to voice presentation attack detection and recent advances//Marcel S, Nixon M S, Fierrez J and Evans N, eds. Handbook of Biometric Anti-Spoofing. Cham, Switzerland: Springer: 321-361 [DOI: 10.1007/978-3-319-92627-8_15]
Saito D, Yamamoto K, Minematsu N and Hirose K. 2011. One-to-many voice conversion based on tensor representation of speaker space//Proceedings of the 12th Annual Conference of the International Speech Communication Association. Florence, Italy: ISCA: 653-656 [DOI: 10.21437/Interspeech.2011-268]
Serrà J, Pascual S and Segura C. 2019. Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion//Proceedings of the 33rd Conference on Neural Information Processing Systems. Vancouver, Canada: NIPS: 6790-6800
Shen J, Pang R M, Weiss R J, Schuster M, Jaitly N, Yang Z H, Chen Z F, Zhang Y, Wang Y X, Skerrv-Ryan R, Saurous R A, Agiomvrgiannakis Y and Wu Y H. 2018. Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions//Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, Canada: IEEE: 4779-4783 [DOI: 10.1109/ICASSP.2018.8461368]
Shim H J, Heo H S, Jung J W and Yu H J. 2019. Self-supervised pre-training with acoustic configurations for replay spoofing detection//Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 1091-1095 [DOI: 10.21437/Interspeech.2020-1345]
Song E, Yamamoto R, Hwang M J, Kim J S, Kwon O and Kim J M. 2021. Improved parallel WaveGAN vocoder with perceptually weighted spectrogram loss//Proceedings of 2021 IEEE Spoken Language Technology Workshop (SLT). Shenzhen, China: IEEE: 470-476 [DOI: 10.1109/SLT48900.2021.9383549]
Stylianou Y. 2001. Applying the harmonic plus noise model in concatenative speech synthesis. IEEE Transactions on Speech and Audio Processing, 9(1): 21-29 [DOI: 10.1109/89.890068]
Su Z P, Li M K, Zhang G F, Wu Q F, Li M Q, Zhang W M and Yao X. 2023. Robust audio copy-move forgery detection using constant Q spectral sketches and GA-SVM. IEEE Transactions on Dependable and Secure Computing, 20(5): 4016-4031 [DOI: 10.1109/TDSC.2022.3215280]
Sun C Z, Jia S, Hou S W and Lü S W. 2023. AI-synthesized voice detection using neural vocoder artifacts//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Vancouver, Canada: IEEE: 904-912 [DOI: 10.1109/CVPRW59228.2023.00097]
Tak H, Jung J W, Patino J, Kamble M, Todisco M and Evans N. 2021a. End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: ISCA: 1-8 [DOI: 10.21437/ASVSPOOF.2021-1]
Tak H, Kamble M, Patino J, Todisco M and Evans N. 2022a. RawBoost: a raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing//Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore: IEEE: 6382-6386 [DOI: 10.1109/ICASSP43922.2022.9746213]
Tak H, Patino J, Todisco M, Nautsch A, Evans N and Larcher A. 2021b. End-to-end anti-spoofing with RawNet2//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 6369-6373 [DOI: 10.1109/ICASSP39728.2021.9414234]
Tak H, Todisco M, Wang X, Jung J W, Yamagishi J and Evans N. 2022b. Automatic speaker verification spoofing and deepfake detection using Wav2vec 2.0 and data augmentation//Odyssey 2022, The Speaker and Language Recognition Workshop (Odyssey 2022). Beijing, China: ISCA: 112-119 [DOI: 10.21437/Odyssey.2022-16]
Tan C B, Hijazi M H A, Khamis N, Nohuddin P N E B, Zainol Z, Coenen F and Gani A. 2021. A survey on presentation attack detection for automatic speaker verification systems: state-of-the-art, taxonomy, issues and future direction. Multimedia Tools and Applications, 80(21): 32725-32762 [DOI: 10.1007/s11042-021-11235-x]
Tang H Z, Zhang X L, Wang J Z, Cheng N, Zeng Z, Xiao E and Xiao J. 2021. TGAVC: improving autoencoder voice conversion with text-guided and adversarial training//Proceedings of 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Cartagena, Colombia: IEEE: 938-945 [DOI: 10.1109/ASRU51503.2021.9688088]
Tao J H, Fu R B, Yi J Y, Wang C L and Wang T. 2020. Development and challenge of speech forgery and detection. Journal of Cyber Security, 5(2): 28-38 [DOI: 10.19363/J.cnki.cn10-1380/tn.2020.02.03]
Teng Z W, Fu Q C, White J, Powell M E and Schmidt D C. 2022. SA-SASV: an end-to-end spoof-aggregated spoofing-aware speaker verification system//Proceedings of Interspeech 2022, the 23rd Annual Conference of the International Speech Communication Association. Incheon, Korea (South): ISCA: 4391-4395 [DOI: 10.21437/Interspeech.2022-11029]
Todisco M, Wang X, Vestman V, Sahidullah M, Delgado H, Nautsch A, Yamagishi J, Evans N, Kinnunen T H and Lee K A. 2019. ASVspoof 2019: future horizons in spoofed and fake audio detection//Proceedings of the 20th Interspeech Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA: 1008-1012 [DOI: 10.21437/Interspeech.2019-2249]
Tokuda K, Nankaku Y, Toda T, Zen H G, Yamagishi J and Oura K. 2013. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5): 1234-1252 [DOI: 10.1109/JPROC.2013.2251852]
Tokuda K, Yoshimura T, Masuko T, Kobayashi T and Kitamura T. 2000. Speech parameter generation algorithms for HMM-based speech synthesis//Proceedings of 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Istanbul, Turkey: IEEE: 1315-1318 [DOI: 10.1109/ICASSP.2000.861820]
Tomilov A, Svishchev A, Volkova M, Chirkovskiy A, Kondratev A and Lavrentyeva G. 2021. STC antispoofing systems for the ASVspoof2021 challenge//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: ISCA: 61-67 [DOI: 10.21437/ASVSPOOF.2021-10]
van den Oord A, Dieleman S, Zen H G, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A W and Kavukcuoglu K. 2016. WaveNet: a generative model for raw audio//The 9th ISCA Speech Synthesis Workshop. Sunnyvale, USA: ISCA: #125
van den Oord A, Li Y Z, Babuschkin I, Simonyan K, Vinyals O, Kavukcuoglu K, van den Driessche G, Lockhart E, Cobo L C, Stimberg F, Casagrande N, Grewe D, Noury S, Dieleman S, Elsen E, Kalchbrenner N, Zen H G, Graves A, King H L, Walters T, Belov D and Hassabis D. 2018. Parallel WaveNet: fast high-fidelity speech synthesis//Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden: PMLR: 3918-3926
Wang C L, Yi J Y, Tao J H, Sun H Y, Chen X, Tian Z K, Ma H X, Fan C H and Fu R B. 2022a. Fully automated end-to-end fake audio detection//Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia. Lisboa, Portugal: Association for Computing Machinery: 27-33 [DOI: 10.1145/3552466.3556530]
Wang C L, Yi J Y, Tao J H, Zhang C Y, Zhang S and Chen X. 2023a. Detection of cross-dataset fake audio based on prosodic and pronunciation features [EB/OL]. [2023-06-30]. http://arxiv.org/pdf/2305.13700.pdf
Wang C L, Yi J Y, Tao J H, Zhang C Y, Zhang S, Fu R B and Chen X. 2023b. TO-Rawnet: improving Rawnet with TCN and orthogonal regularization for fake audio detection [EB/OL]. [2023-06-30]. http://arxiv.org/pdf/2305.13701.pdf
Wang L, Yeoh B and Ng J W. 2022b. Synthetic voice detection and audio splicing detection using SE-Res2Net-Conformer architecture//The 13th International Symposium on Chinese Spoken Language Processing (ISCSLP). Singapore, Singapore: IEEE: 115-119 [DOI: 10.1109/ISCSLP57327.2022.10037999]
Wang Q Q, Zhang X L, Wang J Z, Cheng N and Xiao J. 2022c. DRVC: a framework of any-to-any voice conversion with self-supervised learning//Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore: IEEE: 3184-3188 [DOI: 10.1109/ICASSP43922.2022.9747434]
Wang X M, Qin X Y, Zhu T L, Wang C, Zhang S L and Li M. 2021. The DKU-CMRI system for the ASVspoof 2021 challenge: vocoder based replay channel response estimation//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: ISCA: 16-21 [DOI: 10.21437/ASVSPOOF.2021-3]
Wang Y X, Skerry-Ryan R J, Stanton D, Wu Y H, Weiss R J, Jaitly N, Yang Z H, Xiao Y, Chen Z F, Bengio S, Le Q V, Agiomyrgiannakis Y, Clark R and Saurous R A. 2017. Tacotron: towards end-to-end speech synthesis//Proceedings of the 18th Interspeech Annual Conference of the International Speech Communication Association. Stockholm, Sweden: ISCA: 4006-4010 [DOI: 10.21437/Interspeech.2017-1452]
Wang Z Y and Hansen J H L. 2022. Audio anti-spoofing using simple attention module and joint optimization based on additive angular margin loss and meta-learning//Proceedings of the 23rd Interspeech Annual Conference of the International Speech Communication Association. Incheon, Korea (South): ISCA: 376-380 [DOI: 10.21437/Interspeech.2022-904]
Weiss R J, Skerry-Ryan R J, Battenberg E, Mariooryad S and Kingma D P. 2021. Wave-Tacotron: spectrogram-free end-to-end text-to-speech synthesis//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 5679-5683 [DOI: 10.1109/ICASSP39728.2021.9413851]
Wu H B, Liu A T and Lee H Y. 2020a. Defense for black-box attacks on anti-spoofing models by self-supervised learning//Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 3780-3784 [DOI: 10.21437/Interspeech.2020-2026]
Wu H B, Liu S X, Meng H L and Lee H Y. 2020b. Defense against adversarial attacks on spoofing countermeasures of ASV//Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE: 6564-6568 [DOI: 10.1109/ICASSP40776.2020.9053643]
Wu Z Z, Das R K, Yang J C and Li H Z. 2020c. Light convolutional neural network with feature genuinization for detection of synthetic speech attacks//Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 1101-1105 [DOI: 10.21437/Interspeech.2020-1810]
Wu Z Z, Kinnunen T, Evans N, Yamagishi J, Hanilçi C, Sahidullah M and Sizov A. 2015. ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge//Proceedings of the 16th Annual Conference of the International Speech Communication Association. Dresden, Germany: ISCA: 2037-2041 [DOI: 10.21437/Interspeech.2015-462]
Xu X X, Shi L, Chen X Q, Lin P Y, Lian J, Chen J H, Zhang Z H and Hancock E R. 2023. Any-to-any voice conversion with multi-layer speaker adaptation and content supervision. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31: 3431-3445 [DOI: 10.1109/TASLP.2023.3306716]
Xue J, Fan C H, Yi J Y, Wang C L, Wen Z Q, Zhang D and Lü Z. 2023. Learning from yourself: a self-distillation method for fake speech detection//Proceedings of 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Rhodes Island, Greece: IEEE: 1-5 [DOI: 10.1109/ICASSP49357.2023.10096837]
Yadav A K S, Bhagtani K, Xiang Z Y, Bestagini P, Tubaro S and Delp E J. 2023. DSVAE: interpretable disentangled representation for synthetic speech detection [EB/OL]. [2023-06-30]. http://arxiv.org/pdf/2304.03323.pdf
Yamagishi J, Todisco M, Sahidullah M, Delgado H, Wang X, Evans N, Kinnunen T, Lee K A, Vestman V and Nautsch A. 2019. ASVspoof 2019: automatic speaker verification spoofing and countermeasures challenge evaluation plan [EB/OL]. [2023-10-20]. https://www.asvspoof.org/asvspoof2019/asvspoof2019_evaluation_plan.pdf
Yamamoto R, Song E, Hwang M J and Kim J M. 2021. Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, Canada: IEEE: 6039-6043 [DOI: 10.1109/ICASSP39728.2021.9413369]
Yamamoto R, Song E and Kim J M. 2020. Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram//Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona, Spain: IEEE: 6199-6203 [DOI: 10.1109/ICASSP40776.2020.9053795]
Yan X R, Yi J Y, Tao J H, Wang C L, Zhang C Y and Fu R B. 2023. System fingerprint recognition for deepfake audio: an initial dataset and investigation [EB/OL]. [2023-08-28]. http://arxiv.org/pdf/2208.10489.pdf
Yang J, Lee J, Kim Y I, Cho H Y and Kim I. 2020. VocGAN: a high-fidelity real-time vocoder with a hierarchically-nested adversarial network//Proceedings of the 21st Interspeech Annual Conference of the International Speech Communication Association. Shanghai, China: ISCA: 200-204 [DOI: 10.21437/Interspeech.2020-1238]
Yang J C and Das R K. 2020. Long-term high frequency features for synthetic speech detection. Digital Signal Processing, 97: #102622 [DOI: 10.1016/j.dsp.2019.102622]
Yang S, Qiao K, Chen J, Wang L Y and Yan B. 2022. Overview on speech synthesis, forgery and detection technology. Computer Systems and Applications, 31(7): 12-22 [DOI: 10.15888/j.cnki.csa.008641]
Yi J Y, Bai Y, Tao J H, Ma H X, Tian Z K, Wang C L, Wang T and Fu R B. 2021. Half-Truth: a partially fake audio detection dataset//Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 1654-1658 [DOI: 10.21437/Interspeech.2021-930]
Yi J Y, Fu R B, Tao J H, Nie S, Ma H X, Wang C L, Wang T, Tian Z K, Bai Y, Fan C H, Liang S, Wang S M, Zhang S, Yan X R, Xu L, Wen Z Q and Li H Z. 2022a. ADD 2022: the first audio deep synthesis detection challenge//Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Singapore, Singapore: IEEE: 9216-9220 [DOI: 10.1109/ICASSP43922.2022.9746939]
Yi J Y, Tao J H, Fu R B, Yan X R, Wang C L, Wang T, Zhang C Y, Zhang X H, Zhao Y, Ren Y, Xu L, Zhou J Z, Gu H, Wen Z Q, Liang S, Lian Z, Nie S and Li H Z. 2023. ADD 2023: the second audio deepfake detection challenge [EB/OL]. [2023-08-28]. http://arxiv.org/pdf/2305.13774.pdf
Yi J Y, Wang C L, Tao J H, Tian Z K, Fan C H, Ma H X and Fu R B. 2022b. SceneFake: an initial dataset and benchmarks for scene fake audio detection [EB/OL]. [2023-06-30]. https://arxiv.org/pdf/2211.06073v1.pdf
Yoon S and Yu H J. 2021. Multiple-point input and time-inverted speech signal for the ASVspoof 2021 Challenge//Proceedings of 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: ISCA: 37-41 [DOI: 10.21437/ASVSPOOF.2021-6]
Zeinali H, Stafylakis T, Athanasopoulou G, Rohdin J, Gkinis I, Burget L and Cernocký J H. 2019. Detecting spoofing attacks using VGG and SincNet: BUT-omilia submission to ASVspoof 2019 challenge//Proceedings of the 20th Interspeech Annual Conference of the International Speech Communication Association. Graz, Austria: ISCA: 1073-1077 [DOI: 10.21437/Interspeech.2019-2892]
Zen H G, Senior A and Schuster M. 2013. Statistical parametric speech synthesis using deep neural networks//Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE: 7962-7966 [DOI: 10.1109/ICASSP.2013.6639215]
Zen H G, Tokuda K and Black A W. 2009. Statistical parametric speech synthesis. Speech Communication, 51(11): 1039-1064 [DOI: 10.1016/j.specom.2009.04.004]
Zhang D, Li S M, Zhang X, Zhan J, Wang P Y, Zhou Y Q and Qiu X P. 2023a. SpeechGPT: empowering large language models with intrinsic cross-modal conversational abilities//Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore, Singapore: Association for Computational Linguistics: 15757-15773 [DOI: 10.18653/v1/2023.findings-emnlp.1055]
Zhang H, Yuan T, Chen J K, Li X T, Zheng R J, Huang Y X, Chen X J, Gong E L, Chen Z Y, Hu X G, Yu D H, Ma Y J and Huang L. 2022a. PaddleSpeech: an easy-to-use all-in-one speech toolkit//Proceedings of 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations. Washington, USA: Association for Computational Linguistics: 114-123 [DOI: 10.18653/v1/2022.naacl-demo.12]
Zhang L, Wang X, Cooper E, Evans N and Yamagishi J. 2023b. The PartialSpoof database and countermeasures for the detection of short fake speech segments embedded in an utterance. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31: 813-825 [DOI: 10.1109/TASLP.2022.3233236]
Zhang L, Wang X, Cooper E and Yamagishi J. 2021a. Multi-task learning in utterance-level and segmental-level spoof detection//2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [s.l.]: NII Yamagishi Laboratory
Zhang X W, Li J K, Sun M and Zheng L L. 2020. Speech anti-spoofing: the state of the art and prospects. Journal of Data Acquisition and Processing, 35(5): 807-823 [DOI: 10.16337/j.1004-9037.2020.05.002]
Zhang Y, Jiang F and Duan Z Y. 2021b. One-class learning towards synthetic voice spoofing detection. IEEE Signal Processing Letters, 28: 937-941 [DOI: 10.1109/LSP.2021.3076358]
Zhang Y, Jiang F, Zhu G, Chen X H and Duan Z Y. 2023c. Generalizing voice presentation attack detection to unseen synthetic attacks and channel variation//Marcel S, Fierrez J and Evans N, eds. Handbook of Biometric Anti-Spoofing: Presentation Attack Detection and Vulnerability Assessment. Singapore, Singapore: Springer Nature: 421-443 [DOI: 10.1007/978-981-19-5288-3_15]
Zhang Y, Zhu G and Duan Z Y. 2022b. A probabilistic fusion framework for spoofing aware speaker verification//Odyssey 2022: The Speaker and Language Recognition Workshop. Beijing, China: ISCA: 77-84 [DOI: 10.21437/Odyssey.2022-11]
Zhang Y, Zhu G, Jiang F and Duan Z Y. 2021c. An empirical study on channel effects for synthetic voice spoofing countermeasure systems//Proceedings of the 22nd Interspeech Annual Conference of the International Speech Communication Association. Brno, Czechia: ISCA: 4309-4313 [DOI: 10.21437/Interspeech.2021-1820]
Zhang Y J, Pan S F, He L and Ling Z H. 2019. Learning latent representations for style control and transfer in end-to-end speech synthesis//Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE: 6945-6949 [DOI: 10.1109/ICASSP.2019.8683623]
Zhang Z Y, Gu Y W, Yi X W and Zhao X F. 2021d. FMFCC-A: a challenging mandarin dataset for synthetic speech detection//Digital Forensics and Watermarking-20th International Workshop, IWDW 2021. Beijing, China: Springer-Verlag: 117-131 [DOI: 10.1007/978-3-030-95398-0_9]
Zhao Y, Yi J Y, Tao J H, Wang C L, Zhang C Y, Wang T and Dong Y F. 2022. EmoFake: an initial dataset for emotion fake audio detection [EB/OL]. [2023-06-30]. http://arxiv.org/pdf/2211.05363.pdf