An effective speaker voice transformation technique using temporal decomposition of speech

  • Phùng Trung Nghĩa, University of Information and Communication Technology, Thai Nguyen University


Voice transformation is an important problem in speech synthesis when multiple output voices must be synthesized without rebuilding the synthesis system. Speech transformed by the conventional method based on the Gaussian Mixture Model (GMM) is of limited quality because of the over-smoothing inherent in GMM. A number of methods have therefore been proposed to overcome this disadvantage. Among them, Hidden Markov Model Trajectory Tiling (HTT) and Temporal Decomposition-GMM (TD-GMM) improve the effectiveness of voice transformation, but they still have drawbacks. In this paper, a voice transformation method using modified restricted temporal decomposition (MRTD) is proposed. Experimental results on Vietnamese and English corpora confirm the effectiveness of the proposed method compared with HTT and TD-GMM.


