DaNangVMD: Vietnamese Speech Mispronunciation Detection



Automatic Speech Recognition (ASR), which converts human speech into
readable text, has grown rapidly over the past decade. However, Vietnamese
speech recognition faces critical challenges, including frequent
mispronunciations and wide variation in Vietnamese speech. In this work, we
address the difficult problem of Mispronunciation Detection (MD) in the
Vietnamese language. As a tonal language, Vietnamese relies not only on
consonants and vowels but also on variations in pitch, or tone, during
pronunciation. In this paper, we propose the DaNangVMD model for detecting
mispronunciations in Vietnamese speech from the audio signal and the
canonical transcript. By leveraging a multi-head attention-based multimodal
representation built from the embeddings of a phonetic encoder and a
linguistic encoder, DaNangVMD aims to provide a robust solution for accurate
mispronunciation detection and diagnosis. In extensive evaluations, the
proposed DaNangVMD outperforms the PAPL baseline model by 15% in F1 score
and 13% in accuracy.
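The multimodal fusion described above can be sketched as cross-attention in which phonetic (acoustic) embeddings attend over linguistic (canonical-transcript) embeddings. This is an illustrative sketch only: the embedding dimension, head count, the query/key role assignment, and the random projection weights are assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(phonetic, linguistic, num_heads, rng):
    """Fuse phonetic (queries) and linguistic (keys/values) embeddings.

    phonetic:   (T_p, d) frame-level acoustic embeddings
    linguistic: (T_l, d) canonical-transcript embeddings
    Returns a (T_p, d) multimodal representation.
    """
    T_p, d = phonetic.shape
    assert d % num_heads == 0
    d_h = d // num_heads
    # Random projections stand in for learned parameters (hypothetical).
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d)
                      for _ in range(4))
    # Split into heads: (num_heads, seq_len, d_h)
    Q = (phonetic @ Wq).reshape(T_p, num_heads, d_h).transpose(1, 0, 2)
    K = (linguistic @ Wk).reshape(-1, num_heads, d_h).transpose(1, 0, 2)
    V = (linguistic @ Wv).reshape(-1, num_heads, d_h).transpose(1, 0, 2)
    # Scaled dot-product attention per head.
    scores = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_h), axis=-1)
    heads = scores @ V                              # (num_heads, T_p, d_h)
    # Concatenate heads and project back to model dimension.
    fused = heads.transpose(1, 0, 2).reshape(T_p, d)
    return fused @ Wo

# Toy dimensions: 50 audio frames, 12 canonical phones, 64-dim embeddings.
rng = np.random.default_rng(0)
phonetic = rng.standard_normal((50, 64))
linguistic = rng.standard_normal((12, 64))
fused = multi_head_cross_attention(phonetic, linguistic, num_heads=8, rng=rng)
```

The fused frame-level representation would then feed a classifier that labels each frame or phone as correctly or incorrectly pronounced.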


Nguyen Ket Doan1, Tran Nguyen Anh1, Vo Van Nam1, Nguyen Tran Tien1, Le Pham Tuyen2, Nguyen Quoc Vuong3 and Nguyen Huu Nhat Minh1*
1The University of Danang, Vietnam - Korea University of Information and Communication Technology
2Industrial University of Ho Chi Minh City
3Dong A University