Building a sentence-aligned Vietnamese–English bilingual corpus in tourism domain for machine translation

Nguyễn Tiến Hà, Nguyễn Thị Minh Huyền, Nguyễn Minh Hải


Sentence-aligned bilingual corpora constitute an important language resource for many applications in natural language processing, such as comparative linguistics, cross-language information retrieval, and bilingual dictionary construction. In machine translation in particular, the quality and size of bilingual corpora play a crucial role in translation quality. Present machine translation systems still need to be improved to handle many linguistic phenomena. Translation systems trained on general-domain corpora usually perform poorly on texts from a specific domain. One solution is to combine the general-domain translation model with a domain-specific translation model. Consequently, the construction of annotated bilingual corpora in specific domains is important. In this paper, we present our work on the construction of a Vietnamese–English bilingual corpus in the field of tourism, and on the improvement of an existing sentence alignment tool for Vietnamese–English bilingual texts, which achieves an accuracy above 90% on our different datasets. With the help of this tool, we build a sentence-aligned tourism-domain corpus which, when used to train a Vietnamese–English translation model, yields an improvement of about 8.79 BLEU points over models trained only on general-domain parallel texts.

DOI: 10.32913/rd-ict.vol1.no39.550


Bilingual data, bilingual alignment, statistical machine translation, tourism domain corpus, Vietnamese–English machine translation.


