Building a sentence-aligned Vietnamese–English bilingual corpus in tourism domain for machine translation

Nguyễn Tiến Hà, Nguyễn Thị Minh Huyền, Nguyễn Minh Hải

Abstract


Sentence-aligned bilingual corpora are an important language resource for many applications in natural language processing, such as comparative linguistics, cross-language information retrieval, and bilingual dictionary construction. In machine translation in particular, the quality and size of the bilingual corpora play a crucial role in translation quality. Current machine translation systems still need improvement to handle many linguistic phenomena, and systems trained on general-domain corpora usually perform poorly on texts from a specific domain. One solution is to combine the general-domain translation model with a domain-specific translation model, which makes the construction of annotated bilingual corpora in specific domains important. In this paper, we present our work on the construction of a Vietnamese–English bilingual corpus in the tourism domain and on the improvement of an existing sentence alignment tool for Vietnamese–English bilingual texts, which reaches an accuracy above 90% on our different datasets. With the help of this tool, we build a sentence-aligned tourism-domain corpus which, when used to train a Vietnamese–English translation model, yields an improvement of about 8.79 BLEU points over models trained only on general-domain parallel texts.
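To make the sentence alignment task concrete, the following is a minimal sketch of a purely length-based aligner, loosely in the spirit of classical length-based methods. It is not the tool improved in this work. The function name `align_by_length`, the bead penalties, and the example sentences are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Bead = Tuple[List[int], List[int]]

# Allowed bead shapes (source count, target count) and their fixed penalties.
# These penalty values are illustrative assumptions, not tuned parameters.
BEADS = {(1, 1): 0.0, (2, 1): 0.5, (1, 2): 0.5, (1, 0): 1.0, (0, 1): 1.0}


def align_by_length(src: List[str], tgt: List[str]) -> List[Bead]:
    """Align two sentence lists by dynamic programming over character lengths."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]
    tgt_len = [len(t) for t in tgt]
    # Expected number of target characters per source character.
    ratio = (sum(tgt_len) or 1) / (sum(src_len) or 1)

    def mismatch(ls: int, lt: int) -> float:
        # Relative length difference in [0, 1]; 0 means a perfect length match.
        expected = ls * ratio
        return abs(expected - lt) / max(expected + lt, 1.0)

    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + penalty
                if di and dj:  # skipped sentences pay only the fixed penalty
                    cost += mismatch(sum(src_len[i:ni]), sum(tgt_len[j:nj]))
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (di, dj)

    # Follow the back-pointers from the end to recover the bead sequence.
    beads: List[Bead] = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads


vi = ["Hà Nội là thủ đô của Việt Nam.",
      "Vịnh Hạ Long rất đẹp.",
      "Du khách nên đến vào mùa xuân."]
en = ["Hanoi is the capital of Vietnam.",
      "Ha Long Bay is very beautiful.",
      "Visitors should come in spring."]
alignment = align_by_length(vi, en)
```

Each returned bead pairs a list of source sentence indices with a list of target sentence indices, so a one-to-one match appears as `([k], [k])` and a merged pair as `([k, k + 1], [k])`. A practical aligner would typically add lexical evidence on top of length, since length alone is easily misled by omissions and free translations.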

DOI: 10.32913/rd-ict.vol1.no39.550


Keywords


Bilingual data, bilingual alignment, statistical machine translation, tourism domain corpus, Vietnamese–English machine translation.


