Author Profiling of Vietnamese Forum Posts Based on Syllables and Rhymes

  • Duong Tran Duc, Posts and Telecommunications Institute of Technology
  • Pham Bao Son, University of Engineering and Technology, Vietnam National University, Hanoi
  • Tan Hanh, Posts and Telecommunications Institute of Technology

Abstract

Author profiling is the task of identifying characteristics of an author based solely on a text document. Previous work has exploited a number of linguistic features: character-based, word-based, and grammar-based features (often grouped as style-based features), and content-based features (content words). Earlier results showed that content-based features often outperform style-based features. However, using content-based features is considered a domain-specific approach, because the chosen content words often carry meaning tied to the domain under study. In this work, we investigate the use of syllables and rhymes as features for author profiling of Vietnamese text. Syllables and rhymes are parts of words but carry much less meaning than words, especially rhymes, so these features can be considered much less domain-dependent than content words. We ran experiments on forum post datasets using a machine learning approach. With an improvement of up to 8% over baseline results obtained with style-based features, our method offers a promising new approach to author profiling.
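To make the idea of rhyme features concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a written Vietnamese syllable can be split into an optional initial consonant and the remaining rhyme, and that relative rhyme frequencies serve as features. The onset list `ONSETS` and the function names `split_syllable` and `rhyme_features` are hypothetical and introduced only for illustration. The sketch keeps tone marks as part of the rhyme and treats "gi" and "qu" as plain onsets, which simplifies Vietnamese phonology.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical, simplified onset inventory (an assumption of this sketch).
# Longer onsets come first so that "ngh" is tried before "ng" and "n".
ONSETS = ["ngh", "ng", "nh", "ch", "gh", "gi", "kh", "ph", "qu", "th", "tr",
          "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "r", "s",
          "t", "v", "x"]


def split_syllable(syllable):
    """Split one syllable into (onset, rhyme); the onset may be empty."""
    s = unicodedata.normalize("NFC", syllable).lower()
    for onset in ONSETS:
        # Require at least one character after the onset to form the rhyme.
        if s.startswith(onset) and len(s) > len(onset):
            return onset, s[len(onset):]
    return "", s


def rhyme_features(text):
    """Map each rhyme in the text to its relative frequency."""
    # Normalize first so accented vowels are single code points and
    # the letter-only pattern below does not split syllables.
    text = unicodedata.normalize("NFC", text)
    syllables = re.findall(r"[^\W\d_]+", text)
    counts = Counter(split_syllable(s)[1] for s in syllables)
    total = sum(counts.values())
    return {rhyme: n / total for rhyme, n in counts.items()} if total else {}
```

Under these assumptions, `split_syllable("nghiêng")` separates the onset "ngh" from the rhyme "iêng", and the dictionary returned by `rhyme_features` could be passed to any standard classifier as a feature vector.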

Published
2017-05-31
Section
Article