A Comparative Analysis of Filter-Based Feature Selection Methods for Software Fault Prediction

Phân tích so sánh các kỹ thuật lựa chọn đặc trưng dựa trên phương pháp lọc trong dự đoán lỗi phần mềm

  • Thị Minh Phương Hà
  • Thi My Hanh Le
  • Thanh Binh Nguyen
University of Danang - Vietnam-Korea University of Information and Communication Technology
Keywords: Feature selection, filter, wrapper, hybrid, embedded


The rapid growth of data has become a major challenge for software systems. The quality of a fault prediction model depends on the quality of the software dataset, and high-dimensional data is a major problem that degrades the performance of such models. To address the dimensionality problem, various researchers have proposed feature selection. Feature selection methods offer an effective solution by eliminating irrelevant and redundant features, which reduces computation time and improves the accuracy of the machine learning model. In this study, we survey and synthesize filter-based feature selection with several search methods and algorithms. In addition, we analyze five filter-based feature selection methods using five different classifiers on datasets obtained from the National Aeronautics and Space Administration (NASA) repository. The experimental results show that the Chi-Square and Information Gain methods had the most positive influence on predictive model results among the filter ranking methods examined.
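To make the idea of filter ranking concrete, the following is a minimal sketch of how Chi-Square and Information Gain scores can be computed for discrete features and used to rank them independently of any classifier. It is not the implementation used in the study: the function names, the feature names, and the toy data are all hypothetical, and real software metrics would typically need to be discretized first.

```python
import math
from collections import Counter


def chi_square_score(feature, labels):
    """Pearson chi-square statistic between a discrete feature and the class labels."""
    n = len(labels)
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    observed = Counter(zip(feature, labels))
    score = 0.0
    for f, f_count in feature_counts.items():
        for c, c_count in label_counts.items():
            # Expected count if the feature and the class were independent.
            expected = f_count * c_count / n
            score += (observed[(f, c)] - expected) ** 2 / expected
    return score


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())


def information_gain(feature, labels):
    """Reduction in class entropy obtained by knowing the feature value."""
    n = len(labels)
    conditional = 0.0
    for f, f_count in Counter(feature).items():
        subset = [c for x, c in zip(feature, labels) if x == f]
        conditional += (f_count / n) * entropy(subset)
    return entropy(labels) - conditional


def rank_features(columns, labels, score_fn):
    """Return feature names sorted from highest to lowest filter score."""
    scores = {name: score_fn(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data (not from the NASA datasets): 1 = faulty module, 0 = clean.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "high_complexity": [1, 1, 1, 1, 0, 0, 0, 0],  # matches the label exactly
    "many_operators": [1, 1, 1, 0, 1, 0, 0, 0],   # partially informative
    "has_comments": [1, 0, 1, 0, 1, 0, 1, 0],     # independent of the label
}

chi_ranking = rank_features(columns, labels, chi_square_score)
ig_ranking = rank_features(columns, labels, information_gain)
print("Chi-Square ranking:", chi_ranking)
print("Information Gain ranking:", ig_ranking)
```

In a filter approach, the top-ranked features from such a list would then be passed to any classifier; the ranking itself never consults the learning algorithm.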

