Software defect prediction model based on distance metric learning. (English) Zbl 1491.68051

Summary: Software defect prediction (SDP) is an important means of analyzing software quality and reducing development costs. Data collected during the software lifecycle can be used to predict software defects. Many SDP models have been proposed, but their performance is not always satisfactory. In many existing prediction models based on machine learning, the distance metric between samples has a significant impact on the performance of the SDP model. In addition, the samples are usually class imbalanced. To address these issues, this paper proposes a novel distance metric learning method based on cost-sensitive learning (CSL) to reduce the impact of class imbalance; the learned metric is then applied to the large margin distribution machine (LDM) in place of the traditional kernel function. The improvement and optimization of the LDM based on CSL are also studied, and the improved LDM, called CS-ILDM, is used as the SDP model. The proposed CS-ILDM is applied to five publicly available data sets from the NASA Metrics Data Program repository, and its performance is compared with that of other existing SDP models. The experimental results confirm that the proposed CS-ILDM not only achieves good prediction performance but also reduces the misprediction cost and avoids the impact of class imbalance.
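
The two ingredients named in the summary, a learned distance metric used in place of a standard kernel and per-class misclassification costs, can be illustrated with a minimal sketch. This is not the paper's CS-ILDM: the metric matrix `M` is assumed to be given (the identity is used below), the function names are hypothetical, and the cost rule is the common inverse-class-frequency heuristic rather than anything stated in the summary.

```python
import numpy as np

def mahalanobis_sq(X, Y, M):
    """Pairwise squared distances (x - y)^T M (x - y) for rows of X and Y."""
    diff = X[:, None, :] - Y[None, :, :]          # shape (n_x, n_y, d)
    return np.einsum("ijk,kl,ijl->ij", diff, M, diff)

def distance_kernel(X, Y, M, gamma=1.0):
    """RBF-style similarity built on the metric M instead of Euclidean distance."""
    return np.exp(-gamma * mahalanobis_sq(X, Y, M))

def class_costs(y):
    """Per-sample costs inversely proportional to class frequency (assumed rule)."""
    classes, counts = np.unique(y, return_counts=True)
    weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
    return np.array([weight[c] for c in y])

# Toy data: three non-defective modules (label 0) and one defective module (label 1).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
y = np.array([0, 0, 0, 1])
M = np.eye(2)                                     # placeholder for a learned metric

K = distance_kernel(X, X, M, gamma=0.5)           # Gram matrix for a kernel machine
costs = class_costs(y)                            # minority class gets the larger cost
```

With `M` set to the identity, `distance_kernel` reduces to the ordinary Gaussian kernel; a metric learned from the data would reshape the distances before they enter the classifier, and `costs` could be passed as sample weights to a margin-based learner.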


68N30 Mathematical aspects of software engineering (specification, verification, metrics, requirements, etc.)
68T05 Learning and adaptive systems in artificial intelligence

