
Performance comparison of machine learning platforms. (English) Zbl 07281707

Summary: In this paper, we present a method for comparing and evaluating different collections of machine learning algorithms on the basis of a given performance measure (e.g., accuracy, area under the curve (AUC), \(F\)-score). Such a method can be used to compare standard machine learning platforms such as SAS, IBM SPSS, and Microsoft Azure ML. A recent trend in the automation of machine learning is to exercise a collection of machine learning algorithms on a particular problem and then use the best performing algorithm. Thus, the proposed method can also be used to compare and evaluate different collections of algorithms for automation on a certain problem type and find the best collection.

In the study reported here, we applied the method to compare six machine learning platforms: R, Python, SAS, IBM SPSS Modeler, Microsoft Azure ML, and Apache Spark ML. We compared the platforms on the basis of predictive performance on classification problems because a significant majority of the problems in machine learning are of that type. The general question that we addressed is the following: Are there platforms that are superior to others on some particular performance measure?

For each platform, we used a collection of six classification algorithms from the following six families of algorithms: support vector machines, multilayer perceptrons, random forest (or variant), decision trees/gradient boosted trees, Naive Bayes/Bayesian networks, and logistic regression. We compared their performance on the basis of classification accuracy, \(F\)-score, and AUC. We used the \(F\)-score and AUC measures to compare platforms on two-class problems only. For testing the platforms, we used a mix of data sets from (1) the University of California, Irvine (UCI) library, (2) the Kaggle competition library, and (3) high-dimensional gene expression problems. We performed some hyperparameter tuning on algorithms wherever possible.
The online supplement is available at https://doi.org/10.1287/ijoc.2018.0825.
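The evaluation protocol described in the summary can be sketched in Python with scikit-learn (one of the six platforms compared). The six classifier families and the three performance measures follow the summary; the data set, train/test split, and default hyperparameters below are illustrative assumptions, not the paper's actual experimental setup.

```python
# Minimal sketch of the comparison protocol: fit one classifier per family
# on a two-class UCI problem, then score each on accuracy, F-score, and AUC.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# A two-class UCI data set, chosen here only for illustration.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One representative per algorithm family from the summary.
classifiers = {
    "SVM": SVC(probability=True, random_state=0),
    "Multilayer perceptron": MLPClassifier(max_iter=1000, random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "Gradient boosted trees": GradientBoostingClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Logistic regression": LogisticRegression(max_iter=5000),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    proba = clf.predict_proba(X_te)[:, 1]  # scores for the positive class
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "f_score": f1_score(y_te, pred),          # two-class problems only
        "auc": roc_auc_score(y_te, proba),        # two-class problems only
    }

for name, m in results.items():
    print(f"{name:24s} acc={m['accuracy']:.3f} "
          f"F={m['f_score']:.3f} AUC={m['auc']:.3f}")
```

Running such a collection per platform, over many data sets, yields the per-measure rankings the paper compares across platforms.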

MSC:

68T05 Learning and adaptive systems in artificial intelligence
