×

zbMATH — the first resource for mathematics

Making decision trees feasible in ultrahigh feature and label dimensions. (English) Zbl 1434.68425
Summary: Due to the non-linear but highly interpretable representations, decision tree (DT) models have significantly attracted a lot of attention of researchers. However, it is difficult to understand and interpret DT models in ultrahigh dimensions and DT models usually suffer from the curse of dimensionality and achieve degenerated performance when there are many noisy features. To address these issues, this paper first presents a novel data-dependent generalization error bound for the perceptron decision tree (PDT), which provides the theoretical justification to learn a sparse linear hyperplane in each decision node and to prune the tree. Following our analysis, we introduce the notion of budget-aware classifier (BAC) with a budget constraint on the weight coefficients, and propose a supervised budgeted tree (SBT) algorithm to achieve non-linear prediction performance. To avoid generating an unstable and complicated decision tree and improve the generalization of the SBT, we present a pruning strategy by learning classifiers to minimize cross-validation errors on each BAC. To deal with ultrahigh label dimensions, based on three important phenomena of real-world data sets from a variety of application domains, we develop a sparse coding tree framework for multi-label annotation problems and provide the theoretical analysis. Extensive empirical studies verify that 1) SBT is easy to understand and interpret in ultrahigh dimensions and is more resilient to noisy features. 2) Compared with state-of-the-art algorithms, our proposed sparse coding tree framework is more efficient, yet accurate in ultrahigh label and feature dimensions.

MSC:
68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)
PDF BibTeX XML Cite
Full Text: Link
References:
[1] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW, pages 13–24, May 2013.
[2] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, page 6, 2004.
[3] A.-L. Barab´asi, R. Albert, H. Jeong, and G. Bianconi. Power-law distribution of the world wide web. Science, 287(54561):2115, 2000.
[4] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, 2002. · Zbl 1084.68549
[5] Peter L. Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods - Support Vector Learning, pages 43–54. MIT Press, Cambridge, MA, USA, 1998.
[6] Kristin P. Bennett, Nello Cristianini, John Shawe-Taylor, and Donghui Wu. Enlarging the margins in perceptron decision trees. Machine Learning, 41(3):295–313, 2000. · Zbl 0971.68079
[7] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multi-label classification. In NIPS, pages 730–738, 2015.
[8] Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown. Learning multilabel scene classification. Pattern Recognition, 37(9):1757–1771, 2004. 31
[9] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001. · Zbl 1007.68152
[10] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. CA: Wadsworth International Group, 1984. · Zbl 0541.62042
[11] Serhat Selcuk Bucak, Rong Jin, and Anil K. Jain. Multiple kernel learning for visual object recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell., 36(7):1354–1369, 2014.
[12] Yao-Nan Chen and Hsuan-Tien Lin. Feature-aware label space dimension reduction for multi-label classification. In NIPS, pages 1538–1546, 2012.
[13] Marcin Czajkowski, Marek Grzes, and Marek Kretowski. Multi-test decision tree and its application to microarray data classification. Artificial Intelligence in Medicine, 61(1): 35–44, 2014.
[14] Jia Deng, Olga Russakovsky, Jonathan Krause, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Scalable multi-label annotation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 3099–3102, 2014.
[15] Bo Du, Mengfei Zhang, Lefei Zhang, Ruimin Hu, and Dacheng Tao. PLTD: patch-based low-rank tensor decomposition for hyperspectral images. IEEE Trans. Multimedia, 19(1): 67–79, 2017.
[16] Jianqing Fan, Richard Samworth, and Yichao Wu. Ultrahigh dimensional feature selection: Beyond the linear model. JMLR, 10:2013–2038, 2009. · Zbl 1235.62089
[17] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Lin Chih-Jen. Liblinear: A library for large linear classification. JMLR, 9:1871–1874, 2008. · Zbl 1225.68175
[18] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2000. · Zbl 1043.62034
[19] Mehmet G¨onen and Ethem Alpaydin. Localized multiple kernel learning. In ICML, pages 352–359, 2008. · Zbl 1254.68204
[20] Chen Gong, Dacheng Tao, Wei Liu, Liu Liu, and Jie Yang. Label propagation via teachingto-learn and learning-to-teach. IEEE Trans. Neural Netw. Learning Syst., 28(6):1452– 1465, 2017.
[21] Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, and Cordelia Schmid. Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In ICCV, pages 309–316, 2009.
[22] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. New York: Springer New York Inc, 2001. · Zbl 0973.62007
[23] David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers, 40(9):1098–1101, 1952.
[24] Cijo Jose, Prasoon Goyal, Parv Aggrwal, and Manik Varma. Local deep kernel learning for efficient non-linear svm prediction. In ICML, pages 486–494, 2013. 32
[25] Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 31st Symposium on the Foundations of Computer Science, pages 382–391, 1990. · Zbl 0822.68093
[26] J. E. Kelley. The cutting plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960. · Zbl 0098.12104
[27] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Trans. Information Theory, 47(5):1902–1914, 2001. · Zbl 1008.62614
[28] Oluwasanmi Koyejo, Nagarajan Natarajan, Pradeep Ravikumar, and Inderjit S. Dhillon. Consistent multilabel classification. In NIPS, pages 3321–3329, 2015.
[29] Gert R. G. Lanckriet, Tijl De Bie, Nello Cristianini, Michael I. Jordan, and William Stafford Noble. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626– 2635, 2004a.
[30] Gert R. G. Lanckriet, Nello Cristianini, Peter L. Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004b. · Zbl 1222.68241
[31] Su-In Lee, Honglak Lee, Pieter Abbeel, and Andrew Y. Ng. Efficient l1 regularized logistic regression. In The Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, pages 401–403, 2006.
[32] Yu-Feng Li, Ivor W. Tsang, James T. Kwok, and Zhi-Hua Zhou. Convex and scalable weakly labeled svms. Journal of Machine Learning Research, 14(1):2151–2188, 2013. · Zbl 1317.68169
[33] Weiwei Liu and Ivor W. Tsang. Large margin metric learning for multi-label prediction. In AAAI, pages 2800–2806, 2015a.
[34] Weiwei Liu and Ivor W. Tsang. On the optimality of classifier chain for multi-label classification. In NIPS, pages 712–720, 2015b.
[35] Weiwei Liu and Ivor W. Tsang. Sparse perceptron decision tree for millions of dimensions. In AAAI, pages 1881–1887, 2016.
[36] Xinwang Liu, Lei Wang, Jianping Yin, Yong Dou, and Jian Zhang. Absent multiple kernel learning. In AAAI, pages 2807–2813, 2015.
[37] Gjorgji Madjarov, Dragi Kocev, Dejan Gjorgjevikj, and Saso Dzeroski. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9): 3084–3104, Sep 2012.
[38] Qi Mao, Ivor Wai-Hung Tsang, and Shenghua Gao. Objective-guided image annotation. IEEE Transactions on Image Processing, 22(4):1585–1597, 2013. · Zbl 1373.94277
[39] Julian J. McAuley, Rahul Pandey, and Jure Leskovec. Inferring networks of substitutable and complementary products. In SIGKDD, pages 785–794, 2015a. 33
[40] Julian J. McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Imagebased recommendations on styles and substitutes. In SIGIR, pages 43–52, 2015b.
[41] Robert J. McQueen, Garner S.R., Nevill-Manning C.G., and Witten I.H. Applying machine learning to agricultural data. Computers and Electronics in Agriculture, 12(4):275–293, 1995.
[42] Shahar Mendelson. Rademacher averages and phase transitions in glivenko-cantelli classes. IEEE Trans. Information Theory, 48(1):251–263, 2002. · Zbl 1059.60027
[43] Joseph J. Mezrich. When is a tree a hedge? Financial Analysts Journal, pages 75–81, 1994.
[44] Tom M. Mitchell, Rebecca A. Hutchinson, Radu Stefan Niculescu, Francisco Pereira, Xuerui Wang, Marcel Adam Just, and Sharlene D. Newman. Learning to decode cognitive states from brain images. Machine Learning, 57(1-2):145–175, 2004. · Zbl 1078.68715
[45] Bernard M. E. Moret. Decision trees and diagrams. Computing Surveys, 14(4):593–623, 1982.
[46] Sreerama K. Murthy, Simon Kasif, and Steven Salzberg. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2:1–32, 1994. · Zbl 0900.68335
[47] Hidekazu Oiwa and Ryohei Fujimaki. Partition-wise linear models. In NIPS, pages 3527– 3535, 2014.
[48] Jan Paralic and Peter Bednar. Text mining for documents annotation and ontology support, 2003.
[49] E. Y. Pee and Johannes O. Royset. On solving large-scale finite minimax problems using exponential smoothing. Journal of Optimization Theory and Applications, 148(2):390– 421, 2011. · Zbl 1216.90097
[50] John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin dags for multiclass classification. In NIPS, pages 547–553, Denver, Colorado, November 1999. MIT Press.
[51] Yashoteja Prabhu and Manik Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In SIGKDD, pages 263–272, August 2014.
[52] J. Ross Quinlan. C4.5: Programs for Machine Learning. USA: Morgan Kaufmann Publishers Inc, 1993.
[53] Alain Rakotomamonjy, Francis R. Bach, St´ephane Canu, and Yves Grandvalet. Simplemkl. Journal of Machine Learning Research, 9:2491–2521, 2008. · Zbl 1225.68208
[54] Jesse Read, Bernhard Pfahringer, Geoffrey Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.
[55] Steven Roman. Coding and Information Theory. Springer Verlag, New York, 1992. · Zbl 0752.94001
[56] S. Rasoul Safavian and David A. Landgrebe. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man and Cybernetics, 21(3):660–674, 1991. 34
[57] Steven Salzberg, Rupali Chandar, Holland Ford, Sreerama K. Murthy, and Richard L. White. Decision trees for automated identification of cosmic-ray hits in hubble space telescope images. Publications of the Astronomical Society of the Pacific, 107:1–10, 1995.
[58] Robert E. Schapire and Yoram Singer.Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2-3):135–168, 2000. · Zbl 0951.68561
[59] Claude Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, 1948. · Zbl 1154.94303
[60] John Shawe-Taylor and Nello Cristianini. Data-dependent structural risk minimization for perceptron decision trees. In NIPS, pages 336–342, 1997.
[61] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. New York: Cambridge University Press, 2004. · Zbl 0994.68074
[62] Shinichi Shimozono, Ayumi Shinohara, Takeshi Shinohara, Satoru Miyano, Satoru Kuhara, and Setsuo Arikawa. Knowledge acquisition from amino acid sequences by machine learning system bonsai. Transactions of the Information Processing Society of Japan, 35(10): 2009–2018, 1994. · Zbl 0862.68091
[63] Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958. · Zbl 0081.11502
[64] Yan Song, Xian-Sheng Hua, Li-Rong Dai, and Meng Wang. Semi-automatic video annotation based on active learning with multiple complementary predictors. In Multimedia Information Retrieval, pages 97–104, 2005.
[65] S¨oren Sonnenburg, Gunnar R¨atsch, Christin Sch¨afer, and Bernhard Sch¨olkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006. · Zbl 1222.90072
[66] Niranjan A. Subrahmanya and Yung C. Shin. Sparse multiple kernel learning for signal processing applications. IEEE Trans. Pattern Anal. Mach. Intell., 32(5):788–798, 2010.
[67] Mingkui Tan, Ivor W. Tsang, and Li Wang. Towards ultrahigh dimensional feature selection for big data. JMLR, 15(1):1371–1429, 2014. · Zbl 1318.68156
[68] Jinhui Tang, Xian-Sheng Hua, Guo-Jun Qi, Yan Song, and Xiuqing Wu. Video annotation based on kernel linear neighborhood propagation. IEEE Transactions on Multimedia, 10 (4):620–628, 2008.
[69] Grigorios Tsoumakas and Ioannis P. Vlahavas. Random k-labelsets: An ensemble method for multilabel classification. In ECML, pages 406–417, Warsaw, Poland, 2007. SpringerVerlag.
[70] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Effective and efficient multilabel classification in domains with large number of labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD’08), 2008. · Zbl 1132.68603
[71] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis P. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer US, 2010. 35
[72] De Wang, Danesh Irani, and Calton Pu. Evolutionary study of web spam: Webb spam corpus 2011 versus webb spam corpus 2006. In 8th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing, pages 40–49, 2012.
[73] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus F. M. Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.
[74] Zhixiang Eddie Xu, Matt J. Kusner, Kilian Q. Weinberger, and Minmin Chen.Costsensitive tree of classifiers. In ICML, pages 133–141, 2013.
[75] Zhixiang Eddie Xu, Gao Huang, Kilian Q. Weinberger, and Alice X. Zheng. Gradient boosted feature selection. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 522–531, 2014.
[76] Ian En-Hsu Yen, Xiangru Huang, Pradeep Ravikumar, Kai Zhong, and Inderjit S. Dhillon. Pd-sparse : A primal and dual sparse approach to extreme multiclass and multilabel classification. In ICML, pages 3069–3077, 2016.
[77] Yiteng Zhai, Yew-Soon Ong, and Ivor W. Tsang. The emerging “big dimensionality”. IEEE Computational Intelligence Magazine, 9(3):14–26, 2014.
[78] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894–942, 2010a. · Zbl 1183.62120
[79] Cun-Hui Zhang and Jian Huang. The sparsity and bias of the lasso selection in highdimensional linear regression. Annals of Statistics, 36(4):1567–1594, 2008. · Zbl 1142.62044
[80] Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. TKDE, 26(8):1819–1837, Aug 2014.
[81] Tong Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1081–1107, 2010b. · Zbl 1242.68262
[82] Yi Zhang and Jeff G. Schneider. Multi-label output codes using canonical correlation analysis. In Fourteenth International Conference on Artificial Intelligence and Statistics, pages 873–882. JMLR.org, 2011.
[83] Yi Zhang and Jeff G. Schneider. Maximum margin output coding. In ICML, pages 1575– 1582, Edinburgh, Scotland, July 2012. Omnipress.
[84] Ji Zhu, Saharon Rosset, Trevor Hastie, and Robert Tibshirani. 1-norm support vector machines. In NIPS, pages 49–56, 2003. · Zbl 1222.68213
[85] Paul C. Zikopoulos, Dirk deRoos, Krishnan Parasuraman, Thomas Deutsch, David Corrigan, and James Giles. Harness the Power of Big Data – The IBM Big Data Platform. Mcgraw-Hill, New York, 2012.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.