zbMATH — the first resource for mathematics

Data scarcity, robustness and extreme multi-label classification. (English) Zbl 07097472
Summary: The goal in extreme multi-label classification (XMC) is to learn a classifier which can assign a small subset of relevant labels to an instance from an extremely large set of target labels. The distribution of training instances among labels in XMC exhibits a long tail, implying that a large fraction of labels have very few positive training instances. Detecting tail labels, which represent the diversity of the label space and account for a large fraction (up to 80%) of all labels, has been a significant research challenge in XMC. In this work, we pose the tail-label detection task in XMC as robust learning in the presence of worst-case perturbations. This viewpoint is motivated by a key observation that there is a significant change in the distribution of the feature composition of instances of these labels from the training set to the test set. For shallow classifiers, our robustness perspective on XMC naturally motivates the well-known \(\ell_1\)-regularized classification. Contrary to the popular belief that Hamming loss is unsuitable for tail-label detection in XMC, we show that minimizing a (convex upper bound on) Hamming loss with appropriate regularization surpasses many state-of-the-art methods. Furthermore, we also highlight the sub-optimality of the coordinate-descent-based solver in the LibLinear package, which, given its ubiquity, is interesting in its own right. We also investigate the spectral properties of label graphs to provide novel insights into the conditions governing the performance of the Hamming loss based one-vs-rest scheme vis-à-vis label embedding methods.
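The scheme the summary describes, one-vs-rest classifiers minimizing a convex upper bound on Hamming loss (here, the squared hinge) with \(\ell_1\) regularization, can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the proximal-gradient (ISTA) solver, the learning rate, and the toy data are all assumptions made for the sketch.

```python
# Minimal sketch (assumed, not the paper's implementation): one-vs-rest
# linear classifiers trained per label by minimizing l1-regularized
# squared hinge loss via proximal gradient descent (ISTA).
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1 (soft-thresholding)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fit_l1_squared_hinge(X, y, lam=0.01, lr=0.01, n_iter=500):
    """Train one linear classifier for a single label, y in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        margin = 1.0 - y * (X @ w)
        active = margin > 0  # only violated margins contribute
        # gradient of the mean squared-hinge loss
        grad = -2.0 * (X[active].T @ (y[active] * margin[active])) / n
        # gradient step on the smooth part, prox step on the l1 part
        w = soft_threshold(w - lr * grad, lr * lam)
    return w

def fit_one_vs_rest(X, Y, lam=0.01):
    """Y is an (n_samples, n_labels) 0/1 indicator matrix."""
    return np.column_stack(
        [fit_l1_squared_hinge(X, 2.0 * Y[:, j] - 1.0, lam)
         for j in range(Y.shape[1])]
    )

def predict(X, W):
    """Predict each label independently by the sign of its score."""
    return (X @ W > 0).astype(int)
```

The soft-thresholding step drives small weights to exactly zero, which is the sparsity that the robustness viewpoint in the summary motivates for tail labels.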

68T05 Learning and adaptive systems in artificial intelligence
Full Text: DOI
[1] Agrawal, R., Gupta, A., Prabhu, Y., & Varma, M. (2013). Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW (pp. 13-24).
[2] Babbar, R., & Schölkopf, B. (2017). DiSMEC: Distributed sparse machines for extreme multi-label classification. In WSDM (pp. 721-729).
[3] Babbar, R., Partalas, I., Gaussier, E., & Amini, M.-R. (2013). On flat versus hierarchical classification in large-scale taxonomies. In Advances in Neural Information Processing Systems (pp. 1824-1832).
[4] Babbar, R.; Metzig, C.; Partalas, I.; Gaussier, E.; Amini, M-R, On power law distributions in large-scale taxonomies, ACM SIGKDD Explorations Newsletter, 16, 47-56, (2014)
[5] Babbar, R.; Partalas, I.; Gaussier, E.; Amini, M-R; Amblard, C., Learning taxonomy adaptation in large-scale classification, The Journal of Machine Learning Research, 17, 3350-3386, (2016) · Zbl 1367.68218
[6] Bach, F.; Jenatton, R.; Mairal, J.; Obozinski, G., Convex optimization with sparsity-inducing norms, Optimization for Machine Learning, 5, 19-53, (2011) · Zbl 06064248
[7] Bengio, S., Weston, J., & Grangier, D. (2010). Label embedding trees for large multi-class tasks. In Neural information processing systems (pp. 163-171).
[8] Bhatia, K., Jain, H., Kar, P., Varma, M., & Jain, P. (2015). Sparse local embeddings for extreme multi-label classification. In C. Cortes, N. D. Lawrence, D.D. Lee, M. Sugiyama, & R. Garnett (Eds.), NIPS (pp. 730-738).
[9] Bhatia, K., Dahiya, K., Jain, H., Prabhu, Y., & Varma, M. (2016). The extreme classification repository: Multi-label datasets and code. http://manikvarma.org/downloads/XC/XMLRepository.html.
[10] Bi, W., & Kwok, J. (2013). Efficient multi-label classification with many labels. In Proceedings of The 30th international conference on machine learning (pp. 405-413).
[11] Chen, Y.-N., & Lin, H.-T. (2012). Feature-aware label space dimension reduction for multi-label classification. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), NIPS (pp. 1529-1537).
[12] Chung, F. R. (1997). Spectral graph theory. Providence: American Mathematical Society. · Zbl 0867.05046
[13] Combettes, PL; Pesquet, J-C, A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery, IEEE Journal of Selected Topics in Signal Processing, 1, 564, (2007)
[14] Daume III, H., Karampatziakis, N., Langford, J., & Mineiro, P. (2016). Logarithmic time one-against-some. arXiv preprint. arXiv:1606.04988.
[15] Deng, J.; Berg, AC; Li, K.; Fei-Fei, L.; Daniilidis, K. (ed.); Maragos, P. (ed.); Paragios, N. (ed.), What does classifying more than 10,000 image categories tell us?, 71-84, (2010), Berlin
[16] Denton, E., Weston, J., Paluri, M., Bourdev, L., & Fergus, R. (2015). User conditional hashtag prediction for images. In KDD.
[17] Fan, R-E; Chang, K-W; Hsieh, C-J; Wang, X-R; Lin, C-J, LIBLINEAR: A library for large linear classification, The Journal of Machine Learning Research, 9, 1871-1874, (2008) · Zbl 1225.68175
[18] Gaure, A., Gupta, A., Verma, V. K., & Rai, P. (2017). A probabilistic framework for zero-shot multi-label learning. In UAI.
[19] Globerson, A., & Roweis, S. (2006). Nightmare at test time: robust learning by feature deletion. In W. W. Cohen & A. Moore (Eds.), Proceedings of the 23rd international conference on machine learning (pp. 353-360). ACM.
[20] Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint. arXiv:1412.6572.
[21] Hsu, D., Kakade, S., Langford, J., & Zhang, T. (2009). Multi-label prediction via compressed sensing. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems (pp. 772-780).
[22] Huang, H.-Y., & Lin, C.-J. (2016). Linear and kernel classification: When to use which? In SDM (pp. 216-224). SIAM.
[23] Jain, H., Prabhu, Y., & Varma, M. (2016). Extreme multi-label loss functions for recommendation, tagging, ranking and other missing label applications. In KDD.
[24] Jain, V., Modhe, N., & Rai, P. (2017). Scalable generative models for multi-label learning with missing labels. In ICML.
[25] Jasinska, K., Dembczynski, K., Busa-Fekete, R., Pfannschmidt, K., Klerx, T., & Hüllermeier, E. (2016). Extreme F-measure maximization using sparse probability estimates. In ICML.
[26] Jernite, Y., Choromanska, A., Sontag, D., & LeCun, Y. (2016). Simultaneous learning of trees and representations for extreme classification, with application to language modeling. arXiv preprint. arXiv:1610.04658.
[27] Lin, Z., Ding, G., Hu, M., & Wang, J. (2014). Multi-label classification via feature-aware implicit label space encoding. (pp. 325-333).
[28] Liu, J., Chang, W.-C., Wu, Y., & Yang, Y. (2017). Deep learning for extreme multi-label text classification. In SIGIR (pp. 115-124). ACM.
[29] McAuley, J., & Leskovec, J. (2013). Hidden factors and hidden topics: Understanding rating dimensions with review text. In RecSys (pp. 165-172). ACM.
[30] Mencia, E. L., & Fürnkranz, J. (2008). Efficient pairwise multilabel classification for large-scale problems in the legal domain. In ECML-PKDD.
[31] Nam, J., Mencía, E. L., Kim, H. J., & Fürnkranz, J. (2017). Maximizing subset accuracy with recurrent neural networks in multi-label classification. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems (pp. 5413-5423).
[32] Niculescu-Mizil, A., & Abbasnejad, E. (2017). Label Filters for large scale multilabel classification. In AISTATS, proceedings of machine learning research, (pp. 1448-1457). Fort Lauderdale.
[33] Papanikolaou, Y., & Tsoumakas, G. (2017). Subset labeled LDA for large-scale multi-label classification. arXiv preprint. arXiv:1709.05480.
[34] Prabhu, Y., & Varma, M. (2014). FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD (pp. 263-272). ACM.
[35] Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., & Varma, M. (2018). Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In WWW.
[36] Radlinski, F., Bennett, P. N., Carterette, B., & Joachims, T. (2009). Redundancy, diversity and interdependent document relevance. In ACM SIGIR Forum.
[37] Shaham, U., Yamada, Y., & Negahban, S. (2015). Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint. arXiv:1511.05432.
[38] Shani, G.; Gunawardana, A., Tutorial on application-oriented evaluation of recommendation systems, AI Communications, 26, 225-236, (2013)
[39] Si, S., Zhang, H., Keerthi, S. S., Mahajan, D., Dhillon, I. S., & Hsieh, C.-J. (2017). Gradient boosted decision trees for high dimensional sparse output. In ICML.
[40] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint. arXiv:1312.6199.
[41] Tagami, Y. (2017). AnnexML: Approximate nearest neighbor search for extreme multi-label classification. In KDD. ACM.
[42] Tai, F.; Lin, H-T, Multilabel classification with principal label space transformation, Neural Computation, 24, 2508-2542, (2012) · Zbl 1269.68084
[43] Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., & Madry, A. (2019). Robustness may be at odds with accuracy. In International conference on learning representations. https://openreview.net/forum?id=SyxAb30cY7.
[44] Tsoumakas, G., Katakis, I., & Vlahavas, I. (2009). Mining multi-label data. In Data mining and knowledge discovery handbook. Springer.
[45] Xu, H.; Caramanis, C.; Mannor, S., Robustness and regularization of support vector machines, Journal of Machine Learning Research, 10, 1485-1510, (2009) · Zbl 1235.68209
[46] Xu, H.; Caramanis, C.; Mannor, S., Robust regression and lasso, IEEE Transactions on Information Theory, 7, 3561-3574, (2010) · Zbl 1366.62147
[47] Yen, I. E., Huang, X., Ravikumar, P., Zhong, K., & Dhillon, I. S. (2016). PD-Sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In ICML.
[48] Yu, H.-F., Jain, P., Kar, P., & Dhillon, I. (2014). Large-scale multi-label learning with missing labels. In ICML (pp. 593-601).
[49] Zhang, Y., & Schneider, J. (2011). Multi-label output codes using canonical correlation analysis. (pp. 873-882).
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.