×

Regularizing conjunctive features for classification. (English) Zbl 1477.68087

Summary: We consider the feature-generation task wherein we are given a database with entities labeled as positive and negative examples, and we want to find feature queries that linearly separate the two sets of examples. We focus on conjunctive feature queries, and explore two problems: (a) deciding if separating feature queries exist (separability), and (b) generating such queries when they exist. To restrict the complexity of the generated classifiers, we explore various ways of regularizing them by limiting their dimension, the number of joins in feature queries, and their generalized hypertreewidth (ghw). We show that the separability problem is tractable for bounded ghw; yet, the generation problem is not because feature queries might be too large. So, we explore a third problem: classifying new entities without necessarily generating the feature queries. Interestingly, in the case of bounded ghw we can efficiently classify without explicitly generating such queries.

MSC:

68P15 Database theory
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68Q25 Analysis of algorithms and problem complexity

Software:

glasso
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Ahmed, F.; Samorani, M.; Bellinger, C.; Zaïane, O. R., Advantage of integration in big data: feature generation in multi-relational databases for imbalanced learning, (BigData (2016), IEEE), 532-539
[2] Alpaydin, E., Introduction to Machine Learning (2009), MIT Press
[3] Antonopoulos, T.; Neven, F.; Servais, F., Definability problems for graph query languages, (ICDT (2013)), 141-152
[4] Arenas, M.; Diaz, G. I., The exact complexity of the first-order logic definability problem, ACM Trans. Database Syst., 41, Article 13 pp. (2016) · Zbl 1474.68074
[5] Aronov, B.; Garijo, D.; Núñez-Rodríguez, Y.; Rappaport, D.; Seara, C.; Urrutia, J., Minimizing the error of linear separators on linearly inseparable data, Discrete Appl. Math., 160, 1441-1452 (2012) · Zbl 1243.68158
[6] Barceló, P.; Romero, M., The complexity of reverse engineering problems for conjunctive queries, (ICDT (2017)), Article 7 pp. · Zbl 1402.68040
[7] Bonifati, A.; Martens, W.; Timm, T., An analytical study of large SPARQL query logs, Proc. VLDB Endow., 11, 149-161 (2017)
[8] Cappuzzo, R.; Papotti, P.; Thirumuruganathan, S., Creating embeddings of heterogeneous relational datasets for data integration tasks, (Maier, D.; Pottinger, R.; Doan, A.; Tan, W.; Alawini, A.; Ngo, H. Q., SIGMOD (2020)), 1335-1349
[9] ten Cate, B.; Dalmau, V., The product homomorphism problem and applications, (ICDT (2015)), 161-176 · Zbl 1365.68296
[10] Chapelle, O.; Haffner, P.; Vapnik, V., Support vector machines for histogram-based image classification, IEEE Trans. Neural Netw., 10, 1055-1064 (1999)
[11] Chen, H.; Dalmau, V., Beyond hypertree width: decomposition methods without decompositions, (CP (2005)), 167-181 · Zbl 1153.68452
[12] Feldman, V.; Guruswami, V.; Raghavendra, P.; Wu, Y., Agnostic learning of monomials by halfspaces is hard, SIAM J. Comput., 41, 1558-1590 (2012) · Zbl 1261.68063
[13] Flum, J.; Grohe, M., Parameterized Complexity Theory (2006), Springer · Zbl 1143.68016
[14] Friedman, J.; Hastie, T.; Tibshirani, R., Sparse inverse covariance estimation with the graphical lasso, Biostatistics, 9, 432-441 (2008) · Zbl 1143.62076
[15] Gottlob, G.; Greco, G.; Leone, N.; Scarcello, F., Hypertree decompositions: questions and answers, (PODS (2016)), 57-74
[16] Gottlob, G.; Leone, N.; Scarcello, F., Hypertree decompositions and tractable queries, J. Comput. Syst. Sci., 64, 579-627 (2002) · Zbl 1052.68025
[17] Grohe, M., word2vec, node2vec, graph2vec, x2vec: towards a theory of vector embeddings of structured data, (Suciu, D.; Tao, Y.; Wei, Z., SIGMOD (2020)), 1-16
[18] Grohe, M.; Löding, C.; Ritzert, M., Learning mso-definable hypotheses on strings, (ALT, PMLR (2017)), 434-451 · Zbl 1403.68094
[19] Grohe, M.; Marx, D., Constraint solving via fractional edge covers, ACM Trans. Algorithms, 11, Article 4 pp. (2014) · Zbl 1398.68240
[20] Grohe, M.; Ritzert, M., Learning first-order definable concepts over structures of small degree, (LICS, IEEE Computer Society. (2017)), 1-12 · Zbl 1457.68133
[21] Guyon, I.; Gunn, S.; Nikravesh, M.; Zadeh, L. A., Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing) (2006), Springer-Verlag: Springer-Verlag New York, Inc., Secaucus, NJ, USA
[22] Höffgen, K.; Simon, H. U.; Horn, K. S.V., Robust trainability of single neurons, J. Comput. Syst. Sci., 50, 114-125 (1995) · Zbl 0939.68771
[23] Kandel, S.; Paepcke, A.; Hellerstein, J. M.; Heer, J., Enterprise data analysis and visualization: an interview study, IEEE Trans. Vis. Comput. Graph., 18, 2917-2926 (2012)
[24] Karmarkar, N., A new polynomial-time algorithm for linear programming, Combinatorica, 4, 373-396 (1984) · Zbl 0557.90065
[25] Kazemi, S. M.; Fatemi, B.; Kim, A.; Peng, Z.; Tora, M. R.; Zeng, X.; Dirks, M. C.; Poole, D., Comparing aggregators for relational probabilistic models (2017), CoRR
[26] Keller, J. M.; Gray, M. R.; Givens, J. A., A fuzzy k-nearest neighbor algorithm, IEEE Trans. Syst. Man Cybern. Syst., 15, 580-585 (1985)
[27] Khachiyan, L., A polynomial algorithm in linear programming, Sov. Math. Dokl., 20, 191-194 (1979) · Zbl 0409.90079
[28] Kimelfeld, B.; Ré, C., A relational framework for classifier engineering, (PODS (2017)), 5-20
[29] Knobbe, A. J.; de Haas, M.; Siebes, A., Propositionalisation and aggregates, (PKDD (2001)), 277-288 · Zbl 1009.68749
[30] Kolaitis, P. G.; Panttaja, J., On the complexity of existential pebble games, (CSL (2003)), 314-329 · Zbl 1116.68474
[31] Lam, H. T.; Minh, T. N.; Sinn, M.; Buesser, B.; Wistuba, M., Learning features for relational data (2018), URL:
[32] Marx, D., Tractable hypergraph properties for constraint satisfaction and conjunctive queries, J. ACM, 60, Article 42 pp. (2013) · Zbl 1281.68135
[33] Mohri, M.; Rostamizadeh, A.; Talwalkar, A., Foundations of Machine Learning (2012), MIT Press · Zbl 1318.68003
[34] Mudgal, S.; Li, H.; Rekatsinas, T.; Doan, A.; Park, Y.; Krishnan, G.; Deep, R.; Arcaute, E.; Raghavendra, V., Deep learning for entity matching: a design space exploration, (Das, G.; Jermaine, C. M.; Bernstein, P. A., SIGMOD (2018)), 19-34
[35] Murthy, C. A., Bridging feature selection and extraction: compound feature generation, IEEE Trans. Knowl. Data Eng., 29, 757-770 (2017)
[36] Perlich, C.; Provost, F. J., Distribution-based aggregation for relational learning with identifier attributes, Mach. Learn., 62, 65-105 (2006) · Zbl 1470.68158
[37] Pontil, M.; Verri, A., Support vector machines for 3D object recognition, IEEE Trans. Pattern Anal. Mach. Intell., 20, 637-646 (1998)
[38] Ribeiro, M. T.; Singh, S.; Guestrin, C., “Why should I trust you?”: explaining the predictions of any classifier, (KDD (2016), ACM), 1135-1144
[39] Ribeiro, M. T.; Singh, S.; Guestrin, C., Anchors: high-precision model-agnostic explanations, (AAAI (2018), AAAI Press), 1527-1535
[40] Samorani, M.; Laguna, M.; DeLisle, R. K.; Weaver, D. C., A randomized exhaustive propositionalization approach for molecule classification, INFORMS J. Comput., 23, 331-345 (2011) · Zbl 1243.62095
[41] Schölkopf, B.; Smola, A. J., Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Adaptive Computation and Machine Learning Series (2002), MIT Press
[42] Shalev-Shwartz, S.; Ben-David, S., Understanding Machine Learning: From Theory to Algorithms (2014), Cambridge University Press: Cambridge University Press New York, NY, USA · Zbl 1305.68005
[43] Weiss, Y. Y.; Cohen, S., Reverse engineering spj-queries from examples, (PODS (2017)), 151-166
[44] Willard, R., Testing expressibility is hard, (CP (2010)), 9-23
[45] Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P. S., A comprehensive survey on graph neural networks (2019), URL:
[46] Zhang, C.; Kumar, A.; Ré, C., Materialization optimizations for feature selection workloads, (SIGMOD Conference (2014)), 265-276
[47] Ziarko, W., Variable precision rough set model, J. Comput. Syst. Sci., 46, 39-59 (1993) · Zbl 0764.68162
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.