zbMATH — the first resource for mathematics

A polynomial-time maximum common subgraph algorithm for outerplanar graphs and its application to chemoinformatics. (English) Zbl 1357.68184
Summary: Metrics for structured data have received an increasing interest in the machine learning community. Graphs provide a natural representation for structured data, but a lot of operations on graphs are computationally intractable. In this article, we present a polynomial-time algorithm that computes a maximum common subgraph of two outerplanar graphs. The algorithm makes use of the block-and-bridge preserving subgraph isomorphism, which has significant efficiency benefits and is also motivated from a chemical perspective. We focus on the application of learning structure-activity relationships, where the task is to predict the chemical activity of molecules. We show how the algorithm can be used to construct a metric for structured data and we evaluate this metric and more generally also the block-and-bridge preserving matching operator on 60 molecular datasets, obtaining state-of-the-art results in terms of predictive performance and efficiency.

68T05 Learning and adaptive systems in artificial intelligence
05C85 Graph algorithms (graph-theoretic aspects)
68W40 Analysis of algorithms
92E10 Molecular structure (graph-theoretic methods, methods of differential topology, etc.)
gSpan; AFGen
Full Text: DOI
[1] Akutsu, T, A polynomial time algorithm for finding a largest common subgraph of almost trees of bounded degree, IEICE Trans. Fundam. Electron. Commun. Comput. Sci., E76-A, 1488-1493, (1993)
[2] Bringmann, B., Zimmermann, A., De Raedt, L., Nijssen, S.: Don’t be afraid of simpler patterns. In: Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 55-66 (2006) · Zbl 1006.68857
[3] Bunke, H; Shearer, K, A graph distance metric based on the maximal common subgraph, Pattern Recogn. Lett., 19, 255-259, (1998) · Zbl 0905.68128
[4] Cao, Y; Jiang, T; Girke, T, A maximum common substructure-based algorithm for searching and predicting drug-like compounds, Bioinformatics, 24, 366-374, (2008)
[5] Ceroni, A; Costa, F; Frasconi, P, Classification of small molecules by two- and three-dimensional decomposition kernels, Bioinformatics, 23, 2038-2045, (2007)
[6] Chaoji, V; Al Hasan, M; Salem, S; Besson, J; Zaki, MJ, Origami: A novel and effective approach for mining representative orthogonal graph patterns, Stat. Anal. Data Min., 1, 67-84, (2008)
[7] Chi, Y; Muntz, RR; Nijssen, S; Kok, JN, Frequent subtree mining—an overview, Fundam. Inform., 66, 161-198, (2005) · Zbl 1096.68044
[8] Conte, D; Foggia, P; Sansone, C; Vento, M, Thirty years of graph matching in pattern recognition, Int. J. Pattern Recogn. Artif. Intell., 18, 265-298, (2004)
[9] De Raedt, L.: Logical and Relational Learning. Springer (2008) · Zbl 1203.68145
[10] Raedt, L; Ramon, J, Deriving distance metrics from generality relations, Pattern Recogn. Lett., 30, 187-191, (2009)
[11] Demšar, J, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res., 7, 1-30, (2006) · Zbl 1222.68184
[12] Deshpande, M; Kuramochi, M; Wale, N; Karypis, G, Frequent substructure-based approaches for classifying chemical compounds, IEEE Trans. Knowl. Data Eng., 17, 1036-1050, (2005)
[13] Diestel, R.: Graph Theory. Springer-Verlag (2000) · Zbl 0945.05002
[14] Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman and Co. (1979) · Zbl 0411.68039
[15] Gärtner, T.: Kernels for Structured Data. World Scientific (2008)
[16] Hansch, C; Maolney, PP; Fujita, T; Muir, RM, Correlation of biological activity of phenoxyacetic acids with hammett substituent constants and partition coefficients, Nature, 194, 178-180, (1962)
[17] He, H; Singh, AK, Graphrank: statistical modeling and mining of significant subgraphs in the feature space, 885-890, (2006), Washington, DC
[18] Helma, C; Kramer, S; Raedt, L, Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds, J. Chem. Inf. Model., 44, 1402-1411, (2004)
[19] Hopcroft, JE; Karp, RM, A \(n\)\^{}{5/2} algorithm for maximum matching in bipartite graphs, SIAM J. Comput., 2, 225-231, (1973) · Zbl 0266.05114
[20] Horváth, T., Gärtner, T., Wrobel, S.: Cyclic pattern kernels for predictive graph mining. In: KDD ’04: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 158-167 (2004)
[21] Horváth, T., Ramon, J., Wrobel, S.: Frequent subgraph mining in outerplanar graphs. In: KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 197-206. Philadelphia, PA (2006)
[22] Horváth, T; Ramon, J; Wrobel, S, Frequent subgraph mining in outerplanar graphs, Data Min. Knowl. Discov., 21, 472-508, (2010)
[23] Joachims, T.: Learning to Classify Text using Support Vector Machines: Methods, Theory, and Algorithms. Springer (2002) · Zbl 0466.68053
[24] Johnson, M.A., Maggiora, G.M.: Concepts and Applications of Molecular Similarity. John Wiley (1990)
[25] Karunaratne, T., Boström, H.: Learning to classify structured data by graph propositionalization. In: Proceedings of the 2nd IASTED International Conference on Computational Intelligence, pp. 393-398 (2006) · Zbl 1222.68184
[26] King, RD; Muggleton, S; Srinivasan, A; Sternberg, MJE, Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming, Proc. Natl. Acad. Sci., 93, 438-442, (1996)
[27] Koch, I, Enumerating all connected maximal common subgraphs in two graphs, Theor. Comput. Sci., 250, 1-30, (2001) · Zbl 0952.68105
[28] Kramer, S., De Raedt, L., Helma, C.: Molecular feature mining in HIV data. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-01), pp. 136-143. ACM Press (2001) · Zbl 0949.68122
[29] Kramer, S., Lavrač, N., Flach, P.: Propositionalization approaches to relational data mining. In: Džeroski, S., Lavrač, N. (eds.) Relational Data Mining, pp. 262-291. Springer-Verlag (2001)
[30] Lingas, A, Subgraph isomorphism for biconnected outerplanar graphs in cubic time, Theor. Comput. Sci., 63, 295-302, (1989) · Zbl 0681.68090
[31] Maunz, A; Helma, C; Kramer, S, Large-scale graph mining using backbone refinement classes, 617-626, (2009), New York, NY
[32] McGregor, JJ, Backtrack search algorithms and the maximal common subgraph problem, Softw. Pract. Exp., 12, 23-34, (1982) · Zbl 0466.68053
[33] Mitchell, SL, Linear algorithms to recognize outerplanar and maximal outerplanar graphs, Inf. Process. Lett., 9, 229-232, (1979) · Zbl 0444.68055
[34] Munkres, J, Algorithms for the assignment and transportation problems, J. Soc. Ind. Appl. Math., 5, 32-38, (1957) · Zbl 0083.15302
[35] Nijssen, S., Kok, J.N.: A quickstart in frequent structure mining can make a difference. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 647-652 (2004)
[36] Raymond, J; Gardiner, E; Willett, P, Rascal: calculation of graph similarity using maximum common edge subgraphs, Comput. J., 45, 631-644, (2002) · Zbl 1037.68101
[37] Raymond, J; Willett, P, Effectiveness of graph-based and fingerprint-based similarity measures for virtual screening of 2D chemical structure databases, J. Comput. Aided Mol. Des., 16, 59-71, (2002)
[38] Raymond, J; Willett, P, Maximum common subgraph isomorphism algorithms for the matching of chemical structures, J. Comput. Aided Mol. Des., 16, 521-533, (2002)
[39] Schietgat, L., Ramon, J., Bruynooghe, M., Blockeel, H.: An efficiently computable graph-based metric for the classification of small molecules. In: Proceedings of the 11th International Conference on Discovery Science, vol. 5255 of Lecture Notes in Artificial Intelligence, pp. 197-209 (2008)
[40] Schietgat, L; Costa, F; Ramon, J; Raedt, L, Effective feature construction by maximum common subgraph sampling, Mach. Learn., 83, 137-161, (2011) · Zbl 1237.68162
[41] Shamir, R; Tsur, D, Faster subtree isomorphism, J. Algorithms, 33, 267-280, (1992) · Zbl 0949.68122
[42] Shearer, K; Bunke, H; Venkatesh, S, Video indexing and similarity retrieval by largest common subgraph detection using decision trees, Pattern Recogn., 34, 1075-1091, (2001) · Zbl 1006.68857
[43] Shervashidze, N., Borgwardt, K.: Fast subtree kernels on graphs. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural Information Processing Systems, vol. 22, pp. 1660-1668 (2009) · Zbl 0444.68055
[44] Swamidass, SJ; Chen, J; Bruand, J; Phung, P; Ralaivola, L; Baldi, P, Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity, Bioinformatics, 21, 359-368, (2005)
[45] Syslo, M, The subgraph isomorphism problem for outerplanar graphs, Theor. Comp. Sci., 17, 91-97, (1982) · Zbl 0522.68061
[46] Wale, N; Watson, IA; Karypis, G, Comparison of descriptor spaces for chemical compound retrieval and classification, Knowl. Inf. Syst., 14, 347-375, (2008)
[47] Wilcoxon, F, Individual comparisons by ranking methods, Biometrics, 1, 80-83, (1945)
[48] Willett, P, Similarity-based virtual screening using 2D fingerprints, Drug Discov. Today, 11, 1046-1051, (2006)
[49] Yan, X., Han, J.: gSpan: graph-based substructure pattern mining. In: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pp. 721-724. IEEE Computer Society (2002)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.