Effective feature construction by maximum common subgraph sampling.

*(English)*Zbl 1237.68162Summary: The standard approach to feature construction and predictive learning in molecular datasets is to employ computationally expensive graph mining techniques and to bias the feature search exploration using frequency or correlation measures. These features are then typically employed in predictive models that can be constructed using, for example, SVMs or decision trees. We take a different approach: rather than mining for all optimal local patterns, we extract features from the set of pairwise maximum common subgraphs. The maximum common subgraphs are computed under the block-and-bridge-preserving subgraph isomorphism from the outerplanar examples in polynomial time. We empirically observe a significant increase in predictive performance when using maximum common subgraph features instead of correlated local patterns on 60 benchmark datasets from NCI. Moreover, we show that when we randomly sample the pairs of graphs from which to extract the maximum common subgraphs, we obtain a smaller set of features that still allows the same predictive performance as methods that exhaustively enumerate all possible patterns. The sampling strategy turns out to be a very good compromise between a slight decrease in predictive performance (although still remaining comparable with state-of-the-art methods) and a significant runtime reduction (two orders of magnitude on a popular medium size chemoinformatics dataset). This suggests that maximum common subgraphs are interesting and meaningful features.

##### MSC:

68T05 | Learning and adaptive systems in artificial intelligence |

92C40 | Biochemistry, molecular biology |

68R10 | Graph theory (including graph drawing) in computer science |

PDF
BibTeX
XML
Cite

\textit{L. Schietgat} et al., Mach. Learn. 83, No. 2, 137--161 (2011; Zbl 1237.68162)

Full Text:
DOI

##### References:

[1] | Ben-David, S.; Eiron, N.; Simon, H. U., Limitations of learning via embeddings in Euclidean half spaces, Journal of Machine Learning Research, 3, 441-461, (2002) · Zbl 1084.68551 |

[2] | Bringmann, B., Zimmermann, A., Raedt, L. D., & Nijssen, S. (2006). Don’t be afraid of simpler patterns. In Proceedings of the tenth European conference on principles and practice of knowledge discovery in databases (pp. 55-66). |

[3] | Bunke, H.; Shearer, K., A graph distance metric based on the maximal common subgraph, Pattern Recognition Letters, 19, 255-259, (1998) · Zbl 0905.68128 |

[4] | Ceroni, A.; Costa, F.; Frasconi, P., Classification of small molecules by two- and three-dimensional decomposition kernels, Bioinformatics, 23, 2038-2045, (2007) |

[5] | Chaoji, V.; Al Hasan, M.; Salem, S.; Besson, J.; Zaki, J. M., Origami: a novel and effective approach for mining representative orthogonal graph patterns, Statistical Analysis and Data Mining, 1, 67-84, (2008) |

[6] | De Raedt, L. (2008). Logical and relational learning. Berlin: Springer. · Zbl 1203.68145 |

[7] | Raedt, L.; Ramon, J., Deriving distance metrics from generality relations, Pattern Recognition Letters, 30, 187-191, (2009) · Zbl 1197.47067 |

[8] | Demšar, J., Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research, 7, 1-30, (2006) · Zbl 1222.68184 |

[9] | Deshpande, M.; Kuramochi, M.; Wale, N.; Karypis, G., Frequent substructure-based approaches for classifying chemical compounds, IEEE Transactions on Knowledge and Data Engineering, 17, 1036-1050, (2005) |

[10] | Diestel, R. (2000). Graph theory. Berlin: Springer. · Zbl 0945.05002 |

[11] | Garey, M. R., & Johnson, D. (1979). Computers and intractability: a guide to the theory of NP-completeness. New York: Freeman. · Zbl 0411.68039 |

[12] | Gärtner, T. (2005). Kernels for structured data. PhD thesis, University of Bonn, Germany. |

[13] | Hand, D. J., Measuring classifier performance: a coherent alternative to the area under the ROC curve, Machine Learning, 77, 103-123, (2009) |

[14] | He, H.; Singh, A. K., Graphrank: statistical modeling and mining of significant subgraphs in the feature space, Washington, DC, USA, Las Alamitos |

[15] | Horváth, T., Gärtner, T., & Wrobel, S. (2004). Cyclic pattern kernels for predictive graph mining. In KDD ’04: proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 158-167). |

[16] | Horváth, T., Ramon, J., & Wrobel, S. (2006). Frequent subgraph mining in outerplanar graphs. In Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining, Philadelphia, PA, August 2006, pp. 197-206. |

[17] | Joachims, T. (2002). Learning to classify text using support vector machines: methods, theory, and algorithms. Berlin: Springer. |

[18] | Karunaratne, T., & Boström, H. (2006). Learning to classify structured data by graph propositionalization. In Proceedings of the second IASTED international conference on computational intelligence (pp. 393-398). · Zbl 1222.68184 |

[19] | Kramer, S.; Raedt, L.; Helma, C., Molecular feature mining in HIV data, 136-143, (2001), New York |

[20] | Kramer, S.; Lavrač, N.; Flach, P.; Džeroski, S. (ed.); Lavrač, N. (ed.), Propositionalization approaches to relational data mining, 262-291, (2001), Berlin |

[21] | Munkres, J., Algorithms for the assignment and transportation problems, Journal of the Society for Industrial and Applied Mathematics, 5, 32-38, (1957) · Zbl 0083.15302 |

[22] | Plotkin, G., A further note on inductive generalization, No. 6, 101-124, (1971), Edinburgh · Zbl 0261.68042 |

[23] | Provost, F.; Fawcett, T., Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions, 43-48, (1998), Menlo Park |

[24] | Raymond, J.; Willett, P., Maximum common subgraph isomorphism algorithms for the matching of chemical structures, Journal of Computer-Aided Molecular Design, 16, 521-533, (2002) |

[25] | Schietgat, L.; Ramon, J.; Bruynooghe, M.; Blockeel, H., An efficiently computable graph-based metric for the classification of small molecules, No. 5255, 197-209, (2008), Berlin |

[26] | Sebag, M.; Lavrač, N. (ed.); Džeroski, S. (ed.), Distance induction in first order logic, No. 1297, 264-272, (1997), Berlin |

[27] | Swamidass, S. J.; Chen, J.; Bruand, J.; Phung, P.; Ralaivola, L.; Baldi, P., Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity, Bioinformatics, 21, 359-368, (2005) |

[28] | Wale, N.; Watson, I.; Karypis, G., Comparison of descriptor spaces for chemical compound retrieval and classification, Knowledge and Information Systems, 14, 347-375, (2008) |

[29] | Watanabe, S., Information theoretical analysis of multivariate correlation, IBM Journal of Research and Development, 4, 66-82, (1960) · Zbl 0097.35003 |

[30] | Willett, P., Similarity-based virtual screening using 2D fingerprints, Drug Discovery Today, 11, 1046-1051, (2006) |

[31] | Yan, X.; Han, J., Gspan: graph-based substructure pattern mining, Japan, Las Alamitos |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.