
Dense neighborhoods on affinity graph. (English) Zbl 1254.68212

Summary: We study the problem of how to reliably compute neighborhoods on affinity graphs. The \(k\)-nearest neighbors (\(k\)NN) is one of the most fundamental and simple methods, widely used in many tasks such as classification and graph construction. Previous research focused on how to efficiently compute \(k\)NN on vectorial data. However, most real-world data have no vectorial representation and come only with affinity graphs, which may contain unreliable affinities. Since the \(k\)NN of an object \(o\) is the set of \(k\) objects with the highest affinities to \(o\), it is easily disturbed by errors in the pairwise affinities between \(o\) and other objects, and it cannot preserve the structure underlying the data well. To reliably analyze neighborhoods on affinity graphs, we define the \(k\)-dense neighborhood (\(k\)DN), which considers all pairwise affinities within the neighborhood, i.e., not only the affinities between \(o\) and its neighbors but also those between the neighbors themselves. For an object \(o\), its \(k\)DN is a set \(k\mathrm{DN}(o)\) of \(k\) objects which maximizes the sum of all pairwise affinities of objects in the set \(\{o\}\cup k\mathrm{DN}(o)\). We analyze the properties of \(k\)DN and propose an efficient algorithm to compute it. Both theoretical analysis and experimental results on shape retrieval, semi-supervised learning, point set matching and data clustering show that \(k\)DN significantly outperforms \(k\)NN on affinity graphs, especially when many pairwise affinities are unreliable.
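The contrast between the two definitions can be illustrated with a small brute-force sketch. This is not the paper's efficient algorithm (the enumeration below is exponential in \(k\)); the function names `knn`/`kdn` and the toy affinity matrix are hypothetical, chosen only to show how a single unreliable pairwise affinity fools \(k\)NN but not \(k\)DN:

```python
from itertools import combinations

import numpy as np

def knn(A, o, k):
    # kNN: the k objects with the highest affinity to o.
    others = [i for i in range(len(A)) if i != o]
    return set(sorted(others, key=lambda j: A[o, j], reverse=True)[:k])

def kdn(A, o, k):
    # kDN: the k-subset S maximizing the sum of ALL pairwise affinities
    # in {o} ∪ S. Exhaustive search for illustration only; the paper
    # proposes an efficient algorithm instead.
    others = [i for i in range(len(A)) if i != o]
    def density(S):
        return sum(A[i, j] for i, j in combinations((o, *S), 2))
    return set(max(combinations(others, k), key=density))

# Two tight clusters {0,1,2} and {3,4,5}, plus one spurious affinity.
A = np.full((6, 6), 0.1)
for cluster in ([0, 1, 2], [3, 4, 5]):
    for i, j in combinations(cluster, 2):
        A[i, j] = A[j, i] = 0.9
A[0, 3] = A[3, 0] = 0.95  # a single unreliable pairwise affinity

print(knn(A, 0, 2))  # the spurious neighbor 3 displaces a true neighbor of 0
print(kdn(A, 0, 2))  # {1, 2}: intra-neighborhood affinities overrule the error
```

Because `kdn` scores the whole group \(\{o\}\cup S\), the lone inflated affinity \(A[0,3]\) cannot compensate for object 3's weak ties to 0's true neighbors, which is exactly the robustness property the summary claims.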

MSC:

68T05 Learning and adaptive systems in artificial intelligence
68T10 Pattern recognition, speech recognition
68R10 Graph theory (including graph drawing) in computer science

References:

[1] Arya, S., Mount, D., Netanyahu, N., Silverman, R., & Wu, A. (1998). An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM, 45(6), 891–923. · Zbl 1065.68650 · doi:10.1145/293347.293348
[2] Asahiro, Y., Hassin, R., & Iwama, K. (2002). Complexity of finding dense subgraphs. Discrete Applied Mathematics, 121(1–3), 15–26. · Zbl 1002.68108 · doi:10.1016/S0166-218X(01)00243-8
[3] Bentley, J. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9), 517–525. · Zbl 0306.68061 · doi:10.1145/361002.361007
[4] Bomze, I. M., & De Klerk, E. (2002). Solving standard quadratic optimization problems via linear, semidefinite and copositive programming. Journal of Global Optimization, 24(2), 163–185. · Zbl 1047.90038 · doi:10.1023/A:1020209017701
[5] Caetano, T., Caelli, T., Schuurmans, D., & Barone, D. (2006). Graphical models and point pattern matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10), 1646–1663. · Zbl 05340989 · doi:10.1109/TPAMI.2006.207
[6] Chan, T. (1998). Approximate nearest neighbor queries revisited. Discrete & Computational Geometry, 20(3), 359–373. · Zbl 0910.68218 · doi:10.1007/PL00009390
[7] Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines.
[8] Chapelle, O., Schölkopf, B., & Zien, A. (2006a). Semi-supervised learning. Cambridge, MA: MIT Press.
[9] Chapelle, O., Schölkopf, B., & Zien, A. (2006b). The benchmark data sets. http://www.kyb.tuebingen.mpg.de/ssl-book/benchmarks.html
[10] Clauset, A., Moore, C., & Newman, M. (2008). Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191), 98–101. · doi:10.1038/nature06830
[11] Connor, M., & Kumar, P. (2010). Fast construction of k-nearest neighbor graphs for point clouds. IEEE Transactions on Visualization and Computer Graphics, 16(4), 599–608. · doi:10.1109/TVCG.2010.9
[12] Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27. · Zbl 0154.44505 · doi:10.1109/TIT.1967.1053964
[13] Crammer, K., Talukdar, P., & Pereira, F. (2008). A rate-distortion one-class model and its applications to clustering. In Proceedings of the 25th international conference on machine learning (pp. 184–191).
[14] Denoeux, T. (2008). A k-nearest neighbor classification rule based on Dempster-Shafer theory. In Classic works of the Dempster-Shafer theory of belief functions (pp. 737–760).
[15] Dhillon, I., Guan, Y., & Kulis, B. (2004). Kernel k-means: spectral clustering and normalized cuts. In International conference on knowledge discovery and data mining (pp. 551–556).
[16] Fleishman, S., Cohen-Or, D., & Silva, C. (2005). Robust moving least-squares fitting with sharp features. In ACM special interest group on graphics and interactive techniques (pp. 552–560).
[17] Frank, A., & Asuncion, A. (2010). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[18] Frey, B., & Dueck, D. (2007). Clustering by passing messages between data points. Science, 315(5814), 972–976. · Zbl 1226.94027 · doi:10.1126/science.1136800
[19] Georgescu, B., & Meer, P. (2004). Point matching under large image deformations and illumination changes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 674–688. · Zbl 05111554 · doi:10.1109/TPAMI.2004.2
[20] Gibson, D., Kumar, R., & Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In Proceedings of the 31st international conference on very large data bases (pp. 732–741).
[21] Goldstein, M. (1972). K-nearest neighbor classification. IEEE Transactions on Information Theory, 18(5), 627–630. · Zbl 0242.62032 · doi:10.1109/TIT.1972.1054888
[22] Gupta, G., & Ghosh, J. (2005). Robust one-class clustering using hybrid global and local search. In Proceedings of the 22nd international conference on machine learning (pp. 273–280).
[23] Han, E., Karypis, G., & Kumar, V. (2001). Text categorization using weight adjusted k-nearest neighbor classification. In Advances in knowledge discovery and data mining (pp. 53–65). · Zbl 0978.68700
[24] Horton, P., & Nakai, K. (1997). Better prediction of protein cellular localization sites with the k nearest neighbors classifier. In International conference on intelligent systems for molecular biology (Vol. 5, pp. 147–152).
[25] Hu, H., Yan, X., Huang, Y., Han, J., & Zhou, X. (2005). Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics, 21, 213–226.
[26] Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th annual ACM symposium on theory of computing (pp. 604–613). · Zbl 1029.68541
[27] Jebara, T., & Shchogolev, V. (2006). B-matching for spectral clustering. In European conference on machine learning (pp. 679–686).
[28] Jebara, T., Wang, J., & Chang, S. (2009). Graph construction and b-matching for semi-supervised learning. In International conference on machine learning (pp. 441–448).
[29] Kolahdouzan, M., & Shahabi, C. (2004). Voronoi-based k nearest neighbor search for spatial network databases. In Proceedings of the international conference on very large data bases (pp. 851–860).
[30] Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., & Protopapas, Z. (1996). Fast nearest neighbor search in medical image databases. In Proceedings of the international conference on very large data bases (pp. 215–226).
[31] Kuhn, H. W., & Tucker, A. W. (1951). Nonlinear programming. In Proceedings of the 2nd Berkeley symposium (pp. 481–492). · Zbl 0044.05903
[32] Leordeanu, M., & Hebert, M. (2005). A spectral technique for correspondence problems using pairwise constraints. In International conference on computer vision (pp. 1482–1489).
[33] Li, L., Weinberg, C., Darden, T., & Pedersen, L. (2001). Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 17(12), 1131–1139. · doi:10.1093/bioinformatics/17.12.1131
[34] Ling, H., & Jacobs, D. (2007). Shape classification using the inner-distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2), 286–299. · Zbl 05340731 · doi:10.1109/TPAMI.2007.41
[35] Motzkin, T., & Straus, E. (1965). Maxima for graphs and a new proof of a theorem of Turán. Canadian Journal of Mathematics, 17(4), 533–540. · Zbl 0129.39902 · doi:10.4153/CJM-1965-053-6
[36] Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: analysis and an algorithm. In Advances in neural information processing systems (pp. 849–856).
[37] Ouyang, Q., Kaplan, P., Liu, S., & Libchaber, A. (1997). DNA solution of the maximal clique problem. Science, 278(5337), 446–449. · doi:10.1126/science.278.5337.446
[38] Pavan, M., & Pelillo, M. (2007). Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 167–172. · Zbl 05340705 · doi:10.1109/TPAMI.2007.250608
[39] Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905. · Zbl 05111961 · doi:10.1109/34.868688
[40] Yu, C., Ooi, B., Tan, K., & Jagadish, H. (2001). Indexing the distance: An efficient method to kNN processing. In Proceedings of the international conference on very large data bases (pp. 421–430).
[41] Zaki, M., Parthasarathy, S., Ogihara, M., & Li, W. (1997). New algorithms for fast discovery of association rules. In International conference on knowledge discovery and data mining (Vol. 20, pp. 283–286).
[42] Zhou, D., Bousquet, O., Lal, T., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. In Advances in neural information processing systems (pp. 595–602).
[43] Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In International conference on machine learning (Vol. 20, pp. 912–919).
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented/enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.