×

The PRIMPING routine – tiling through proximal alternating linearized minimization. (English) Zbl 1409.68233

Summary: Mining and exploring databases should provide users with knowledge and new insights. Tiles of data strive to unveil true underlying structure and distinguish valuable information from various kinds of noise. We propose a novel Boolean matrix factorization algorithm to solve the tiling problem, based on recent results from optimization theory. In contrast to existing work, the new algorithm minimizes the description length of the resulting factorization. This approach is well known for model selection and data compression, but not for finding suitable factorizations via numerical optimization. We demonstrate the superior robustness of the new approach in the presence of several kinds of noise and types of underlying structure. Moreover, our general framework can work with any cost measure having a suitable real-valued relaxation. Thereby, no convexity assumptions have to be met. The experimental results on synthetic data and image data show that the new method identifies interpretable patterns which explain the data almost always better than the competing algorithms.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
15B34 Boolean and Hadamard matrices
65F30 Other matrix algorithms (MSC2010)
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)
PDF BibTeX XML Cite
Full Text: DOI arXiv

References:

[1] Bauckhage C (2015) k-means clustering is matrix factorization. arXiv preprint arXiv:1512.07548
[2] Bolte, J.; Sabach, S.; Teboulle, M., Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Math Program, 146, 459-494, (2014) · Zbl 1297.90125
[3] Cover T, Thomas J (2006) Elements of information theory. Wiley-Interscience, Hoboken · Zbl 1140.94001
[4] Bie, T., Maximum entropy models and subjective interestingness: an application to tiles in binary databases, Data Min Knowl Discov, 23, 407-446, (2011) · Zbl 1235.68065
[5] Ding C, Li T, Peng W, Park H (2006) Orthogonal nonnegative matrix t-factorizations for clustering. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 126-135
[6] Ding CH, He X, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: Proceedings of the SIAM international conference on data mining (SDM), pp 606-610
[7] Geerts, Floris; Goethals, Bart; Mielikäinen, Taneli, Tiling Databases, 278-289, (2004), Berlin, Heidelberg · Zbl 1110.68373
[8] Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge
[9] Hess S, Piatkowski N, Morik K (2014) Shrimp: descriptive patterns in a tree. In: Proceedings of the LWA workshops: KDML, IR and FGWM, pp 181-192
[10] Jarrett K, Kavukcuoglu K, Ranzato M, LeCun Y (2009) What is the best multi-stage architecture for object recognition? In: IEEE international conference on computer in proceedings (ICCV), pp 2146-2153
[11] Karaev, Sanjar; Miettinen, Pauli; Vreeken, Jilles, Getting to Know the Unknown Unknowns: Destructive-Noise Resistant Boolean Matrix Factorization, 325-333, (2015), Philadelphia, PA
[12] Kontonasios, Kleanthis-Nikolaos; De Bie, Tijl, An information-theoretic approach to finding informative noisy tiles in binary databases, 153-164, (2010), Philadelphia, PA
[13] Kuhn, HW, The hungarian method for the assignment problem, Naval Res Logist Q, 2, 83-97, (1955)
[14] Lee, DD; Seung, HS, Learning the parts of objects by non-negative matrix factorization, Nature, 401, 788-791, (1999) · Zbl 1369.68285
[15] Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems (NIPS), pp 556-562
[16] Li PVM (1997) An introduction to kolmogorov complexity and its applications. Springer, Berlin
[17] Li T (2005) A general model for clustering binary data. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery in data mining (KDD), pp 188-197
[18] Li T, Ding C (2006) The relationships among various nonnegative matrix factorization methods for clustering. In: International conference on data mining (ICDM), pp 362-371
[19] Lucchese, Claudio; Orlando, Salvatore; Perego, Raffaele, Mining Top-K Patterns from Binary Datasets in presence of Noise, 165-176, (2010), Philadelphia, PA
[20] Lucchese, C.; Orlando, S.; Perego, R., A unifying framework for mining approximate top-k binary patterns, Trans Knowl Data Eng, 26, 2900-2913, (2014)
[21] Miettinen, Pauli, Generalized Matrix Factorizations as a Unifying Framework for Pattern Set Mining: Complexity Beyond Blocks, 36-52, (2015), Cham
[22] Miettinen, P.; Vreeken, J., Mdl4bmf: minimum description length for boolean matrix factorization, Trans Knowl Discov Data, 8, 18:1-18:31, (2014)
[23] Miettinen, P.; Mielikainen, T.; Gionis, A.; Das, G.; Mannila, H., The discrete basis problem, Trans Knowl Data Eng, 20, 1348-1362, (2008)
[24] Paatero, P.; Tapper, U., Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values, Environmetrics, 5, 111-126, (1994)
[25] Parikh, N.; Boyd, S., Proximal algorithms, Found Trends Optim, 1, 127-239, (2014)
[26] Rissanen, J., Modeling by shortest data description, Automatica, 14, 465-471, (1978) · Zbl 0418.93079
[27] Siebes, Arno; Kersten, René, A Structure Function for Transaction Data, 558-569, (2011), Philadelphia, PA
[28] Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the SIAM international conference on data mining (SDM), pp 393-404 · Zbl 1235.68071
[29] Smets, Koen; Vreeken, Jilles, Slim: Directly Mining Descriptive Patterns, 236-247, (2012), Philadelphia, PA
[30] Tatti, N.; Vreeken, J., Comparing apples and oranges: measuring differences between exploratory data mining results, Data Min Knowl Discov, 25, 173-207, (2012) · Zbl 1260.68347
[31] van Leeuwen M, Siebes A (2008) Streamkrimp: Detecting change in data streams. In: European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 672-687
[32] Vreeken, J.; Leeuwen, M.; Siebes, A., Krimp: mining itemsets that compress, Data Min Knowl Discov, 23, 169-214, (2011) · Zbl 1235.68071
[33] Wang, YX; Zhang, YJ, Nonnegative matrix factorization: a comprehensive review, Trans Knowl Data Eng, 25, 1336-1353, (2013)
[34] Xiang, Y.; Jin, R.; Fuhry, D.; Dragan, FF, Summarizing transactional databases with overlapped hyperrectangles, Data Min Knowl Discov, 23, 215-251, (2011) · Zbl 1235.68068
[35] Zhang Z, Ding C, Li T, Zhang X (2007) Binary matrix factorization with applications. In: International conference on data mining (ICDM), pp 391-400
[36] Zimek, A.; Vreeken, J., The blind men and the elephant: on meeting the problem of multiple truths in data from clustering and pattern mining perspectives, Mach Learn, 98, 121-155, (2013) · Zbl 1321.68413
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.