Classifying negative and positive points by optimal box clustering. (English) Zbl 1358.68240

Summary: In this paper, we address the problem of classifying positive and negative data with the technique known as box clustering. A box is homogeneous if it contains only positive (negative) points. Box clustering means finding a family of homogeneous boxes jointly containing all and only positive (negative) points. We first consider the problem of finding a family with the minimum number of boxes. Then we refine this problem into finding a family which not only consists of the minimum number of boxes but also has points that are covered as many times as possible by the boxes in the family. We call this problem the maximum redundancy problem. We model both problems as set covering problems with column generation. The pricing problem is a Maximum Box problem. Although this problem is NP-hard, there is available in the literature a combinatorial algorithm which performs well. Since the pricing has to be carried out also in the branch-and-bound search of the set covering problem, we also consider how the pricing has to be modified to take care of the branching constraints. The computational results show a good behavior of the set covering approach.


68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T20 Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)
Full Text: DOI


[1] Bennett, K. P.; Mangasarian, O. L., Robust linear programming discrimination of two linearly inseparable sets, Optimization Methods and Software, 1, 23-34, (1992)
[2] Bereg, S.; Cabello, S.; Díaz-Báñez, J. M.; Pérez-Lantero, P.; Seara, C.; Ventura, I., The class cover problem with boxes, Computational Geometry, 45, 294-304, (2012) · Zbl 1239.65015
[3] E. Boros, V. Spinelli, F. Ricca, Incompatibility graphs and data mining, in: Proceedings of the 10th Cologne-Twente Workshop on Graphs and combinatorial optimization, Frascati, June 14-16, 2011, http://ctw2011.dia.uniroma3.it/ctw_proceedings.pdf.
[4] Eckstein, J.; Hammer, P. L.; Liu, Y.; Nediak, M.; Simeone, B., The maximum box problem and its application to data analysis, Computational Optimization and Applications, 23, 285-298, (2002) · Zbl 1028.90039
[5] Gao, B. J.; Ester, M.; Xiong, H.; Cai, J.; Schulte, O., The minimum consistent subset cover problem: a minimization view of data mining, IEEE Transactions on Knowledge and Data Engineering, 99, (2011)
[6] Hammer, P. L.; Liu, Y.; Szedmák, S.; Simeone, B., Saturated systems of homogeneous boxes and the logical analysis of numerical data, Discrete and Applied Mathematics, 144, 103-109, (2004) · Zbl 1078.62503
[7] Mangasarian, O. L., Linear and nonlinear separation of patterns by linear programming, Operations Research, 13, 444-451, (1965) · Zbl 0127.36701
[8] M. Maravalle, F. Ricca, B. Simeone, V. Spinelli, Carpal tunnel syndrome automatic classification: electromyography vs. ultrasound imaging, Dipartimento di Scienze Statistiche, Sapienza, Università di Roma, Serie A - Ricerche, n. 11/2011, 2011. · Zbl 1312.92027
[9] Salzberg, S., A nearest hyperrectangle learning method, Machine Learning, 6, 251-276, (1991)
[10] Wettschereck, D.; Dietterich, T. G., An experimental comparison of the nearest-neighbor and nearest-hyperrectangle algorithms, Machine Learning, 19, 5-27, (1995)
[11] W.H. Wolberg, Wisconsin Diagnostic Breast Cancer (WDBC), June 5, 2012, http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.