Selection-fusion approach for classification of datasets with missing values.

*(English)* Zbl 1191.68573

Summary: This paper proposes a new approach, based on missing-value pattern discovery, for classifying incomplete data. The approach is designed in particular for datasets with a small number of samples and a high percentage of missing values, where available missing-value treatment approaches do not usually work well. Based on the pattern of the missing values, the proposed approach finds subsets of samples for which most of the features are available and trains a classifier for each subset. It then combines the outputs of the classifiers. Subset selection is cast as a clustering problem, which allows a mathematical framework to be derived for it. A trade-off arises between computational complexity (the number of subsets) and the accuracy of the overall classifier; to manage this trade-off, a numerical criterion is proposed for predicting the overall performance. The proposed method is applied to seven datasets from the popular University of California, Irvine data mining archive and to an epilepsy dataset from Henry Ford Hospital, Detroit, Michigan (eight datasets in total). Experimental results show that the classification accuracy of the proposed method is superior to that of the widely used multiple imputation method and of four other methods. They also show that the degree of superiority depends on the pattern and percentage of missing values.
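The selection-fusion idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it groups samples by their exact missing-value pattern (rather than clustering patterns as in the paper), trains a simple nearest-centroid classifier on each fully observed feature subset, and fuses the subset classifiers by majority vote. All function names are hypothetical.

```python
# Sketch of a selection-fusion scheme (hypothetical names, simplified):
# group samples by missing-value pattern, train one nearest-centroid
# classifier per group on its observed features, fuse by majority vote.
import numpy as np

def fit_subset_classifiers(X, y):
    """Group rows of X by missingness pattern; for each group keep only
    the observed columns and store one centroid per class label."""
    observed = ~np.isnan(X)                        # True where a value is present
    models = []
    for pat in np.unique(observed, axis=0):
        rows = np.all(observed == pat, axis=1)     # samples sharing this pattern
        Xs, ys = X[np.ix_(rows, pat)], y[rows]     # restrict to observed features
        centroids = {c: Xs[ys == c].mean(axis=0) for c in np.unique(ys)}
        models.append((pat, centroids))
    return models

def predict_fused(models, x):
    """Each subset classifier votes using the features it was trained on
    that are also present in x; the majority label wins."""
    votes = []
    for cols, centroids in models:
        usable = cols & ~np.isnan(x)               # trained-on AND present in x
        if not usable.any():
            continue                               # this classifier abstains
        dists = {c: np.linalg.norm(x[usable] - mu[usable[cols]])
                 for c, mu in centroids.items()}
        votes.append(min(dists, key=dists.get))    # nearest-centroid decision
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]
```

A classifier abstains when none of its training features are observed in the query, so the fused decision degrades gracefully as the missing-value percentage grows; this mirrors, in simplified form, the complexity/accuracy trade-off discussed in the summary.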

##### MSC:

68T10 | Pattern recognition, speech recognition |

68T05 | Learning and adaptive systems in artificial intelligence |

##### Keywords:

missing value management; subspace classifiers; ensemble classifiers; multiple imputations; pruning; support vector machine

\textit{M. Ghannad-Rezaie} et al., Pattern Recognition 43, No. 6, 2340--2350 (2010; Zbl 1191.68573)

