From predictive methods to missing data imputation: an optimization approach.

*(English)*Zbl 06982952Summary: Missing data is a common problem in real-world settings and for this reason has attracted significant attention in the statistical literature. We propose a flexible framework based on formal optimization to impute missing data with mixed continuous and categorical variables. This framework can readily incorporate various predictive models including \(K\)-nearest neighbors, support vector machines, and decision tree based methods, and can be adapted for multiple imputation. We derive fast first-order methods that obtain high quality solutions in seconds following a general imputation algorithm opt.impute presented in this paper. We demonstrate that our proposed method improves out-of-sample accuracy in large-scale computational experiments across a sample of 84 data sets taken from the UCI Machine Learning Repository. In all scenarios of missing at random mechanisms and various missing percentages, opt.impute produces the best overall imputation in most data sets benchmarked against five other
methods:
mean impute, \(K\)-nearest neighbors, iterative knn, Bayesian PCA, and predictive-mean matching, with an average reduction in mean absolute error of 8.3% against the best cross-validated benchmark method. Moreover, opt.impute leads to improved out-of-sample performance of learning algorithms trained using the imputed data, demonstrated by computational experiments on 10 downstream tasks. For models trained using opt.impute single imputations with 50% data missing, the average out-of-sample \(R^2\) is 0.339 in the regression tasks and the average out-of-sample accuracy is 86.1% in the classification tasks, compared to 0.315 and 84.4% for the best cross-validated benchmark method. In the multiple imputation setting, downstream models trained using opt.impute obtain a statistically significant improvement over models trained using multivariate imputation by chained equations (mice) in 8/10 missing data scenarios considered.

##### MSC:

68T05 | Learning and adaptive systems in artificial intelligence |

PDF
BibTeX
Cite

\textit{D. Bertsimas} et al., J. Mach. Learn. Res. 18, Paper No. 196, 39 p. (2018; Zbl 06982952)

Full Text:
Link

##### References:

[1] | Dimitri P Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999. · Zbl 1015.90077 |

[2] | Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, pages 1–44, 2017. · Zbl 1455.68159 |

[3] | Dimitris Bertsimas and Rahul Mazumder. Least quantile regression via modern optimization. Annals of Statistics, 42(6):2494–2525, 2014. · Zbl 1302.62154 |

[4] | Dimitris Bertsimas and Bart Van Parys. Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. arXiv preprint arXiv:1709.10029, 2017. |

[5] | Dimitris Bertsimas, Jack Dunn, Colin Pawlowski, and Daisy Zhuo. Robust classification. Submitted for publication, 2017. |

[6] | Trond Hellem Bø, Bjarte Dysvik, and Inge Jonassen. LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Research, 32(3):e34–e34, 2004. |

[7] | L´ıgia P Br´as and Jos´e C Menezes. Improving cluster-based missing value estimation of DNA microarray data. Biomolecular Engineering, 24(2):273–282, 2007. |

[8] | Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and regression trees. CRC press, 1984. · Zbl 0541.62042 |

[9] | Lane F Burgette and Jerome P Reiter. Multiple imputation for missing data via sequential regression trees. American Journal of Epidemiology, 172:1070–1076, 2010. |

[10] | Stef Buuren and Karin Groothuis-Oudshoorn. MICE: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 2011. |

[11] | Zhipeng Cai, Maysam Heydari, and Guohui Lin. Iterated local least squares microarray missing value imputation. Journal of Bioinformatics and Computational Biology, 4(05): 935–957, 2006. |

[12] | Rich Caruana.A non-parametric EM-style algorithm for imputing missing values.In AISTATS, 2001. |

[13] | Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3): 273–297, 1995. · Zbl 0831.68098 |

[14] | Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernelbased vector machines. Journal of Machine Learning Research, 2(Dec):265–292, 2001. · Zbl 1037.68110 |

[15] | Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (methodological), pages 1–38, 1977. · Zbl 0364.62022 |

[16] | Zoubin Ghahramani and Michael I Jordan. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems, pages 120–127, 1994. 37 |

[17] | James Honaker, Gary King, Matthew Blackwell, et al. Amelia II: A program for missing data. Journal of Statistical Software, 45(7):1–47, 2011. |

[18] | Mohammad E Khan, Guillaume Bouchard, Kevin P Murphy, and Benjamin M Marlin. Variational bounds for mixed-data factor analysis. In Advances in Neural Information Processing Systems, pages 1108–1116, 2010. |

[19] | Hyunsoo Kim, Gene H Golub, and Haesun Park. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics, 21(2): 187–198, 2005. |

[20] | Ki-Yeol Kim, Byoung-Jin Kim, and Gwan-Su Yi. Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics, 5(1):1, 2004. |

[21] | Alan Wee-Chung Liew, Ngai-Fong Law, and Hong Yan. Missing value imputation for gene expression data: computational techniques to recover missing data from available information. Briefings in Bioinformatics, 12(5):498–513, 2011. |

[22] | Roderick JA Little and Donald B Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, 1987. · Zbl 0665.62004 |

[23] | Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug): 2287–2322, 2010. · Zbl 1242.68237 |

[24] | Shakir Mohamed, Zoubin Ghahramani, and Katherine A Heller.Bayesian exponential family PCA. In Advances in Neural Information Processing Systems, pages 1089–1096, 2009. |

[25] | Shigeyuki Oba, Masa-aki Sato, Ichiro Takemasa, Morito Monden, Ken-ichi Matsubara, and Shin Ishii. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19(16):2088–2096, 2003. · Zbl 1013.68788 |

[26] | Trivellore E Raghunathan, James M Lepkowski, John Van Hoewyk, and Peter Solenberger. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology, 27(1):85–96, 2001. |

[27] | Daniel J Stekhoven and Peter B¨uhlmann. Missforest: non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118, 2012. |

[28] | Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001. |

[29] | Stef Van Buuren. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3):219–242, 2007. · Zbl 1122.62382 |

[30] | Kiri Wagstaff. Clustering with missing values: No imputation required. In Classification, Clustering, and Data Mining Applications, pages 649–658. Springer, 2004. 38 |

[31] | Xian Wang, Ao Li, Zhaohui Jiang, and Huanqing Feng. Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme. BMC Bioinformatics, 7(1):1, 2006. |

[32] | Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1): 3–34, 2015. · Zbl 1317.49038 |

[33] | Xiaobai Zhang, Xiaofeng Song, Huinan Wang, and Huanping Zhang. Sequential local least squares imputation estimating missing value of microarray data. Computers in Biology and Medicine, 38(10):1112–1120, 2008. |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.