
DataWig: missing value imputation for tables. (English) Zbl 1436.62051

Summary: With the growing importance of machine learning (ML) algorithms for practical applications, reducing data quality problems in ML pipelines has become a major focus of research. In many cases, missing values can break data pipelines, which makes completeness one of the most impactful data quality challenges. Current missing value imputation methods focus on numerical or categorical data and can be difficult to scale to datasets with millions of rows. We release DataWig, a robust and scalable approach for missing value imputation that can be applied to tables with heterogeneous data types, including unstructured text. DataWig combines deep learning feature extractors with automatic hyperparameter tuning. This enables users without a machine learning background, such as data engineers, to impute missing values with minimal effort in tables with more heterogeneous data types than existing libraries support, while requiring less glue code for feature engineering and offering more flexible modelling options. We demonstrate that DataWig compares favourably to existing imputation packages. Source code, documentation, and unit tests for this package are available at https://github.com/awslabs/datawig.
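
To make the task in the summary concrete, the following is a minimal sketch of imputing a missing categorical cell from an unstructured text column. It is not DataWig's implementation or API: the hashed character n-gram featurizer and the nearest-centroid classifier stand in for the deep learning feature extractors, hyperparameter tuning is omitted, and every name in it (`featurize`, `fit_centroids`, `impute`, the `title` and `category` columns) is hypothetical and chosen for illustration.

```python
import zlib


def featurize(text, n=3, dim=1024):
    """Hash the character n-grams of `text` into a count vector of length `dim`."""
    vec = [0.0] * dim
    padded = f" {text.lower()} "
    for i in range(len(padded) - n + 1):
        bucket = zlib.crc32(padded[i:i + n].encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm if norm else 0.0


def fit_centroids(rows, text_col, target_col):
    """Average the feature vectors of the complete rows, one centroid per label."""
    grouped = {}
    for row in rows:
        if row[target_col] is not None:
            grouped.setdefault(row[target_col], []).append(featurize(row[text_col]))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def impute(rows, text_col, target_col):
    """Return a copy of `rows` with missing `target_col` cells filled in."""
    centroids = fit_centroids(rows, text_col, target_col)
    imputed = []
    for row in rows:
        if row[target_col] is None and centroids:
            vec = featurize(row[text_col])
            best = max(centroids, key=lambda label: cosine(vec, centroids[label]))
            row = {**row, target_col: best}  # copy, so the input row is not mutated
        imputed.append(row)
    return imputed


# Toy table (made-up data): the last row is missing its category.
rows = [
    {"title": "red cotton t-shirt", "category": "clothing"},
    {"title": "blue cotton shirt", "category": "clothing"},
    {"title": "usb charging cable", "category": "electronics"},
    {"title": "usb wall charger", "category": "electronics"},
    {"title": "green cotton shirt", "category": None},
]
imputed = impute(rows, text_col="title", target_col="category")
print(imputed[-1])
```

The sketch trains only on rows where the target column is observed and predicts the rest, which is the basic supervised framing of imputation; a learned model and a tuning loop would replace the centroid step in a realistic setting.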

MSC:

62D10 Missing data
68T05 Learning and adaptive systems in artificial intelligence
