DataWig

swMATH ID: 33871
Software Authors: Biessmann, Felix; Rukat, Tammo; Schmidt, Phillipp; Naidu, Prathik; Schelter, Sebastian; Taptunov, Andrey; Lange, Dustin; Salinas, David
Description: DataWig: missing value imputation for tables. With the growing importance of machine learning (ML) algorithms for practical applications, reducing data quality problems in ML pipelines has become a major focus of research. In many cases missing values can break data pipelines which makes completeness one of the most impactful data quality challenges. Current missing value imputation methods are focusing on numerical or categorical data and can be difficult to scale to datasets with millions of rows. We release DataWig, a robust and scalable approach for missing value imputation that can be applied to tables with heterogeneous data types, including unstructured text. DataWig combines deep learning feature extractors with automatic hyperparameter tuning. This enables users without a machine learning background, such as data engineers, to impute missing values with minimal effort in tables with more heterogeneous data types than supported in existing libraries, while requiring less glue code for feature engineering and offering more flexible modelling options. We demonstrate that DataWig compares favourably to existing imputation packages. Source code, documentation, and unit tests for this package are available at: github.com/awslabs/datawig
Homepage: https://datawig.readthedocs.io/en/latest/
Source Code: https://github.com/awslabs/datawig
Keywords: missing value imputation; deep learning; heterogeneous data
Related Software: GitHub; GAIN; missForest
Cited in: 1 Publication

Standard Articles
1 Publication describing the Software, including 1 Publication in zbMATH

DataWig: missing value imputation for tables. Zbl 1436.62051
Biessmann, Felix; Rukat, Tammo; Schmidt, Phillipp; Naidu, Prathik; Schelter, Sebastian; Taptunov, Andrey; Lange, Dustin; Salinas, David
Year: 2019

Cited by 8 Authors
1 Biessmann, Felix
1 Lange, Dustin
1 Naidu, Prathik
1 Rukat, Tammo
1 Salinas, David
1 Schelter, Sebastian
1 Schmidt, Phillipp
1 Taptunov, Andrey

Cited in 1 Serial
1 Journal of Machine Learning Research (JMLR)

Cited in 2 Fields
1 Statistics (62-XX)
1 Computer science (68-XX)
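
Illustrative example (not part of the swMATH record): the description above concerns filling a missing value in one column of a table from the other columns, including free-text columns. The sketch below shows that task on a toy table using only the Python standard library. It is not DataWig's implementation and does not use DataWig's API: DataWig relies on deep learning feature extractors and automatic hyperparameter tuning, whereas this sketch swaps in plain character n-gram counts and a nearest-centroid rule. The function names (`char_ngrams`, `fit_centroids`, `impute`), the column names, and the sample rows are all hypothetical.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def fit_centroids(rows, text_col, label_col):
    """Sum n-gram counts per observed label; rows with a missing label are skipped."""
    centroids = defaultdict(Counter)
    for row in rows:
        if row[label_col] is not None:
            centroids[row[label_col]].update(char_ngrams(row[text_col]))
    return centroids


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0


def impute(rows, text_col, label_col):
    """Return copies of the rows with missing labels filled by the most similar centroid."""
    centroids = fit_centroids(rows, text_col, label_col)
    completed = []
    for row in rows:
        row = dict(row)  # copy so the input table is left unchanged
        if row[label_col] is None and centroids:
            feats = char_ngrams(row[text_col])
            row[label_col] = max(centroids, key=lambda c: cosine(feats, centroids[c]))
        completed.append(row)
    return completed


# Hypothetical toy table: a free-text column and a categorical column with one gap.
table = [
    {"title": "red cotton t-shirt", "category": "clothing"},
    {"title": "blue denim jeans", "category": "clothing"},
    {"title": "usb-c charging cable", "category": "electronics"},
    {"title": "wireless usb mouse", "category": "electronics"},
    {"title": "green cotton shirt", "category": None},
]
completed = impute(table, "title", "category")
print(completed[-1]["category"])
```

The toy rule only handles one text input column and one categorical output column; the heterogeneous-column support and scalability claimed in the description are exactly what this sketch leaves out.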