How to lie with bad data. (English) Zbl 1100.62533

Summary: As Huff’s landmark book made clear, lying with statistics can be accomplished in many ways. Distorting graphics, manipulating data or using biased samples are just a few of the tried and true methods. Failing to use the correct statistical procedure or failing to check the conditions for when the selected method is appropriate can distort results as well, whether the motives of the analyst are honorable or not. Even when the statistical procedure and motives are correct, bad data can produce results that have no validity at all. This article provides some examples of how bad data can arise, what kinds of bad data exist, how to detect and measure bad data, and how to improve the quality of data that have already been collected.


62A01 Foundations and philosophical topics in statistics
62-07 Data analysis (statistics) (MSC2010)


[1] Baggerly, K. A., Morris, J. S. and Coombes, K. R. (2004). Reproducibility of SELDI-TOF protein patterns in serum: Comparing datasets from different experiments. Bioinformatics 20 777–785.
[2] Brunskill, A. J. (1990). Some sources of error in the coding of birth weight. American J. Public Health 80 72–73.
[3] Check, E. (2004). Proteomics and cancer: Running before we can walk? Nature 429 496–497.
[4] Coale, A. J. and Stephan, F. F. (1962). The case of the Indians and the teen-age widows. J. Amer. Statist. Assoc. 57 338–347.
[5] De Veaux, R. D. (2002). Data mining: A view from down in the pit. Stats (34) 3–9.
[6] De Veaux, R. D., Donahue, R. and Small, R. D. (2002). Using data mining techniques to harvest information in clinical trials. Presentation at Joint Statistical Meetings, New York.
[7] De Veaux, R. D., Gordon, A., Comiso, J. and Bacherer, N. E. (1993). Modeling of topographic effects on Antarctic sea-ice using multivariate adaptive regression splines. J. Geophysical Research: Oceans 98 20,307–20,320.
[8] Hand, D. J. (2001). Reject inference in credit operations. In Handbook of Credit Scoring (E. Mays, ed.) 225–240. Glenlake Publishing, Chicago.
[9] Hand, D. J. (2004a). Academic obsessions and classification realities: Ignoring practicalities in supervised classification. In Classification, Clustering and Data Mining Applications (D. Banks, L. House, F. R. McMorris, P. Arabie and W. Gaul, eds.) 209–232. Springer, Berlin.
[10] Hand, D. J. (2004b). Measurement Theory and Practice: The World Through Quantification. Arnold, London. · Zbl 1057.91068
[11] Hand, D. J., Blunt, G., Kelly, M. G. and Adams, N. M. (2000). Data mining for fun and profit (with discussion). Statist. Sci. 15 111–131.
[12] Hand, D. J. and Henley, W. E. (1993). Can reject inference ever work? IMA J. of Mathematics Applied in Business and Industry 5 (4) 45–55.
[13] Huff, D. (1954). How to Lie with Statistics. Norton, New York.
[14] Jones, P. D. and Wigley, T. M. L. (1990). Global warming trends. Scientific American 263 (2) 84–91.
[15] Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K. and Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery 7 81–99. · Zbl 05660812
[16] Klein, B. D. (1998). Data quality in the practice of consumer product management: Evidence from the field. Data Quality 4 (1).
[17] Kruskal, W. (1981). Statistics in society: Problems unsolved and unformulated. J. Amer. Statist. Assoc. 76 505–515. · Zbl 0478.62090
[18] Laudon, K. C. (1986). Data quality and due process in large interorganizational record systems. Communications of the ACM 29 4–11.
[19] Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York. · Zbl 0665.62004
[20] Loshin, D. (2001). Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann, San Francisco.
[21] Madnick, S. E. and Wang, R. Y. (1992). Introduction to the TDQM research program. Working Paper 92-01, Total Data Quality Management Research Program.
[22] Morey, R. C. (1982). Estimating and improving the quality of information in a MIS. Communications of the ACM 25 337–342.
[23] Percy, T. (1986). My data, right or wrong. Datamation 32 (11) 123–124.
[24] Petricoin, E. F., III, Ardekani, A. M., Hitt, B. A., Levine, P. J., Fusaro, V. A., Steinberg, S. M., Mills, G. B., Simone, C., Fishman, D. A., Kohn, E. C. and Liotta, L. A. (2002). Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 359 572–577.
[25] Pierce, E. (1997). Modeling database error rates. Data Quality 3 (1). Available at www.dataquality.com/dqsep97.htm.
[26] PricewaterhouseCoopers (2004). The Tech Spotlight 22. Available at www.pwc.com/extweb/manissue.nsf/docid/2D6E2F57E06E022F85256B8F006F389A.
[27] Redman, T. C. (1992). Data Quality: Management and Technology. Bantam, New York.
[28] Strayhorn, J. M. (1990). Estimating the errors remaining in a data set: Techniques for quality control. Amer. Statist. 44 14–18.
[29] Wainer, H. (2004). Curbstoning IQ and the 2000 presidential election. Chance 17 (4) 43–46.
[30] West, M. and Winkler, R. L. (1991). Data base error trapping and prediction. J. Amer. Statist. Assoc. 86 987–996.
[31] Willenborg, L. and de Waal, T. (2001). Elements of Statistical Disclosure Control. Springer, New York. · Zbl 0973.62009
[32] Wolins, L. (1962). Responsibility for raw data. American Psychologist 17 657–658.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases the data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.