Data science vs. statistics: two cultures? (English) Zbl 1430.62017

Summary: Data science is the business of learning from data, which is traditionally the business of statistics. Data science, however, is often understood as a broader, task-driven and computationally oriented version of statistics. Both the term "data science" and the broader idea it conveys have origins in statistics and are a reaction to a narrower view of data analysis. Expanding upon the views of a number of statisticians, this paper encourages a big-tent view of data analysis. We examine how evolving approaches to modern data analysis relate to the existing discipline of statistics (e.g. exploratory analysis, machine learning, reproducibility, computation, communication and the role of theory). Finally, we discuss what these trends mean for the future of statistics by highlighting promising directions for communication, education and research.

MSC:

62A01 Foundations and philosophical topics in statistics
62R07 Statistical aspects of big data and data science
68T05 Learning and adaptive systems in artificial intelligence
62G35 Nonparametric robustness
62-08 Computational methods for problems pertaining to statistics

References:

[1] Alivisatos, P. (2017). STEM and computer science education: Preparing the 21st century workforce. Research and Technology Subcommittee House Committee on Science, Space, and Technology.
[2] Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired Magazine, 16(7), 16-07.
[3] Aravkin, A., & Davis, D. (2016). A smart stochastic algorithm for nonconvex optimization with applications to robust machine learning. arXiv preprint arXiv:1610.01101.
[4] American Statistical Association, et al. (2014). Curriculum guidelines for undergraduate programs in statistical science. Retrieved March 3, 2009, from http://www.amstat.org/education/curriculumguidelines.cfm.
[5] Barnes, N. (2010). Publish your computer code: It is good enough. Nature News, 467(7317), 753-753.
[6] Barocas, S., Boyd, D., Friedler, S., & Wallach, H. (2017). Social and technical trade-offs in data science.
[7] Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828.
[8] Bhardwaj, A. (2017). What is the difference between data science and statistics? https://priceonomics.com/whats-the-difference-between-data-science-and/.
[9] Blei, D. M., & Smyth, P. (2017). Science and data science. Proceedings of the National Academy of Sciences, 114(33), 8689-8692.
[10] Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In: Advances in Neural Information Processing Systems (pp. 4349-4357).
[11] Bottou, L., Curtis, F. E., & Nocedal, J. (2016). Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838. · Zbl 1397.65085
[12] Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199-231. · Zbl 1059.62505
[13] Buckheit, J. B., & Donoho, D. L. (1995). Wavelab and reproducible research. In: Wavelets and statistics (pp. 55-81), Springer. · Zbl 0828.62001
[14] Bühlmann, P., & van de Geer, S. (2018). Statistics for big data: A perspective. Statistics and Probability Letters. · Zbl 1489.62407
[15] Bühlmann, P., & Meinshausen, N. (2016). Magging: maximin aggregation for inhomogeneous large-scale data. Proceedings of the IEEE, 104(1), 126-135.
[16] Bühlmann, P., & Stuart, A. M. (2016). Mathematics, statistics and data science. EMS Newsletter, 100, 28-30.
[17] Chambers, J. M. (1993). Greater or lesser statistics: A choice for future research. Statistics and Computing, 3(4), 182-184.
[18] Cleveland, W. S. (2001). Data science: an action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1), 21-26. · Zbl 1213.62003
[19] Conway, D. (2010). The data science Venn diagram.
[20] Crawford, K. (2017). The trouble with bias. Conference on Neural Information Processing Systems, invited speaker.
[21] De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., et al. (2017). Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and Its Application, 4, 15-30.
[22] Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), 745-766.
[23] Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
[24] Efron, B., & Hastie, T. (2016). Computer age statistical inference (Vol. 5). Cambridge: Cambridge University Press. · Zbl 1377.62004
[25] Eick, S. G., Graves, T. L., Karr, A. F., Marron, J., & Mockus, A. (2001). Does code decay? Assessing the evidence from change management data. IEEE Transactions on Software Engineering, 27(1), 1-12.
[26] Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery and data mining (Vol. 21). Menlo Park: AAAI press.
[27] Felder, R. M., & Brent, R. (2016). Teaching and learning STEM: A practical guide. Hoboken: Wiley.
[28] Freitas, A. A. (2014). Comprehensible classification models: A position paper. ACM SIGKDD Explorations Newsletter, 15(1), 1-10.
[29] Gentleman, R., Carey, V., Huber, W., Irizarry, R., & Dudoit, S. (2006). Bioinformatics and computational biology solutions using R and Bioconductor. Berlin: Springer. · Zbl 1142.62100
[30] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Book in preparation for MIT Press. http://www.deeplearningbook.org. · Zbl 1373.68009
[31] Graves, T. L., Karr, A. F., Marron, J., & Siy, H. (2000). Predicting fault incidence using software change history. IEEE Transactions on Software Engineering, 26(7), 653-661.
[32] Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (2011). Robust statistics: the approach based on influence functions (Vol. 114). Hoboken: Wiley. · Zbl 0593.62027
[33] Hand, D. J. (2006). Classifier technology and the illusion of progress. Statistical Science, 21(1), 1-14. · Zbl 1426.62188
[34] Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B., Hall-Holt, O., et al. (2015). Data science in statistics curricula: Preparing students to “think with data”. The American Statistician, 69(4), 343-353. · Zbl 07671752
[35] Hicks, S. C., & Irizarry, R. A. (2017). A guide to teaching data science. The American Statistician (just-accepted). · Zbl 07663965
[36] Hooker, G., & Hooker, C. (2017). Machine learning and the future of realism. arXiv preprint arXiv:1704.04688. · Zbl 1384.62085
[37] Huber, P. J. (2011). Robust statistics. In: International Encyclopedia of Statistical Science (pp. 1248-1251). Springer.
[38] Doumont, J.-L. (2009). Trees, maps, and theorems. Brussels: Principiae.
[39] Kiar, G., Bridgeford, E., Chandrashekhar, V., Mhembere, D., Burns, R., Roncal, W. G., et al. (2017). A comprehensive cloud framework for accurate and reliable human connectome estimation and meganalysis. bioRxiv, 188706.
[40] Knuth, D. E. (1984). Literate programming. The Computer Journal, 27(2), 97-111. · Zbl 0533.68005
[41] Kross, S., Peng, R. D., Caffo, B. S., Gooding, I., & Leek, J. T. (2017). The democratization of data science education. PeerJ PrePrints. · Zbl 07593649
[42] Leek, J. T., & Peng, R. D. (2015). Opinion: Reproducible research can still be wrong: Adopting a prevention approach. Proceedings of the National Academy of Sciences, 112(6), 1645-1646.
[43] Lipton, Z. C. (2016). The mythos of model interpretability. arXiv preprint arXiv:1606.03490.
[44] Lu, X., Marron, J., & Haaland, P. (2014). Object-oriented data analysis of cell images. Journal of the American Statistical Association, 109(506), 548-559.
[45] Maronna, R., Martin, R. D., & Yohai, V. (2006). Robust statistics (Vol. 1). Chichester: Wiley. · Zbl 1094.62040
[46] Marron, J. (1999). Effective writing in mathematical statistics. Statistica Neerlandica, 53(1), 68-75. · Zbl 1069.62501
[47] Marron, J. (2017). Big data in context and robustness against heterogeneity. Econometrics and Statistics, 2, 73-80.
[48] Marron, J., & Alonso, A. M. (2014). Overview of object oriented data analysis. Biometrical Journal, 56(5), 732-753. · Zbl 1309.62008
[49] Members, R. P. (2017). The R project for statistical computing. https://www.r-project.org/.
[50] Naur, P. (1974). Concise survey of computer methods. · Zbl 0331.68001
[51] Cancer Genome Atlas Network, et al. (2012). Comprehensive molecular characterization of human colon and rectal cancer. Nature, 487(7407), 330-337.
[52] Nolan, D., & Temple Lang, D. (2010). Computing in the statistics curricula. The American Statistician, 64(2), 97-107. · Zbl 1205.00052
[53] O’Neil, C. (2017). Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books. · Zbl 1441.00001
[54] Patil, D. (2011). Building data science teams. O’Reilly Media, Inc.
[55] Patil, P., Peng, R. D., & Leek, J. (2016). A statistical definition for reproducibility and replicability. bioRxiv, 066803.
[56] Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226-1227.
[57] Perez, F., & Granger, B. E. (2015). Project Jupyter: Computational narratives as the engine of collaborative data science. Technical report, Project Jupyter.
[58] Pizer, S. M., & Marron, J. S. (2017). Object statistics on curved manifolds (pp. 137-164).
[59] Reid, N. (2018). Statistical science in the world of big data. Statistics and Probability Letters. · Zbl 1489.62415
[60] Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). Why should I trust you? Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144). ACM.
[61] Russell, S., & Norvig, P. (2009). Artificial intelligence: A modern approach. Englewood Cliffs: Prentice-Hall. · Zbl 0835.68093
[62] Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285.
[63] Smith, M. T., Zwiessele, M., & Lawrence, N. D. (2016). Differentially private Gaussian processes. arXiv preprint arXiv:1606.00720.
[64] Sonnenburg, S., Braun, M. L., Ong, C. S., Bengio, S., Bottou, L., Holmes, G., et al. (2007). The need for open source software in machine learning. Journal of Machine Learning Research, 8(oct), 2443-2466.
[65] Staudte, R. G., & Sheather, S. J. (2011). Robust estimation and testing (Vol. 918). Hoboken: Wiley. · Zbl 0706.62037
[66] Stodden, V. (2012). Reproducible research for scientific computing: Tools and strategies for changing the culture. Computing in Science and Engineering, 14(4), 13-17.
[67] Tao, T. (2007). What is good mathematics? Bulletin of the American Mathematical Society, 44(4), 623-634. · Zbl 1132.00303
[68] Tukey, J. W. (1962). The future of data analysis. The Annals of Mathematical Statistics, 33(1), 1-67. · Zbl 0107.36401
[69] Wang, H., & Marron, J. (2007). Object oriented data analysis: Sets of trees. The Annals of Statistics, 1849-1873. · Zbl 1126.62002
[70] Wasserman, L. (2014). Rise of the machines (pp. 525-536).
[71] Wickham, H. (2015). R packages: Organize, test, document, and share your code. O’Reilly Media, Inc.
[72] Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., Guy, R. T., et al. (2014). Best practices for scientific computing. PLoS Biology, 12(1), e1001745.
[73] Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLoS Computational Biology, 13(6), e1005510.
[74] Wu, C. (1998). Statistics = data science? http://www2.isye.gatech.edu/ jeffwu/presentations/datascience.pdf.
[75] Xie, Y. (2015). Dynamic Documents with R and knitr (Vol. 29). Boca Raton: CRC Press.
[76] Yu, B. (2014). IMS presidential address: Let us own data science. http://bulletin.imstat.org/2014/10/ims-presidential-address-let-us-own-data-science/.
[77] Zarsky, T. (2016). The trouble with algorithmic decisions: An analytic road map to examine efficiency and fairness in automated and opaque decision making. Science, Technology, and Human Values, 41(1), 118-132.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.