Data mining in electronic commerce. (English) Zbl 1426.62366

Summary: Modern business is rushing toward e-commerce. If the transition is done properly, it enables better management, new services, lower transaction costs and better customer relations. Success depends on skilled information technologists, among whom are statisticians. This paper focuses on some of the contributions that statisticians are making to help change the business world, especially through the development and application of data mining methods. This is a very large area, and the topics we cover are chosen to avoid overlap with other papers in this special issue, as well as to respect the limitations of our expertise. Inevitably, electronic commerce has raised and is raising fresh research problems in a very wide range of statistical areas, and we try to emphasize those challenges.


62P20 Applications of statistics to economics


[1] Allen, G. N., Burk, D. L. and Davis, G. B. (2006). Academic data collection in electronic environments: Defining acceptable use of Internet resources. MIS Quarterly 30 (3).
[2] Ball, P. (2003). Using multiple system estimation to assess the magnitude and pattern of political killings in Guatemala and Kosovo. Bull. Internat. Statist. Inst., 54th session.
[3] Banks, D., Over, P. and Zhang, N.-F. (1999). Blind men and elephants: Six approaches to TREC data. Information Retrieval 1 7–34.
[4] Bapna, R., Goes, P., Gopal, R. and Marsden, J. (2006). Moving from data-constrained to data-enabled research: Experiences and challenges in collecting, validating and analyzing large-scale e-commerce data. Statist. Sci. 21 116–130. · Zbl 1426.62367
[5] Bickel, P. J. and Levina, E. (2004). Some theory of Fisher’s linear discriminant function, ‘naive Bayes,’ and some alternatives when there are many more variables than observations. Bernoulli 10 989–1010. · Zbl 1064.62073
[6] Blum, L., Blum, M. and Shub, M. (1986). A simple unpredictable pseudorandom number generator. SIAM J. Comput. 15 364–383. · Zbl 0602.65002
[7] Bradlow, E. T. and Schmittlein, D. C. (2000). The little engines that could: Modeling the performance of the World Wide Web search engines. Marketing Sci. 19 43–62.
[8] Chatterjee, P., Hoffman, D. L. and Novak, T. (2003). Modeling the clickstream: Implications for web-based advertising efforts. Marketing Sci. 22 520–541.
[9] Clyde, M. and George, E. I. (2004). Model uncertainty. Statist. Sci. 19 81–94. · Zbl 1062.62044
[10] Dobra, A. and Fienberg, S. E. (2003). How large is the World Wide Web? In Web Dynamics (M. Levene and A. Poulovassilis, eds.) 23–44. Springer, New York.
[11] Donoho, D. L. and Huber, P. J. (1983). The notion of breakdown point. In A Festschrift for Erich L. Lehmann (P. Bickel, K. Doksum and J. Hodges, eds.) 157–184. Wadsworth, Belmont, CA. · Zbl 0523.62032
[12] Dumais, S. (1991). Improving the retrieval of information from external sources. Behavior Research Methods, Instruments, and Computers 23 229–236.
[13] Fellegi, I. P. and Sunter, A. B. (1969). A theory for record linkage. J. Amer. Statist. Assoc. 64 1183–1210. · Zbl 0186.53903
[14] Fienberg, S. E. (2006). Privacy and confidentiality in an e-commerce world: Data mining, data warehousing, matching and disclosure limitation. Statist. Sci. 21 143–154. · Zbl 1426.68077
[15] Friedman, J. H. and Popescu, B. E. (2005). Predictive learning via rule ensembles. Available at stat.stanford.edu/~jhf/#selected. · Zbl 1149.62051
[16] Ghose, A. and Sundararajan, A. (2006). Evaluating pricing strategy using e-commerce data: Evidence and estimation challenges. Statist. Sci. 21 131–142. · Zbl 1426.62371
[17] Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika 40 237–264. · Zbl 0051.37103
[18] Hand, D. and Yu, K. (2001). Idiot’s Bayes—Not so stupid after all. Internat. Statist. Rev. 69 385–398. · Zbl 1213.62010
[19] Harman, D. K., ed. (1994). The Second Text Retrieval Conference (TREC-2). National Institute of Standards and Technology (NIST special publication 500-215), Gaithersburg, MD.
[20] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Springer, New York. · Zbl 0973.62007
[21] Hui, K.-L. and Png, I. P. L. (2006). The economics of privacy. In Handbook on Economics and Information Systems.
[22] Karr, A. F., Lin, X., Sanil, A. P. and Reiter, J. P. (2005). Secure regression on distributed databases. J. Comput. Graph. Statist. 14 263–279.
[23] Karr, A. F., Sanil, A. P. and Banks, D. L. (2006). Data quality: A statistical perspective. Statist. Methodology 3 137–173. · Zbl 1248.62001
[24] Kohavi, R., Mason, L., Parekh, R. and Zheng, Z. (2004). Lessons and challenges from mining retail e-commerce data. Machine Learning 57 83–113.
[25] Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature 400 107–109.
[26] Liggett, W. and Buckley, C. (2005). System performance and natural language expression of information needs. Information Retrieval 8 101–128.
[27] Madigan, D. (2005). Statistics and the war on spam. In Statistics: A Guide to the Unknown, 4th ed. (R. Peck, G. Casella, G. Cobb, R. Hoerl, D. Nolan, R. Starbuck and H. Stern, eds.) 135–147. Thomson Brooks/Cole, Belmont, CA.
[28] Mauldin, M. L. (1991). Retrieval performance in FERRET, a conceptual information retrieval system. In Proc. 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (E. Fox, ed.) 347–355. ACM Press, New York.
[29] Maxion, R. and Tan, K. (2002). Anomaly detection in embedded systems. IEEE Transactions on Computers 51 108–120.
[30] Miller, D. R. H., Leek, T. and Schwartz, R. M. (1999). A hidden Markov model information retrieval system. In Proc. 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (F. Gey, M. Hearst and R. Tony, eds.) 214–221. ACM Press, New York.
[31] Moe, W. and Fader, P. S. (2004). Dynamic conversion behavior at e-commerce sites. Management Sci. 50 326–335.
[32] National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research (1979). The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects in Research. National Institutes of Health.
[33] Paice, C. D. (1996). Method for evaluation of stemming algorithms based on error counting. J. Amer. Soc. Information Science 47 632–649.
[34] Rimm, M. (1995). Marketing pornography on the information highway: A survey of 917,410 images, descriptions, short stories, and animations downloaded 8.5 million times by consumers in over 2000 cities and territories. Georgetown Law J. 83 1849–1934.
[35] Rivest, R. L., Shamir, A. and Adleman, L. (1978). A method for obtaining digital signatures and public-key cryptosystems. Comm. ACM 21 120–126. · Zbl 0368.94005
[36] Schapire, R., Singer, Y. and Singhal, A. (1998). Boosting and Rocchio applied to text filtering. In Proc. 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (W. B. Croft, A. Moffat, C. van Rijsbergen, R. Wilkinson and J. Zobel, eds.) 215–223. ACM Press, New York.
[37] Shmueli, G. and Jank, W. (2005). Visualizing online auctions. J. Comput. Graph. Statist. 14 299–319. · Zbl 1198.91007
[38] Shmueli, G. and Jank, W. (2006). Modeling the dynamics of online auctions: A modern statistical approach. In Economics, Information Systems and E-commerce Research II: Advanced Empirical Methods (R. Kauffman and P. Tallon, eds.). Sharpe, Armonk, NY.
[39] Sismeiro, C. and Bucklin, R. E. (2004). Modeling purchase behavior at an e-commerce web site: A task completion approach. J. Marketing Research 41 306–323.
[40] Sullivan, D. (2004). Search engine size wars V erupts. Search Engine Watch. Available at blog.searchenginewatch.com/blog/041111-084221.
[41] U.S. Census Bureau (2005). E-Stats. May 11. Available at www.census.gov/estats.