
Apache Mahout: machine learning on distributed dataflow systems. (English) Zbl 07255158
Summary: Apache Mahout is a library for scalable machine learning (ML) on distributed dataflow systems, offering various implementations of classification, clustering, dimensionality reduction and recommendation algorithms. Mahout was a pioneer in large-scale machine learning when the project started in 2008, targeting MapReduce, which was the predominant abstraction for scalable computing in industry at the time. Mahout has been widely used by leading web companies and is part of several commercial cloud offerings.
In recent years, Mahout migrated to a general framework that enables mixing dataflow programming with linear algebraic computations on backends such as Apache Spark and Apache Flink. This design allows users to execute data preprocessing and model training in a single, unified dataflow system, instead of requiring a complex integration of several specialized systems. Mahout is maintained as a community-driven open source project at the Apache Software Foundation and is available at https://mahout.apache.org.
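The sketch below illustrates this programming model in Mahout's Scala DSL (called Samsara), here assuming the Spark backend; the tiny matrix, master URL and application name are illustrative placeholders rather than taken from the paper. It forms the normal equations X^T X beta = X^T y for least squares with distributed matrix operations and solves the small resulting system in-core:

```scala
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

object NormalEquationsSketch extends App {

  // Distributed context backed by Spark; the master URL and app name
  // are placeholder values for this sketch.
  implicit val ctx = mahoutSparkContext(masterUrl = "local",
                                        appName = "mahout-sketch")

  // Small illustrative feature matrix X, distributed as a DRM
  // (distributed row matrix), and an in-core target vector y.
  val drmX = drmParallelize(dense((1, 2), (3, 4), (5, 6)), numPartitions = 2)
  val y = dvec(1.0, 2.0, 3.0)

  // Linear-algebra expressions over DRMs build a logical plan that is
  // optimized and executed on the underlying dataflow system.
  val drmXtX = drmX.t %*% drmX
  val drmXty = drmX.t %*% y

  // Materialize the small results on the driver and solve the
  // normal equations X^T X beta = X^T y in-core.
  val beta = solve(drmXtX.collect, drmXty.collect(::, 0))

  println(beta)
  ctx.close()
}
```

Note that expressions such as drmX.t %*% drmX are not executed eagerly; the DSL assembles a logical operator plan that the chosen backend optimizes and runs, which is what allows preprocessing and model training to share a single dataflow program.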
MSC:
68T05 Learning and adaptive systems in artificial intelligence