
Machine learning using R. With time series and industry-based use cases in R. 2nd edition. (English) Zbl 1423.68007

New York, NY: Apress (ISBN 978-1-4842-4214-8/pbk; 978-1-4842-4215-5/ebook). xxiv, 700 p. (2019).
The book is a textbook-like collection of theory and hands-on examples aimed at providing a smooth introduction to both machine learning approaches and R as a programming and analysis environment. It comprises eleven chapters, each focusing on a particular task, e.g., data preparation and exploration (Chapter 2) or machine learning theory and practice (Chapter 6), the latter ambitiously covering classical supervised and unsupervised approaches.
In the first chapter, “Introduction to machine learning and R”, the authors provide a high-level overview of theoretical concepts used throughout the book. The first subsection traces the timeline of machine learning, starting with statistical learning and “modern” machine learning and continuing with concepts related to artificial intelligence, data mining and data science. Next, a brief overview of probability and statistics is presented, with an emphasis on the definitions related to counting events, probability distributions and approaches for hypothesis testing. Also included in this chapter are some basic elements of R as a programming environment and a description of the steps of a process flow for machine learning analyses. The second chapter focuses on approaches for data pre-processing and exploration. It includes an overview of input data, such as variable types and data formats, tips for organising multiple sources of data and methods for reshaping the data to make it comparable across projects. Also included are some elements of exploratory data analysis based on summary statistics. The chapter concludes with a case study on credit card fraud.
In the third chapter the authors discuss sampling and resampling. Following an introduction of definitions related to sampling, including the analysis of the sampling distribution, the population and sample mean and variance, sampling with and without replacement, and potential biases, the authors present elements of statistical theory related to this topic, i.e., the law of large numbers and the central limit theorem. The chapter concludes with a more detailed overview of probability sampling techniques such as simple random sampling, systematic random sampling, stratified random sampling, cluster sampling and bootstrap sampling. The fourth chapter outlines approaches for data visualisation in R; these include a variety of line plots, scatter plots and summary plots, all of them accompanied by examples and the relevant code. The fifth chapter focuses on selecting the features relevant for a given question. Using the data summary as a starting point, the authors proceed to the characterisation of features, e.g., continuous or categorical, ranked based on their relevance, and conclude with the identification of a subset of features responsible for most of the variation observed in the data set; for the latter, filter, wrapper and embedded methods, accompanied by principal component analysis, are discussed.
The sixth chapter is built as a wide, comprehensive overview of machine learning approaches; following a description of methods categorised as supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning, the authors describe, one by one, most of the standard approaches, including regression and correlation analysis, support vector machines, decision trees, the naïve Bayes method, cluster analysis, association rule mining and artificial neural networks. All the theoretical concepts are accompanied by examples and R code. The chapter concludes with a model-building checklist that guides users in determining which method might be more appropriate for a particular real-world problem. In the seventh chapter the authors describe approaches for measuring and evaluating model performance for both continuous and discrete outputs. Cross-validation and bootstrap sampling approaches are included. In the eighth chapter methods for improving model performance are reviewed; the focus is on the caret package in R, which is used for different types of searches for optimal ranges of hyper-parameters. The concept of ensemble learning (based on voting schemes) is also presented and illustrated using bagging trees, gradient boosting with a decision tree and by blending kNN and Rpart.
The ninth chapter focuses on the modelling of time series, with an emphasis on tests for stationarity, the autocorrelation and partial autocorrelation functions (ACF and PACF), and the AR and MA models. The ARIMA model and linear regression with AR errors are discussed in detail. In the tenth chapter the authors discuss the scalability of machine learning approaches. Following an overview of standard techniques for distributed processing and storage, including the Google file system (GFS), MapReduce and parallel execution in R, three further environments appropriate for this task (Hadoop, Spark and H2O) are presented. The book concludes with a series of examples on how to use the Keras and TensorFlow deep learning libraries for R. The eleventh chapter commences with an overview of the most frequently used learning architectures: convolutional neural networks (CNNs), recurrent neural networks (RNNs) and generative adversarial networks (GANs), followed by a summary of the deep learning toolsets in R. The chapter ends with a use case on the identification of duplicate questions in Quora.
The wide variety of concepts and the unique blend of theory and exercises recommend this book as a reliable starting point for researchers looking for a deeper understanding of machine learning approaches or attempting to use R as an environment for the processing of their real-world data sets. The chapters are richly illustrated with examples and code which reinforce the definitions usually overviewed at the start of each chapter. The amount of theory preceding the examples is sufficient for a good understanding of the concepts; moreover, the references scattered throughout provide further support if readers wish to pursue specific analyses. The book is suitable for a wide variety of backgrounds and skill sets; it is addressed to readers ranging from undergraduates and postgraduates to established researchers, from a wide range of disciplines such as computer science, mathematics, physics and biology.

MSC:

68-01 Introductory exposition (textbooks, tutorial papers, etc.) pertaining to computer science
68-02 Research exposition (monographs, survey articles) pertaining to computer science
68T05 Learning and adaptive systems in artificial intelligence