Outlier analysis.

*(English)* Zbl 1291.68004
New York, NY: Springer (ISBN 978-1-4614-6395-5/hbk; 978-1-4614-6396-2/ebook). xv, 446 p. (2013).

This book aims at providing a missing formal view of recent advances in outlier analysis that have been carried out mostly independently in both the computer science and statistics communities. The field of outlier detection is a mature and relevant field with applications to intrusion detection, credit card fraud, medical diagnosis, law enforcement and earth science among others.

The book starts with a high-level introduction to the main issues in outlier detection, namely, the appropriate selection of the data model and of the test to be applied. It reviews basic tests, such as the \(Z\)-value test or extreme value analysis. After that, the book proceeds with a careful categorization of existing techniques into different classes: probabilistic and statistical models, linear models, proximity-based models and information-theoretic models. Special attention is given to high-dimensional outlier detection, since the concepts of density and distance do not provide useful information in high-dimensional spaces. Finally, the book also discusses specific techniques for different data types, such as numerical, categorical, mixed, text, time series, data with dependencies, spatial data, and graphs.
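
To make the \(Z\)-value test concrete, here is a minimal Python sketch. It is not taken from the book; the function name `z_value_outliers` and the cutoff of 3 standard deviations are illustrative assumptions. A point is flagged when its distance from the sample mean, measured in sample standard deviations, reaches the cutoff.

```python
from statistics import mean, stdev


def z_value_outliers(values, threshold=3.0):
    """Return (index, z) pairs for values whose |z| is at least `threshold`.

    z = (x - sample mean) / sample standard deviation.
    The threshold of 3.0 is a common convention, assumed here for illustration.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if sigma == 0:
        return []  # all values identical: nothing can be flagged
    flagged = []
    for i, x in enumerate(values):
        z = (x - mu) / sigma
        if abs(z) >= threshold:
            flagged.append((i, z))
    return flagged


# Twenty readings near 10 and one reading at 25.
readings = [10.0, 10.2, 9.8, 10.1, 9.9] * 4 + [25.0]
for index, z in z_value_outliers(readings):
    print(index, round(z, 2))
```

Note that the extreme value itself inflates both the mean and the standard deviation, so in very small samples no point can reach a cutoff of 3 at all.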

Additionally, the book contains a series of carefully created exercises, in an attempt to make it useful as a textbook. Specific algorithms, however, seem to be missing from many chapters. While the book contains algorithms for some techniques, many others are just described in the text, limiting its usability as a textbook (students need very specific instructions and guidance). Some chapters, like Chapter 6, contain only very high-level descriptions of the techniques, and thus are useful only to those readers who are very familiar with related areas such as classification in machine learning. As a textbook, it would have been more beneficial to cover a smaller number of techniques, but to clearly include algorithms and procedures that can be reproduced by students. As it stands, it is only useful as an “overview” of the topic, without going into technical details.

Chapter 2 focuses on probabilistic and statistical models, which are based upon manually selecting a particular type of probability distribution (such as a Gaussian mixture) and then using an EM-like algorithm to fit the distribution to the data. After that, the fitted distribution can be used for identifying data points that are likely not part of the “normal” data. The author makes a very clean division between univariate methods (based on extreme value analysis), multivariate methods, and general probabilistic methods (based on fitting a general distribution to the data sample using EM). As the author describes, the limitation of probabilistic methods is that one must choose a family of probability distributions to model the data, which will have a number of parameters. Too few parameters lead to poor modeling, and too many lead to overfitting. Moreover, many simplifying assumptions are typically made to ease parameter estimation, and these are not satisfied in many real-world datasets, thus limiting the attractiveness of these approaches.
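
The fit-then-score idea can be sketched in a few lines of Python. This is an illustration under strong assumptions, not the book's algorithm: the data are one-dimensional, the mixture has a fixed number of Gaussian components, the initialization is deterministic, and the names `fit_gmm_1d` and `mixture_density` are hypothetical. Points with the lowest fitted density are the outlier candidates.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, k=2, iters=50, min_var=1e-6):
    """Fit a k-component 1-D Gaussian mixture with plain EM (illustrative only)."""
    n = len(xs)
    lo, hi = min(xs), max(xs)
    # Deterministic start: means spread over the data range, shared variance.
    mus = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    mean_all = sum(xs) / n
    var_all = max(sum((x - mean_all) ** 2 for x in xs) / n, min_var)
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # component received no mass; leave it unchanged
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, min_var)  # floor avoids a collapsing component
    return weights, mus, variances


def mixture_density(x, model):
    """Density of x under the fitted mixture; low values suggest outliers."""
    weights, mus, variances = model
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, mus, variances))


# Two groups of points (near 0 and near 10) and one point in between.
data = [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0, 5.0]
model = fit_gmm_1d(data, k=2)
least_likely = min(data, key=lambda x: mixture_density(x, model))
print(least_likely)
```

The choice of `k` is exactly the modeling decision discussed above: too few components cannot represent the data, and too many allow a component to fit the outlier itself.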

Chapter 3 discusses linear models, based on approximating the data with a linear model, such as a line, a plane or a hyperplane. This can be done, for example, via regression analysis. Points that lie far away from this model are likely to be outliers. Principal component analysis is a typical tool in these models, since it allows the creation of linear models of a given number of dimensions. The author presents several methods for linear analysis together with their limitations. Namely, linear models assume global correlations between data, while some datasets might only exhibit local correlations.
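
As a small illustration of the regression-based variant, the following Python sketch fits a line by ordinary least squares and scores each point by its absolute residual. It is a simplified two-dimensional example written for this purpose, not code from the book, and the helper names `fit_line` and `residual_scores` are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residual_scores(xs, ys):
    """Absolute vertical distance of each point from the fitted line."""
    intercept, slope = fit_line(xs, ys)
    return [abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)]


# Points on y = 2x + 1, with one point displaced upwards.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] += 15
scores = residual_scores(xs, ys)
print(scores.index(max(scores)))
```

The model is fitted to all points, including the displaced one, so an outlier also distorts the line it is measured against; and a single global line cannot capture correlations that hold only locally.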

Chapter 4 is concerned with proximity-based methods, which look for data points that are far away from the rest of the points. Proximity-based methods can be divided into three broad families: clustering-based (which partition the data points), density-based (which partition the space), and distance-based (based on a distance metric between data points). Distance-based methods can perform a very fine-grained analysis, at an increased computational cost. However, an intrinsic limitation of proximity-based methods is that, for high-dimensional datasets, points become almost equidistant from one another, thus limiting the power of the analysis (curse of dimensionality).
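
One common distance-based score is the distance from each point to its \(k\)-th nearest neighbor. The Python sketch below is a brute-force illustration, not the book's formulation; the function name `knn_distance_scores` and the default `k=2` are assumptions. Its quadratic number of distance computations also shows where the computational cost comes from.

```python
import math


def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    Brute force: every pairwise distance is computed, so the cost grows
    quadratically with the number of points.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Five points in a small square and one point far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
scores = knn_distance_scores(points, k=2)
print(scores.index(max(scores)))
```

Larger scores indicate more isolated points. In high-dimensional data these scores tend to become nearly equal across points, which is the limitation noted above.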

Chapter 5 covers subspace techniques for high-dimensional datasets. Specifically, the chapter covers several families of techniques: search-based methods (e.g. genetic algorithms), methods based on distance metrics, and “ensemble methods”.

Chapter 6 discusses supervised outlier detection techniques, which assume that previous examples of outliers are available. This can be viewed as a special type of classification problem, where one of the classes is very rare. The author surveys a wide selection of methods for this scenario, including active learning, resampling, weighting, and others.

Finally, Chapters 7 to 11 discuss outlier detection techniques for specific data representations such as text, graphs, etc. These chapters provide interesting discussions. Particularly interesting is the discussion of similarity indexes for non-numerical data. Although a significant amount of work in this area is missing, the book provides a very insightful description of the methods it does cover. Among the omissions, similarity methods that employ domain knowledge to determine similarity between categorical data (e.g. conceptual hierarchies) are not discussed. Also, concerning text data, many approaches other than the cosine similarity have been proposed, such as the use of “bag-of-concepts” or structural similarities based upon edit distances between parse trees. The discussions in Chapters 8 and 9 about time-series analysis are clear and concise, and cover most of the relevant work in the area, although they perhaps fail to mention some dynamic Bayesian networks (DBNs) other than HMMs. Although HMMs are the most common, other types of DBNs can be useful for specific discrete, or even continuous, sequences.

The book concludes with Chapter 12, which provides a quite extensive list of practical application domains for outlier analysis. Especially useful, from a practical point of view, is the very last section of the book, which gives general advice that is often overlooked in much work in this area (e.g. data must be normalized, etc.).

All in all, this is an excellent book. It covers an enormous amount of material with no small amount of insight. However, the book seems to be oriented more towards the experienced researcher, who will use it as reference material, than towards students, given the lack of some of the specific technical details that would be required for its use as a textbook.

Reviewer: Santiago Ontanon (Philadelphia)