**Clustering for data mining. A data recovery approach.**
*(English)*
Zbl 1083.68099

Chapman & Hall/CRC Computer Science & Data Analysis Series. Boca Raton, FL: Chapman & Hall/CRC Press (ISBN 1-58488-534-3/hbk). xxiii, 266 p. (2005).

Clustering divides data into groups of similar objects so that patterns hidden in the data become explicit. For example, given the set of web sites visited by a group of people, someone might be interested in the surfing pattern of this group. Clustering techniques thus play an important role in data mining and knowledge discovery. Clustering has developed into an important and promising research area with many applications in statistical analysis, image analysis, and pattern recognition. Although many different techniques based on hierarchical analysis, partitioning, density, grids, etc. have been suggested and applied, there is no coherent theoretical foundation. In particular, there is no clear indication as to which of these techniques should be used in a particular setting, or what criterion should be applied to measure its success. Since different techniques often lead to different results, an obvious question is which result one should believe.

This book tries to put an end to this situation by arguing that a clustering is good if the data recovered from it fit the original data; the closer the fit, the better. More formally, as the author points out in Section 5.1, given observed data \(Y\), the data \(F(A)\) recovered from the data model \(A\) used to characterize \(Y\) should satisfy the following “underlying equation”: \[ Y=F(A)+E, \] where \(E\), the “residual” caused by flaws in data sampling, feature selection, and phenomenon modeling, should be minimized. The book then goes through eight chapters to argue why it should be done this way and to explain how it can be done. More specifically, the author uses two quite informative introductory chapters to define and discuss general concepts related to clustering, including exemplary problems and various perspectives, and table data, including its nature, various features, and main analytical tools. In Chapters 3 and 4 he focuses on two popular clustering techniques, \(K\)-means clustering and Ward hierarchical clustering, since both are based on statistical analysis and share the same classification criterion of least-squares distance minimization. Using these two techniques as leading examples, the author then presents, in Chapter 5, the general data recovery model as a unifying theme for the clustering field and for other techniques such as principal component analysis. The rest of the book consists of a discussion of other clustering techniques, such as various extensions of \(K\)-means clustering, fuzzy clustering, self-organizing maps, and expectation-maximization, in Chapter 6, and a further exploration of related issues, such as feature selection, validity and reliability of clusters, approaches to dealing with missing data, and options for data pre-processing and standardization, in Chapter 7.
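
To make the underlying equation concrete, here is a minimal sketch (not code from the book) that reads \(K\)-means in data recovery terms: the model \(A\) is a partition plus its centroids, \(F(A)\) replaces each row of \(Y\) by the centroid of its cluster, and \(E=Y-F(A)\) is what remains. The function names (`k_means`, `scatter`, `sq_dist`), the toy data, and the choice of initial centroids are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def scatter(rows: Sequence[Sequence[float]]) -> float:
    """Sum of squared entries of a data table."""
    return sum(v * v for row in rows for v in row)


def k_means(
    data: Sequence[Row], init: Sequence[Row], max_iter: int = 100
) -> Tuple[List[int], List[Row]]:
    """Plain alternating K-means started from the given centroids (illustrative)."""
    centroids = [tuple(c) for c in init]
    labels = [-1] * len(data)
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(y, centroids[k]))
            for y in data
        ]
        if new_labels == labels:
            break  # partition is stable, centroids already match it
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for k in range(len(centroids)):
            members = [y for y, lab in zip(data, labels) if lab == k]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy data table Y (hypothetical): two well-separated pairs of points.
Y = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels, centroids = k_means(Y, init=[Y[0], Y[2]])

# Recovered data F(A): every row is replaced by its cluster centroid.
F = [centroids[lab] for lab in labels]
# Residuals E = Y - F(A), entry by entry.
E = [tuple(y_j - f_j for y_j, f_j in zip(y, f)) for y, f in zip(Y, F)]

print("labels:", labels)
print("centroids:", centroids)
print("residual scatter:", scatter(E))
print("data scatter:", scatter(Y), "explained:", scatter(F))
```

When the centroids are within-cluster means, the data scatter splits into the part explained by the clusters plus the residual part, so minimizing the squared residuals is the same as maximizing the explained part. This is the sense in which the least-squares criterion measures how well a clustering recovers the data.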

After going through this book, there is no doubt in my mind that the author knows what he is writing about. Moreover, a large number (58) of examples taken from many different walks of life are used to demonstrate most of the techniques discussed in the book. A feature I particularly like is that, at the end of each example, a cross reference indicates where the example is used and analyzed later, sometimes in as many as 15 different places. The book also contains an extensive bibliography with 142 items and a useful two-level index. On the other hand, although each chapter begins with a segment of goals and a list of key concepts, referred to as the base words, there are no exercises anywhere in the book. This might make the book less likely to be adopted as a textbook. Nevertheless, it is definitely a wonderful resource for those interested in what clustering is and how it should be applied.

Reviewer: Zhizhang Shen (Plymouth/New Hampshire)

### MSC:

| Code | Classification |
| --- | --- |
| 68T05 | Learning and adaptive systems in artificial intelligence |
| 68T30 | Knowledge representation |
| 62-07 | Data analysis (statistics) (MSC2010) |
| 68P05 | Data structures |
| 68P15 | Database theory |
| 68-01 | Introductory exposition (textbooks, tutorial papers, etc.) pertaining to computer science |