Machine learning for text.

*(English)* Zbl 1395.68001
Cham: Springer (ISBN 978-3-319-73530-6/hbk; 978-3-319-73531-3/ebook). xxiii, 493 p. (2018).

The book is a timely addition to the series of books on machine learning approaches applied to various fields. In this one, the focus is on the processing of text (either static or streaming), in which traditional methods are adapted for classification, summarization or information extraction (e.g. entity/relationship recognition, opinion mining and sentiment analysis). The content is organized as a textbook, structured in 14 chapters, each concluding with a summary, a set of references for further reading and exercises to emphasise the main aspects discussed.

In the first chapter the author provides an introduction to machine learning, overviewing some of the approaches applicable to text (and discussed in detail in subsequent chapters) including text pre-processing, text clustering and classification. A brief overview of traditional approaches, such as calculating similarity and classification methods like decision trees, rule-based, Naïve Bayes, nearest-neighbour and linear classifiers, precedes a description of methods for information retrieval and extraction, and summarization.

In the second chapter the similarity measures are presented in more detail. Advances in organising text (either regular or web-specific) into tokens and the extraction of meaningful terms from a set of tokens precede the description of the vector-based representation and normalization of term frequencies. Next, in the third chapter, matrix factorization and topic modelling are discussed. Methods including singular-value decomposition, non-negative matrix factorization and probabilistic latent semantic analysis are presented at length; for each, interesting examples are followed by a discussion of its advantages and disadvantages. The chapter concludes with an overview of latent Dirichlet allocation (LDA) and a section on the applicability of nonlinear transformations based on kernel similarity functions.

The fourth chapter is dedicated to text clustering and feature selection (in the context of matrix factorization methods and nonlinear dimensionality reduction). Standard approaches like \(k\)-means, hierarchical clustering and the use of ensembles are presented in detail. The chapter closes with an objective evaluation of the various approaches; internal and external validity measures are discussed, the latter with additional information on the relationship between clustering evaluation and supervised learning.

The next three chapters focus on classification methods. In the fifth chapter the author first reviews basic models for feature selection (including the Gini index and mutual information) followed by the Naïve Bayes model on Bernoulli or multinomial distributions, the nearest-neighbours approach, decision trees and random forests, and rule-based classifiers. The emphasis of the sixth chapter is on linear classification and regression, introducing least-squares regression (with \(L_1\) and \(L_2\) regularization), support vector machines and logistic regression, and concluding with the applicability of nonlinear generalizations of linear models (with a focus on kernel SVMs and the kernel trick). Lastly, in the seventh chapter the author introduces the evaluation of classification performance, covering the bias-variance trade-off, ensemble methods such as bagging, subsampling and boosting, and the split of data into training and testing portions to facilitate the application of methods like hold-out and cross-validation.

The eighth chapter is dedicated to methods for dealing with heterogeneous data, such as joint text mining based on the shared matrix factorization trick. The principles behind factorization machines as well as joint probabilistic modelling techniques are also presented. In the next chapter the focus changes to information retrieval and search engines for indexing and query processing, as well as scoring information in retrieval mode and web crawling techniques. Modern approaches for link-based ranking algorithms (such as PageRank) are also presented.

In the tenth chapter the author presents text sequence modelling from the deep learning angle; kernel methods, word-context matrix factorization models, neural language models and recurrent neural networks are presented in detail with a variety of examples (including word2vec, sequence-to-sequence learning and machine translation). The eleventh chapter focuses on text summarization. Following the side-by-side introduction of extractive and abstractive summarization, the topic word method, the latent method and a machine learning approach based on feature extraction are presented. Elements for the latter method, including sentence compression, information fusion and ordering, are also included.

Information extraction is further expanded in the twelfth chapter with methods for named entity recognition, either rule-based or with a hidden Markov model component. The extraction of relationships between the words (including the prediction of relationships) is presented, using the parsing of dependency graphs and the application of kernel methods on convolution trees. This task is carried forward in the thirteenth chapter, which includes methods for opinion mining and sentiment analysis. First, the opinion lexicon is introduced, followed by classification approaches at document level and phrase or sentence level. The chapter concludes with supervised methods for spam detection and approaches for opinion summarization, either based on rating or sentiment.

In the fourteenth chapter the author presents methods for text segmentation and event detection. Approaches for mining text streams using unsupervised techniques and the detection of events using either unsupervised (e.g. nearest neighbour, generative models) or supervised methods (supervised segmentation) are also included.

The book is a remarkable combination of textbook and research overview of the current state of the art in machine learning methods for text processing. The detailed examples throughout the chapters and the numerous diagrammatic representations make it recommendable not only as a perfect starting point for undergraduate students but also as a reliable reference for a wide variety of scientists with diverse backgrounds.

Reviewer: Irina Ioana Mohorianu (Oxford)

##### MSC:

| Code | Description |
|---|---|
| 68-01 | Introductory exposition (textbooks, tutorial papers, etc.) pertaining to computer science |
| 68-02 | Research exposition (monographs, survey articles) pertaining to computer science |
| 62H30 | Classification and discrimination; cluster analysis (statistical aspects) |
| 68P20 | Information storage and retrieval of data |
| 68T05 | Learning and adaptive systems in artificial intelligence |
| 68T50 | Natural language processing |