Bioinformatics. Volume II: structure, function, and applications. 2nd edition. (English) Zbl 1384.92002

Methods in Molecular Biology 1526. New York, NY: Humana Press (ISBN 978-1-4939-6611-0/hbk; 978-1-4939-6613-4/ebook). xi, 426 p. (2017).
The second volume of “Bioinformatics”, edited by Jonathan M. Keith, covers aspects related to the understanding of the biological molecules as part of systems of interacting elements; the authors focus mainly on algorithms for the interpretation of the structure and its links to function and identification of pathways and networks. The book is structured into three sections.
The first section of the book presents approaches for linking the structure and function and methods for the identification of pathways (gene networks). The first chapter focuses on a hybrid computational and experimental approach that combines sparse experimental restraints with modelling algorithms for increasing the accuracy of 3-D protein models. Following a description of the state-of-the-art, the nuclear magnetic resonance (NMR) spectroscopy, the authors present the characteristics of protein-ligand interactions and of protein-protein complexes. The computational approach described in detail is the Rosetta structure calculation algorithm.
In the second chapter, the authors describe the inference of function from homology, focusing on annotation based on protein-domain detection and sequence similarity methods (BLAST). In the third chapter, a method to infer functional relationships from the conservation of gene order in prokaryotes is proposed. The approach is based on the hypothesis that the conservation of adjacent genes may indicate a polycistronic transcription unit. Each step of the analysis, including the description of the neighbours database, is presented in detail.
In the fourth chapter, the authors present the structural and functional annotation of long non-coding RNAs (lncRNAs), using as a starting point annotation databases (e.g., NCBI, ENSEMBL), transcriptomic data and multiple alignments. For detecting the homology of non-coding RNAs, tools such as Infernal are employed, coupled with the detection of secondary structures (identification of functional 2-D motifs via comparative genomics). In the fifth chapter, the authors describe the identification of functional gene networks based on phylogenetic profiles, under the hypothesis that functional constraints lead to similar patterns of presence and absence during speciation. The examples are based on protein sequences and software for homology searches (BLAST). For the assessment of functional associations, measures of similarity between the profiles are discussed.
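To make the idea of profile similarity concrete, the following is a minimal sketch that assumes binary presence/absence profiles and uses mutual information as the similarity measure. The gene names and profiles are hypothetical, and the measures discussed in the chapter itself may differ.

```python
from collections import Counter
from math import log2


def mutual_information(p, q):
    """Mutual information (in bits) between two equal-length binary profiles."""
    n = len(p)
    joint = Counter(zip(p, q))   # counts of (p_i, q_i) pairs
    px, qy = Counter(p), Counter(q)
    mi = 0.0
    for (x, y), count in joint.items():
        pxy = count / n
        mi += pxy * log2(pxy / ((px[x] / n) * (qy[y] / n)))
    return mi


# Hypothetical profiles: 1 = homologue detected in that genome, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 0],
    "geneB": [1, 1, 0, 0],   # same pattern as geneA
    "geneC": [1, 0, 1, 0],   # statistically independent of geneA
}

score_ab = mutual_information(profiles["geneA"], profiles["geneB"])
score_ac = mutual_information(profiles["geneA"], profiles["geneC"])
```

Note that mutual information also scores perfectly complementary profiles highly, so a signed measure (for example a correlation coefficient) may be preferable when only co-occurrence is of interest.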
In the sixth chapter, the authors review methods for inferring genome-wide interaction networks such as C3NET, RN, ARACNE, CLR and MRNET; implementation details and examples on publicly available datasets are included. The seventh chapter presents an approach for integrating heterogeneous datasets such as gene expression, copy number aberration (CNA), miRNA expression, methylation data and protein-protein interactions for the identification of cancer modules. The methods described in detail include iMCMC (identifying Mutated Core Modules in Cancer), VToD and the approaches presented in [Z. Wen et al., “An integrated approach to identify causal network modules of complex diseases with application to colorectal cancer”, J. Am. Med. Inform. Assoc. 20, No. 4, 659–667 (2013; doi:10.1136/amiajnl-2012-001168)] and [E. Cerami et al., “Automated network analysis identifies core pathways in glioblastoma”, PLOS ONE 5, No. 2, e8918, 10 p. (2010; doi:10.1371/journal.pone.0008918)]; these propose different frameworks that combine data-driven inference and topological properties of networks in varying proportions. The last chapter of the first part focuses on metabolic pathways. Following a description of general text mining tools and approaches for the recognition of named entities, the authors present a heuristic metabolic pathway extraction method exemplified on public data.
The second part focuses on applications of data mining methods; it commences with a chapter on the analysis of data from genome-wide association studies (GWAS), aimed at linking genetic variants with complex traits and diseases. The author presents a step-by-step guide for the analysis of both binary and quantitative traits, followed by a section on data quality control and cleaning and an overview of methods for genotype imputation and association testing. Chapter 10 focuses on methods for adjusting for relatedness that may confound the results of GWAS; both ancient (population stratification) and recent (familial structure) relatedness are discussed. The authors also include a side-by-side comparison of different methods on a dataset consisting of Utah residents with ancestry from northern and western Europe.
In Chapter 11, the author presents the analysis of quantitative trait loci (QTL), including both a generic description and standard workflows that link it with GWAS. The software MERLIN is used as an example since it supports both parametric and nonparametric linkage analyses, association studies and Mendelian error detection. In Chapter 12, methods for integrating high-dimensional profiling for computational diagnosis are presented; the examples focus on integrating gene expression and the profiling of metabolites. Following a discussion of pitfalls encountered in classification problems, the authors present an overview of the methods and software currently available for addressing this task (e.g., diagonal linear discriminant analysis (DLDA) and univariate gene selection).
Chapter 13 focuses on approaches for the computational evaluation and quantification of molecular similarity, and its extrapolation to properties, as applied in chemoinformatics. Starting with an overview of key concepts such as 2- versus 3-dimensional similarity and global versus local similarity, the author surveys the effects of several similarity functions and search strategies. The next chapter continues with a description of data mining applied to compound activity data for drug discovery. Following the description of public domain repositories, the effects of data volume and complexity are described, along with a brief overview of methods (e.g., virtual compound screening, identification of matched molecular pairs and evaluation of activity profiles).
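As an illustration of a similarity function combined with a simple search strategy, the sketch below computes the Tanimoto coefficient on 2-D fingerprints and filters a library by a threshold. It assumes fingerprints are given as sets of "on" bit positions; the compound names, bit positions and the threshold of 0.5 are hypothetical and not taken from the book.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query, library, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, best first."""
    hits = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: -hit[1])


# Hypothetical fingerprints, each a set of on-bit positions.
library = {
    "cmpd1": {1, 4, 7, 8},
    "cmpd2": {2, 3},
    "cmpd3": {1, 4, 7, 9},
}
query = {1, 4, 7, 9}

results = similarity_search(query, library)
```

The choice of threshold strongly affects which compounds are retrieved, which is one reason the effects of different similarity functions and search strategies deserve a systematic comparison.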
In Chapter 15, the authors present approaches for the study of the antibody repertoire with next-generation sequencing (rep-seq methods). Starting with a description of background concepts such as the elements of antigen recognition, receptor development and production of isotypes, the advantages of high-throughput investigation of the antibody repertoire and the available tools are discussed. The chapter concludes with an example on the monitoring of the humoral immune response to vaccination. In the following chapter, the authors present a mathematical method for the visualisation of large-scale datasets using the QAPgrid approach, applied to the identification of biomarkers in cell-specific transcriptomic signatures. The method is described in detail using examples, pseudocode and diagrams. Chapter 17 focuses on a method for the identification of discriminative features for robust within-set classification for breast cancer diagnosis. First, the \(k\)-feature set problem is introduced, followed by a description of the dataset used as an example. The task is then rephrased as an \((\alpha,\beta)\)-\(k\)-feature set problem, an optimisation problem on graphs. The reduction techniques and the memetic algorithm (including the results on the test data) are described in detail.
The third part describes four computational methods. In Chapter 18, the authors present an inference-based approach for determining cell signalling pathways using proteomic datasets as input. The examples focus on the mitogen-activated protein (MAP) kinase pathway. The next chapter reviews clustering methods including hierarchical clustering, \(k\)-means, self-organising maps and model-based versions, all exemplified for the identification of gene profiles. A brief overview of methodological aspects is complemented with a detailed example.
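For readers unfamiliar with \(k\)-means, a minimal sketch of the standard Lloyd iteration on expression profiles might look as follows. The profiles are hypothetical, and seeding the centroids with the first \(k\) points is a simplifying assumption made here for reproducibility, not the book's procedure.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    # Assumption: naive deterministic initialisation with the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical expression profiles (rows = genes, columns = conditions).
expression = [
    [1.0, 1.1, 0.9],
    [5.0, 5.2, 4.9],
    [0.9, 1.0, 1.2],
    [5.1, 4.8, 5.0],
]

labels, centroids = kmeans(expression, k=2)
```

In practice the result depends on the initialisation and on the choice of \(k\), so several restarts are usually compared.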
Chapter 20 presents approaches for developing parameterized algorithms for finding exact solutions of NP-hard biological problems. Methods such as kernelisation (data reduction with maintained effectiveness), depth-bounded search trees, dynamic programming, tree decomposition of graphs, colour coding for the identification of small patterns in graphs, and iterative compression based on recursively solving smaller instances are all presented on case studies. This part concludes with Chapter 21, which describes approaches for visualising information derived from biological datasets. The usage of heat maps and force-based network layouts for graphs is discussed in detail.
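Kernelisation and depth-bounded search trees can be illustrated on the textbook Vertex Cover problem, chosen here as a stand-in and not necessarily one of the book's case studies. The sketch below assumes a simple undirected graph without self-loops, applies the classical high-degree reduction rule, and then branches on the two endpoints of an uncovered edge.

```python
def vertex_cover(edges, k):
    """Return a vertex cover of size <= k as a set, or None if none exists.

    Assumption: `edges` is an iterable of pairs (u, v) with u != v.
    """
    return _solve({frozenset(e) for e in edges}, k)


def _solve(edges, k):
    if not edges:
        return set()
    if k == 0:
        return None  # edges remain but the budget is exhausted
    degree = {}
    for e in edges:
        for v in e:
            degree[v] = degree.get(v, 0) + 1
    # Kernelisation: a vertex of degree > k belongs to every cover of size <= k.
    for v, d in degree.items():
        if d > k:
            sub = _solve({e for e in edges if v not in e}, k - 1)
            return None if sub is None else sub | {v}
    # All degrees are now <= k, so k vertices can cover at most k * k edges.
    if len(edges) > k * k:
        return None
    # Depth-bounded search tree: one endpoint of any edge must be in the cover.
    for w in tuple(next(iter(edges))):
        sub = _solve({e for e in edges if w not in e}, k - 1)
        if sub is not None:
            return sub | {w}
    return None


star = [(0, 1), (0, 2), (0, 3), (0, 4)]
triangle = [("a", "b"), ("b", "c"), ("a", "c")]

star_cover = vertex_cover(star, 1)
triangle_small = vertex_cover(triangle, 1)
triangle_cover = vertex_cover(triangle, 2)
```

The recursion depth is bounded by the parameter \(k\) rather than by the size of the graph, which is the property that makes such algorithms practical when \(k\) is small.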
Constructed as a collection of chapters illustrating state-of-the-art methods across a wide variety of bioinformatics approaches, the book is an excellent starting point for a wide audience including undergraduates, graduates and established researchers alike. The amount of detail presented for each methodological approach, coupled with extensive examples, facilitates not only the understanding of each topic but also the bridging between the various tasks associated with the mining of big (high-throughput) biological datasets.

MSC:

92-02 Research exposition (monographs, survey articles) pertaining to biology
92-08 Computational methods for problems pertaining to biology
92D20 Protein sequences, DNA sequences
92C40 Biochemistry, molecular biology
92C42 Systems biology, networks
92C50 Medical applications (general)
00B15 Collections of articles of miscellaneous specific interest

Citations:

Zbl 1378.92002