Data mining techniques for the life sciences. 2nd edition. (English) Zbl 1353.92002

Methods in Molecular Biology 1415. New York, NY: Humana Press/Springer (ISBN 978-1-4939-3570-3/hbk; 978-1-4939-3572-7/ebook). xiii, 552 p. (2016).
This is a second edition which follows the principle of the first edition: “good science is made by good questions”, i.e., in order to retrieve meaningful information from a large, inert dataset, efficient and adequate methods are needed. Besides the original topics of the first edition (databases, computational techniques and prediction methods), the editors and authors here also discuss approaches to tackling big data issues.
The book is structured in four parts. The first one, Databases, consists of eight chapters describing various protein databases and their use for data mining, classification and exploration. In the first chapter, the author presents an update on the databases and resources publicly available at the National Center for Biotechnology Information (NCBI). The structure of the databases of genomes, genes, their variants and expression data, and efficient ways to navigate, search and retrieve information, are presented. Tools for comparative genomics tasks and for adequate visualization are also described. In the second chapter, the author focuses on the usefulness of protein structure databases such as Protein Data Bank, wwPDB, JenaLib, OCA, PDBe, PDBsum, Pfam and others. He underlines the importance of classification based on the 3D structures and the ability to identify evolutionary relationships from the structure when the sequence comparison cannot provide a clear answer. The third chapter describes the MIntAct project and the databases of molecular interactions (IMEx databases) which enable an in-depth description of the interactome in several model organisms. In the fourth chapter, the authors propose the use of a protein thermodynamic database (ProTherm, containing experimental measurements such as circular dichroism, differential scanning calorimetry, fluorescence spectroscopy) for the understanding of protein mutant stability (trends on mutational effects) and an adequate design of stable mutants (and resulting amino-acid properties). This information is used to determine the relation between thermodynamics, structure and function of proteins. In the fifth chapter, the authors present Kbdock, a protein domain structure database, and its associated website, which can be used for the classification and exploration of 3D protein domain interactions (domain-domain interactions, DDIs, and domain-peptide interactions, DPIs, at the Pfam domain level).
In the sixth chapter, the authors present the Protein Data Bank (PDB), a standard resource for macromolecular structures, and discuss the challenges of standardizing models, generating annotations and preserving the uniformity of the entries, especially those obtained using X-ray crystallography. PDB_REDO is offered as an alternative. In the seventh chapter, the authors propose a set of standards for extracting high quality, non-redundant PDB subsets. The authors argue that the crystallographic resolution is not sufficient; however, additional features such as B-factor values, quality of the electron density maps and the temperature of the diffraction experiments could be employed as stringent criteria. The last chapter of this part describes a protocol for in silico homology-based annotation of large protein datasets which is based on the information available on manually curated collections of protein families (e.g., Pfam).
The second part, Computational techniques, comprises five chapters and commences with a chapter on the identification and correction of errors in protein sequences submitted to public databases using the MisPred and FixPred tools and the Pfam database. In the tenth chapter, the authors present Cryo-Electron Microscopy (Cryo-EM) and Cryo-EM density maps of protein assemblies which, coupled with high-resolution structures, can be used for improving the accuracy of fitted atomic models and for the analysis of pseudo-atomic models. The study of evolutionary conservation of residues of protein structures and multiple alignments of homologous proteins for the detection of errors in the fitting of the models are also presented. The eleventh chapter proposes a novel amino acid substitution matrix, MIQS, which enables the identification of distantly related proteins. Using a principal component analysis (PCA) approach on a subspace of existing matrices, the authors highlight that MIQS shows better accuracy in benchmarking than other options. In the twelfth chapter, the authors discuss the pros and cons of high-throughput -omics assays and highlight strategies and guidelines that may prevent errors in experimental design or data analysis. The importance of adequate replication and multiple testing is discussed at length, from multiple angles. The thirteenth chapter focuses on an efficient method (in terms of speed and accuracy) for mapping RNA-seq reads: the STAR (spliced transcripts alignment to a reference) approach. The main options, parameters and best-practice advice are also discussed.
The third part of the book, Prediction methods, comprises twelve chapters. It commences with the description of a method for predicting protein conformational disorder, i.e., the lack of a stable 3D structure. The cause was identified as particular regions at amino-acid level, facilitating their computational prediction. Here, the authors present several methods (e.g., DisMeta, GeneSilico MetaDisorder MD2, MetaPrDOS, Multicom, MFDp, Pondr-Fit, PredictProtein, MeDor) used for the identification of such regions involved in induced folding. In the fifteenth chapter, the authors focus on classes of protein kinases, classified by sequence similarity, regulating several different signalling pathways. The ability to phosphorylate is linked to the substrate specificity which, in turn, is determined by the residues at specific binding sites. This feature is assessed here and used to refine the classification scheme for kinases. The sixteenth chapter focuses on methods based on the spectral-statistical approach (2S-approach) for revealing latent regular structures (and latent periodicity) in DNA sequences from the HeteroGenome database. The authors describe the core of these methods, based on approximate tandem repeats, and discuss examples showing a correlation between the latent profile periodicity and the structural-functional properties of the proteins. In the seventeenth chapter, the authors discuss a major challenge in protein structure research, the assessment of protein crystallizability by obtaining diffraction-quality crystals. They compare several methods for selecting suitable protein targets for crystallization, the evaluation of construct optimization and crystallization condition design. The eighteenth chapter presents ngs.plot, a tool for the analysis and visualization of ChIP-seq and RNA-seq alignments, which can be run from the command line and as a web-based workflow on the Galaxy framework.
More specifically, this approach facilitates the identification of spatial relationships between enriched regions and genomic features. The nineteenth chapter discusses the use of ontologies and the web ontology language (OWL). The authors comment on efficient approaches to extract information based on the structure and content of such ontologies and on potential pitfalls such as choosing a “right” similarity measure. In the next chapter (20), the authors focus on the functional analysis of metabolomics data, more specifically on an annotation enrichment analysis, similar to the one performed for transcriptomics or proteomics data. In the twenty-first chapter, the authors present the data analysis of bacterial transcriptomes, with an emphasis on the main bioinformatics steps to process a next generation sequencing dataset, from the raw data to the expression analysis of assembled or annotated genomes. In the twenty-second chapter, the authors conduct a broad overview of computational methods for predicting the pathophysiological effects of non-synonymous variants based on the investigation of the vast human genetic variability. Using a variety of methods, the identification and characterization of several kinds of mismatches, with the potential to induce pathogenic phenotypes or disease susceptibility, is discussed. The twenty-third chapter is dedicated to the objective evaluation of methods and techniques for the prediction of drug-target interaction and the evaluation of drug repositioning. Some extensions beyond the network case are also presented. In the twenty-fourth chapter, the authors describe a prediction method for protein residue contacts, DNcon, which is thought to be a promising approach for solving the enduring problem of ab initio protein structure prediction. The third part concludes with a chapter on protein sequence-based function prediction and the use of the ANNOTATOR environment.
The last part of the book focuses on big data and consists of two chapters. The first discusses the use of metagenomic analysis for the description of gut microbiota. Based on the codon usage profiles, the authors show that the bias present throughout the entire microbial community could be used to predict its lifestyle-specific metabolism. The last chapter describes the iPlant initiative established to facilitate the processing of large plant datasets.
The style of the book and the assortment of topics presented make it accessible to a wide range of audiences, from undergraduates to established researchers, and from a variety of backgrounds (biologists, chemists, bioinformaticians). This collection of articles, highlighting the state of the art for protein analyses, can also be used as a brief yet thorough starting point for post-graduate projects.


92-01 Introductory exposition (textbooks, tutorial papers, etc.) pertaining to biology
62-01 Introductory exposition (textbooks, tutorial papers, etc.) pertaining to statistics
62-07 Data analysis (statistics) (MSC2010)
92B15 General biostatistics
62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology