Classifiers for discrimination of significant protein residues and protein-protein interaction using concepts of information theory and machine learning. (English) Zbl 1402.92006

Göttingen: Univ. Göttingen, Mathematisch-Naturwissenschaftliche Fakultäten (Diss.). 130 p. (2011).
Summary: Protein analysis is a major research area in bioinformatics. Predicting important sites in proteins receives particular attention because it can reduce the cost and time of experimental protein analysis. Following our success with theoretical approaches for detecting horizontal gene transfer, we decided to use a similar approach for predicting important residues in a protein chain. An efficient predictor needs classifiers that separate the important protein residues from the rest of the protein chain. Developing and refining two such classifiers is the topic of this thesis. The first classifier is based on information theory and uses entropy and mutual information to rate protein residues. We use multiple sequence alignments to calculate the entropy of a residue pair and its mutual information. The mutual information indicates the correlation between the two residues and thus their co-evolution. Through statistical means, we identify residues whose entropy values are significant with respect to co-evolution. By applying a threshold, the top-rated residues are classified as important sites of the protein. This classifier is very successful in detecting single nucleotide polymorphisms. The second classifier is based on the distribution of amino acids in a protein and focuses on detecting protein interfaces using concepts from machine learning. Based on existing data, we analyze the neighborhood of known interface residues and use a machine learning algorithm to create a hypothesis. This hypothesis is then used to predict interface residues on a selected protein chain. This classifier has very good accuracy, and its focus can easily be adjusted to fit different approaches to protein analysis. Together, the two classifiers offer a good basis for predicting important protein sites and show promising results in experiments. Because of the theoretical concepts involved, they can also be easily adapted for other analytical purposes.
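
The entropy and mutual information computation described in the summary can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The per-column score (summed mutual information with all other columns), the `top_k` cutoff standing in for the threshold, the base-2 logarithm, and the toy alignment are all assumptions made for illustration. The summary does not specify the statistical significance test or any gap handling, so neither is modelled here.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def mutual_information(col_i, col_j):
    """MI(i, j) = H(i) + H(j) - H(i, j) for two alignment columns."""
    joint = list(zip(col_i, col_j))  # paired residues define the joint distribution
    return entropy(col_i) + entropy(col_j) - entropy(joint)


def rank_columns(alignment, top_k):
    """Score each column by its summed MI with all other columns
    (an assumed scoring rule) and return the top_k column indices
    together with all scores."""
    columns = list(zip(*alignment))  # transpose: rows (sequences) -> columns
    scores = [0.0] * len(columns)
    for i, j in combinations(range(len(columns)), 2):
        mi = mutual_information(columns[i], columns[j])
        scores[i] += mi
        scores[j] += mi
    ranked = sorted(range(len(columns)), key=lambda k: scores[k], reverse=True)
    return ranked[:top_k], scores


# Hypothetical toy alignment: columns 0 and 1 vary together,
# column 2 varies independently, column 3 is fully conserved.
toy_alignment = ["ACDA", "ACEA", "GTDA", "GTEA"]
top, scores = rank_columns(toy_alignment, top_k=2)
```

In this toy alignment only the first two columns share information, so they are the ones selected. A fully conserved column has zero entropy and therefore zero mutual information with every other column, which is why conservation alone does not register as a co-evolution signal in this kind of score.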


MSC:
92-02 Research exposition (monographs, survey articles) pertaining to biology
92D20 Protein sequences, DNA sequences
92C40 Biochemistry, molecular biology
92-04 Software, source code, etc. for problems pertaining to biology
68T05 Learning and adaptive systems in artificial intelligence
62P10 Applications of statistics to biology and medical sciences; meta analysis
94A15 Information theory (general)


Software: H2r; meta-PPISP; SHARP2