zbMATH — the first resource for mathematics

Enzymes/non-enzymes classification model complexity based on composition, sequence, 3D and topological indices. (English) Zbl 1400.92405
Summary: The large number of new proteins that need fast characterization of enzymatic activity creates a demand for theoretical protein QSAR models. The protein parameters that can be used for enzyme/non-enzyme classification include simpler indices such as composition, sequence and connectivity indices, the last also called topological indices (TIs), as well as the computationally expensive 3D descriptors. A comparison of 3D versus lower-dimensional indices, with respect to their power to discriminate proteins according to enzyme action, has not been reported. A set of 966 proteins (enzymes and non-enzymes), whose structural characteristics are provided by PDB/DSSP files, was analyzed with Python/Biopython scripts, STATISTICA and Weka. The list of indices includes, but is not restricted to, pure composition indices (residue fractions), DSSP secondary structure protein composition and 3D indices (surface and access). We also used mixed indices such as composition-sequence indices (Chou’s pseudo-amino acid compositions or coupling numbers), 3D-composition indices (surface fractions) and DSSP secondary structure amino acid composition/propensities (obtained with our Prot-2S Web tool). In addition, we extend and test for the first time several classic TIs for Randic’s protein sequence Star graphs using our Sequence to Star Graph (S2SG) Python application. All the indices were processed with general discriminant analysis (GDA) models, neural networks (NN) and machine learning (ML) methods, and the results are presented versus complexity, average Shannon information entropy (Sh) and data/method type. This study compares all these classes of indices for the first time to assess the ratio between model accuracy and index/model complexity in enzyme/non-enzyme discrimination. The use of different methods and data of differing complexity shows that one cannot establish a direct relation between the complexity and the accuracy of the model.
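
As an illustration of the simplest class of indices named in the summary, the following minimal Python sketch computes pure composition indices (residue fractions) for a protein sequence and the Shannon entropy of that composition. It is not the authors' code: the function names, the toy sequence and the choice to take the entropy over the residue-fraction distribution are assumptions made here for illustration, and the summary does not state how the averaged Sh values were actually defined.

```python
from collections import Counter
from math import log2

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_fractions(sequence):
    """Return the fraction of each standard residue in `sequence`.

    Hypothetical helper: non-standard symbols are ignored, and the
    fractions are normalised over the standard residues only.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acid residues")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def shannon_entropy(fractions):
    """Shannon entropy in bits of a discrete distribution given as a dict.

    Assumption: entropy is taken over the residue fractions; zero
    fractions contribute nothing to the sum.
    """
    return -sum(p * log2(p) for p in fractions.values() if p > 0)


# Toy sequence, chosen only so the numbers are easy to check by hand.
example = "ACDA"
fractions = residue_fractions(example)
entropy = shannon_entropy(fractions)
```

For the toy sequence the fractions are 0.5 for A and 0.25 each for C and D. A 20-component fraction vector of this kind could be used as one row of a feature table passed to a classifier, while the entropy gives a single-number measure of how evenly the residues are distributed.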

MSC:
92D20 Protein sequences, DNA sequences
68T05 Learning and adaptive systems in artificial intelligence