LEGO-based generalized set of two linear algebraic 3D bio-macro-molecular descriptors: theory and validation by QSARs. (English) Zbl 1425.92148

Summary: Novel 3D protein descriptors based on bilinear, quadratic and linear algebraic maps in \(\mathbb{R}^n\) are proposed. The latter employs the \(k^{\mathrm{th}}\) 2-tuple (dis)similarity matrix to codify information related to covalent and non-covalent interactions in these biopolymers. The calculation of the inter-amino acid distances is generalized by using several dissimilarity coefficients, to which normalization procedures based on the simple stochastic and mutual probability schemes are applied. A new local-fragment approach based on amino acid types and amino acid groups is proposed to characterize regions of interest in proteins. Topological and geometric macromolecular cutoffs are defined using local and total indices to highlight non-covalent interactions between the side chains of the amino acids. Moreover, the calculation of local and total indices is generalized through a LEGO approach that uses several aggregation operators. Collinearity and variability analyses are performed to evaluate every generalizing component applied to the definition of these novel indices. These experiments aim to reduce the number of molecular descriptors (MDs) used to build prediction models. The predictive power of the proposed indices was evaluated using two benchmark datasets: folding rate and secondary structural classification of proteins. The proposed MDs are modeled using multiple linear regression (MLR) and support vector machine (SVM), respectively. The best regression model developed for the folding rate of proteins yields a cross-validation coefficient of 0.875 (test set), and the best model developed for secondary structural classification correctly classifies 98% of instances (test set). These statistical parameters are superior to the ones obtained with existing MDs reported in the literature.
Overall, the new theoretical generalization enhanced the information captured by the MDs, allowing a better correlation between the two evaluated benchmark datasets and the proposed indices. The optimal theoretical configurations defined for the calculation of these MDs exhibit low collinearity and little information redundancy among them. These theoretical configurations and the software are available at http://tomocomd.com/mulims-mcompas.
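
The ingredients named in the summary can be illustrated with a minimal sketch. It is not the authors' software: the function names, the use of Euclidean distance as the dissimilarity coefficient, the entry-wise power for the \(k^{\mathrm{th}}\) matrix, and the toy coordinates and property values are all assumptions made for illustration. Per-residue (local) contributions to a bilinear form are computed first and then combined by an aggregation operator, in the spirit of the LEGO approach described above.

```python
import math

def dissimilarity_matrix(coords, k=1):
    """Pairwise Euclidean distances, each raised entry-wise to the power k (assumed scheme)."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) ** k for j in range(n)] for i in range(n)]

def simple_stochastic(m):
    """Row-normalize a matrix so that every non-zero row sums to 1."""
    out = []
    for row in m:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out

def local_contributions(x, m, y):
    """Per-residue terms x_i * sum_j m_ij * y_j; their sum is the bilinear form x^T M y.
    Passing y = x gives a quadratic form; passing x = all ones gives a linear form."""
    return [xi * sum(mij * yj for mij, yj in zip(row, y)) for xi, row in zip(x, m)]

def aggregate(values, operator="sum"):
    """Combine local contributions into a total index with a chosen aggregation operator."""
    operators = {
        "sum": sum,
        "mean": lambda v: sum(v) / len(v),
        "max": max,
    }
    return operators[operator](values)

# Hypothetical toy data: three residues with 3D coordinates and two property vectors.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]
x = [1.0, 2.0, 3.0]   # e.g. one amino acid property (illustrative values)
y = [1.0, 0.0, 2.0]   # e.g. a second amino acid property (illustrative values)

m = dissimilarity_matrix(coords, k=1)
bilinear_total = aggregate(local_contributions(x, m, y), "sum")
quadratic_total = aggregate(local_contributions(x, m, x), "sum")
stochastic_m = simple_stochastic(m)
```

Swapping the aggregation operator (for example `"mean"` or `"max"` instead of `"sum"`) changes how local contributions are combined without touching the matrix or the property vectors, which is the kind of modularity the LEGO generalization refers to.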


92D20 Protein sequences, DNA sequences
62P10 Applications of statistics to biology and medical sciences; meta analysis
92-08 Computational methods for problems pertaining to biology


[1] Agüero-Chapin, G.; González-Díaz, H.; Molina, R.; Varona-Santos, J.; Uriarte, E.; González-Díaz, Y., Novel 2D maps and coupling numbers for protein sequences. The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L., FEBS Lett., 580, 723-730 (2006)
[2] Balaban, A., Local versus global (i.e. atomic versus molecular) numerical modeling of molecular graphs, J. Chem. Inf. Comput. Sci., 34, 398-402 (1994)
[3] Balaban, A., Chemical Applications of Graph Theory (1976), Academic Press: London
[4] Balaban, A.; Bertelsen, S.; Basak, S. C., New centric topological indexes for acyclic molecules (trees) and substituents (rooted trees), and coding of rooted trees, MATCH - Commun. Math. Comput. Chem., 30, 55-72 (1994)
[5] Balaban, A.; Feroiu, V., Correlation between structure and critical data or vapor pressures of alkanes by means of topological indices, Reports Molec. Theory, 1, 130-139 (1990)
[6] Barigye, S.; Marrero-Ponce, Y.; Santiago, O.; Lopez, Y.; Perez-Gimenez, F.; Torrens, F., Shannon’s, mutual, conditional and joint entropy information indices: generalization of global indices defined from local vertex invariants, Curr. Comput. Aided-Drug Des., 9, 164-183 (2013)
[7] Barigye, S. J.; Marrero Ponce, Y.; Martínez-López, Y.; Torrens, F.; Artiles-Martínez, L. M.; Pino-Urias, R. W.; Martínez-Santiago, O., Relations frequency hypermatrices in mutual, conditional, and joint entropy-based information indices, J. Comput. Chem., 34, 259-274 (2012)
[8] Beliakov, G., How to build aggregation operators from data, Int. J. Intell. Syst., 18, 903-923 (2003) · Zbl 1074.68607
[9] Breda, A.; Valadares, N. F.; de Souza, O. N.; Garratt, R. C., Ch A06: protein structure, modelling and applications, Bioinforma. Trop. Dis. Res. A Pract. Case-Study Approach, 1-41 (2007)
[10] Cai, Y.-D.; Feng, K.-Y.; Lu, W.-C.; Chou, K.-C., Using logitboost classifier to predict protein structural classes, J. Theor. Biol., 238, 172-176 (2006)
[11] Cai, Y.-D.; Liu, X.-J.; Xu, X.; Chou, K.-C., Prediction of protein structural classes by support vector machines, Comput. Chem., 26, 293-296 (2002)
[12] Castillo-Garit, J. A.; Martinez-Santiago, O.; Marrero Ponce, Y.; Casañola-Martín, G. M.; Torrens, F., Atom-based non-stochastic and stochastic bilinear indices: application to QSPR/QSAR studies of organic compounds, Chem. Phys. Lett., 464, 107-112 (2008)
[13] Chen, K.; Kurgan, L. A.; Ruan, J., Prediction of protein structural class using novel evolutionary collocation-based sequence representation, J. Comput. Chem., 29, 1596-1604 (2008)
[14] Chou, K.-C., Progress in protein structural class prediction and its impact to bioinformatics and proteomics, Curr. Protein Pept. Sci. (2005)
[15] Chou, K.-C., A key driving force in determination of protein structural classes, Biochem. Biophys. Res. Commun., 264, 216-224 (1999)
[16] Chou, K.-C.; Shen, H.-B., FoldRate: a web-server for predicting protein folding rates from primary sequence, Open Bioinforma. J., 3, 31-50 (2009)
[17] Chou, P. Y.; Fasman, G. D., Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins, Biochemistry, 13, 211-222 (1974)
[18] Collantes, E. R.; Dunn, W. J., Amino acid side chain descriptors for quantitative structure-activity relationship studies of peptide analogues, J. Med. Chem., 38, 2705-2713 (1995)
[19] Cubillan, N.; Marrero-Ponce, Y.; Ariza-Rico, H.; Barigye, S.; García-Jacas, C.; Valdés Martiní, J.; Alvarado, Y., Novel global and local 3D atom-based linear descriptors of the Minkowski distance matrix: theory, diversity-variability analysis and QSPR applications, J. Math. Chem. (2015) · Zbl 1328.92090
[20] Devillers, J.; Balaban, A., Topological Indices and Related Descriptors in QSAR and QSPR (1999), Gordon and Breach Science Publishers
[21] Di Paola, L.; De Ruvo, M.; Paci, P.; Santoni, D.; Giuliani, A., Protein contact networks: an emerging paradigm in chemistry, Chem. Rev., 113, 1598-1613 (2013)
[22] Dorn, M.; Barbachan, M.; Buriol, L.; Lamb, L., Three-Dimensional protein structure prediction: methods and computational strategies, Comput. Biol. Chem. (2014)
[23] Du, P.; Wang, X.; Xu, C.; Gao, Y., PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions, Anal. Biochem., 425, 117-119 (2012)
[24] Fleming, P. J.; Richards, F. M., Protein packing: dependence on protein size, secondary structure and amino acid composition, J. Mol. Biol., 299, 487-498 (2000), edited by F. E. Cohen
[25] García-Jacas, C.; Contreras-Torres, E.; Marrero-Ponce, Y.; Pupo-Meriño, M.; Barigye, S. J.; Cabrera-Leyva, L., Examining the predictive accuracy of the novel 3D N-linear algebraic molecular codifications on benchmark datasets, J. Cheminform., 8, 1-16 (2016)
[26] García-Jacas, C.; Marrero-Ponce, Y.; Acevedo-Martínez, L.; Barigye, S. J.; Valdés-Martiní, J. R.; Contreras-Torres, E., QuBiLS-MIDAS: a parallel free-software for molecular descriptors computation based on multilinear algebraic maps, J. Comput. Chem., 35, 1395-1409 (2014)
[27] Garcia-Jacas, C.; Marrero-Ponce, Y.; Barigye, S. J.; Valdes-Martin, J. R.; Rivera-Borroto, O. M.; Olivero-Verbel, J., N-Linear algebraic maps for chemical structure codification: a suitable generalization for atom-pair approaches?, Curr. Drug Metab., 15, 441-469 (2014)
[28] García-Jacas, C.; Marrero Ponce, Y.; Barigye, S. J.; Hernández-Ortega, T.; Cabrera-Leyva, L.; Fernández-Castillo, A., N-tuple topological/geometric cutoffs for 3D N-linear algebraic molecular codifications: variability, linear independence and QSAR analysis, SAR QSAR Environ. Res., 27, 949-975 (2016)
[29] Godden, J. W.; Stahura, F. L.; Bajorath, J., Variability of molecular descriptors in compound databases revealed by shannon entropy calculations, J. Chem. Inf. Comput. Sci., 40, 796-800 (2000)
[30] González-Díaz, H.; Gia, O.; Uriarte, E.; Hernández, I.; Ramos, R.; Chaviano, M.; Seijo, S.; Castillo, J. A.; Morales, L.; Santana, L.; Akpaloo, D.; Molina, E.; Cruz, M.; Torres, L. A.; Cabrera, M. A., Markovian chemicals “in silico” design (MARCH-INSIDE), a promising approach for computer-aided molecular design I: discovery of anticancer compounds, J. Mol. Model., 9, 395-407 (2003)
[31] Gonzalez-Diaz, H.; Vilar, S.; Santana, L.; Uriarte, E., Medicinal chemistry and bioinformatics - current trends in drugs discovery with networks topological indices, Curr. Top. Med. Chem., 7, 1015-1029 (2007)
[32] Gramatica, P., Principles of QSAR models validation: internal and external, QSAR Comb. Sci., 26, 694-701 (2007)
[33] Gromiha, M., Importance of native-state topology for determining the folding rate of two-state proteins, J. Chem. Inf. Comput. Sci., 43, 1481-1485 (2003)
[34] Gromiha, M.; Saraboji, K.; Ahmad, S.; Ponnuswamy, M. N.; Suwa, M., Role of non-covalent interactions for determining the folding rate of two-state proteins, Biophys. Chem., 107, 263-272 (2004)
[35] Gromiha, M.; Selvaraj, S., Comparison between long-range interactions and contact order in determining the folding rate of two-state proteins: application of long-range order to folding rate prediction, J. Mol. Biol., 310, 27-32 (2001)
[36] Gutman, I.; Polansky, O., Mathematical Concepts in Organic Chemistry (2012), Springer (https://www.springer.com/gp/book/9783642709845) · Zbl 0657.92024
[37] Hellberg, S.; Sjoestroem, M.; Skagerberg, B.; Wold, S., Peptide quantitative structure-activity relationships, a multivariate approach, J. Med. Chem., 30, 1126-1135 (1987)
[38] Hopp, T. P.; Woods, K. R., Prediction of protein antigenic determinants from amino acid sequences, Proc. Natl. Acad. Sci. USA, 78, 3824-3828 (1981)
[39] Kyte, J.; Doolittle, R. F., A simple method for displaying the hydropathic character of a protein, J. Mol. Biol., 157, 105-132 (1982)
[40] Léger, C.; Politis, D. N.; Romano, J. P., Bootstrap technology and applications, Technometrics, 34, 378-398 (1992) · Zbl 0850.62367
[41] Levitt, M.; Chothia, C., Structural patterns in globular proteins, Nature, 261, 552-558 (1976)
[42] Li, Z. R.; Lin, H. H.; Han, L. Y.; Jiang, L.; Chen, X.; Chen, Y. Z., PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence, Nucleic Acids Res., 34, W32-W37 (2006)
[43] Marrero Ponce, Y., Linear indices of the “molecular pseudograph’s atom adjacency matrix”: definition, significance-interpretation, and application to QSAR analysis of flavone derivatives as HIV-1 integrase inhibitors, J. Chem. Inf. Comput. Sci., 44, 2010-2026 (2004)
[44] Marrero Ponce, Y.; Contreras-Torres, E.; García-Jacas, C.; Barigye, S. J.; Cubillán, N.; Alvarado, Y. J., Novel 3D bio-macromolecular bilinear descriptors for protein science: predicting protein structural classes, J. Theor. Biol., 374, 125-137 (2015) · Zbl 1341.92053
[45] Marrero Ponce, Y.; Garcia-Jacas, C.; Barigye, S. J.; Valdes-Martin, J. R.; Rivera-Borroto, O. M.; Pino-Urias, R. W.; Cubillan, N.; Alvarado, Y. J.; Le-Thi-Thu, H., Optimum search strategies or novel 3D molecular descriptors: is there a stalemate?, Curr. Bioinform., 10 (2015)
[46] Marrero Ponce, Y.; Medina-Marrero, R.; Castillo-Garit, J. A.; Romero-Zaldivar, V.; Torrens, F.; Castro, E. A., Protein linear indices of the ‘macromolecular pseudograph α-carbon atom adjacency matrix’ in bioinformatics. Part 1: prediction of protein stability effects of a complete set of alanine substitutions in Arc repressor, Bioorg. Med. Chem., 13, 3003-3015 (2005)
[47] Marrero Ponce, Y.; Medina Marrero, R.; Castro, E. A.; Ramos De Armas, R.; González Díaz, H.; Romero Zaldivar, V.; Torrens, F., Protein quadratic indices of the “macromolecular pseudograph’s α-carbon atom adjacency matrix”. 1. Prediction of Arc repressor alanine-mutant’s stability, Molecules, 9, 1124-1147 (2004)
[48] Marrero Ponce, Y.; Torrens, F.; García-Domenech, R.; Ortega-Broche, S. E.; Zaldivar, V. R., Novel 2D TOMOCOMD-CARDD molecular descriptors: atom-based stochastic and non-stochastic bilinear indices and their QSPR applications, J. Math. Chem., 44, 650-673 (2008) · Zbl 1217.92095
[49] Martínez López, Y.; Marrero-Ponce, Y.; Echeverri Jaramillo, G.; Barigye, S., The summation of atomic contributions is an overly simplified characterization of the holistic molecular behavior, Lett. Drug Des. Discov. (2016)
[50] Massy, W. F., Principal components regression in exploratory statistical research, J. Am. Stat. Assoc., 60, 234-256 (1965)
[51] Mihalić, Z.; Trinajstić, N., A graph-theoretical approach to structure-property relationships, J. Chem. Educ., 69, 701 (1992)
[52] Murzin, A. G.; Brenner, S. E.; Hubbard, T.; Chothia, C., SCOP: a structural classification of proteins database for the investigation of sequences and structures, J. Mol. Biol., 247, 536-540 (1995)
[53] Nelson, D. L.; Cox, M. M., Lehninger Principles of Biochemistry (2017), Macmillan Learning: New York
[54] Nikolić, S.; Trinajstić, N.; Mihalić, Z.; Carter, S., On the geometric-distance matrix and the corresponding structural invariants of molecular systems, Chem. Phys. Lett., 179, 21-28 (1991)
[55] Ortega-Broche, S. E.; Marrero Ponce, Y.; Díaz, Y. E.; Torrens, F.; Pérez-Giménez, F., TOMOCOMD-CAMPS and protein bilinear indices - novel bio-macromolecular descriptors for protein research: I. Predicting protein stability effects of a complete set of alanine substitutions in the Arc repressor, FEBS J., 277, 3118-3146 (2010)
[56] Ouyang, Z.; Liang, J., Predicting protein folding rates from geometric contact and amino acid sequence, Protein Sci., 17, 1256-1263 (2008)
[57] Pino, R. W.; Barigye, S. J.; Marrero Ponce, Y.; García-Jacas, C.; Valdes-Martiní, J. R.; Perez-Gimenez, F., IMMAN: free software for information theory-based chemometric analysis, Mol. Divers., 19, 305-319 (2015)
[58] Plaxco, K. W.; Simons, K. T.; Baker, D., Contact order, transition state placement and the refolding rates of single domain proteins, J. Mol. Biol., 277, 985-994 (1998)
[59] Randić, M., Molecular shape profiles, J. Chem. Inf. Model., 35, 373-382 (1995)
[60] Rouvray, D. H., Computational Chemical Graph Theory (1990), Elsevier: New York
[61] Ruiz-Blanco, Y. B.; Marrero Ponce, Y.; Prieto, P. J.; Salgado, J.; García, Y.; Sotomayor-Torres, C. M., A Hooke’s law-based approach to protein folding rate, J. Theor. Biol., 364, 407-417 (2015) · Zbl 1405.92221
[62] Sak, K.; Karelson, M.; Järv, J., Modeling of the amino acid side chain effects on peptide conformation, Bioorg. Chem., 27, 434-442 (1999)
[63] Sillero, A.; Ribeiro, J. M., Isoelectric points of proteins: theoretical determination, Anal. Biochem., 179, 319-325 (1989)
[64] Somorjai, R. L., Multivariate statistical methods, (Encyclopedia of Spectroscopy and Spectrometry (2016)), 962-966
[65] Todeschini, R.; Consonni, V., New local vertex invariants and molecular descriptors based on functions of the vertex degrees, MATCH - Commun. Math. Comput. Chem., 64, 359-372 (2010)
[66] Todeschini, R.; Consonni, V., Molecular descriptors for chemoinformatics, Methods and Principles in Medicinal Chemistry (2009), Wiley-VCH Verlag GmbH & Co. KGaA: Weinheim, Germany
[67] Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M., MobyDigs: software for regression and classification models by genetic algorithms, Data Handl. Sci. Technol. (2003)
[68] Tropsha, A., Best practices for QSAR model development, validation, and exploitation, Mol. Inform., 29, 476-488 (2010)
[69] Tropsha, A.; Gramatica, P.; Gombar, V. K., The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models, QSAR Comb. Sci., 22, 69-77 (2003)
[70] Valdés-Martiní, J. R.; Marrero-Ponce, Y.; García-Jacas, C. R.; Martinez-Mayorga, K.; Barigye, S. J.; Vaz D’Almeida, Y. S.; Pham-The, H.; Pérez-Giménez, F.; Morell, C. A., QuBiLS-MAS, open source multi-platform software for atom- and bond-based topological (2D) and chiral (2.5D) algebraic molecular descriptors computations, J. Cheminform., 9, 1-26 (2017)
[71] Willett, P., Chemoinformatics – similarity and diversity in chemical libraries, Curr. Opin. Biotechnol., 11, 85-88 (2000)
[72] Witten, I. H.; Frank, E.; Hall, M. A.; Pal, C. J., Chapter 5 - Credibility: Evaluating What’s Been Learned, Data Mining: Practical Machine Learning Tools and Techniques, 161-203 (2017), Morgan Kaufmann
[73] Witten, I. H.; Frank, E.; Hall, M. A.; Pal, C. J., Appendix B - The WEKA workbench, Data Mining: Practical Machine Learning Tools and Techniques, 553-571 (2017), Morgan Kaufmann
[74] Zamyatnin, A. A., Protein volume in solution, Prog. Biophys. Mol. Biol., 24, 107-123 (1972)
[75] Zhang, T.-L.; Ding, Y.-S., Using pseudo amino acid composition and binary-tree support vector machines to predict protein structural classes, Amino Acids, 33, 623-629 (2007)
[76] Zhang, T.-L.; Ding, Y.-S.; Chou, K.-C., Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern, J. Theor. Biol., 250, 186-193 (2008) · Zbl 1397.92551
[77] Zhou, H.; Zhou, Y., Folding rate prediction using total contact distance, Biophys. J., 82, 458-463 (2002)
[78] Zhou, X.-B.; Chen, C.; Li, Z.-C.; Zou, X.-Y., Using Chou’s amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes, J. Theor. Biol., 248, 546-551 (2007)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.