A time series representation of protein sequences for similarity comparison. (English) Zbl 1483.92105

Summary: Based on the physicochemical indexes of the 20 amino acids and the Hungarian algorithm, each amino acid was mapped to a vector, so that a protein sequence can be represented as a time series in eleven-dimensional space. The dynamic time warping (DTW) algorithm was then applied to compute the distance between two such time series and thereby compare the similarity of protein sequences. The validity and accuracy of this method were illustrated by a similarity comparison of the ND5 proteins of nine species. Furthermore, a homology analysis of eleven ACE2 proteins, including those of human, the Malayan pangolin and six species of bats, confirmed that human is at a shorter evolutionary distance from the pangolin than from those bats. Finally, a phylogenetic tree was constructed for the spike protein sequences of 36 coronaviruses, which were divided into five groups: Class I, Class II, Class III, SARS-CoVs and COVID-19.
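
As a rough illustration of the pipeline described in the summary (residues mapped to vectors, then a DTW distance between the resulting series), the following Python sketch may help. It is not the authors' implementation: the table `TOY_VECTORS` is hypothetical and uses three made-up coordinates per residue rather than the paper's eleven physicochemical dimensions, the helper names `to_series` and `dtw_distance` are invented here, and the DTW variant (Euclidean local cost, unconstrained symmetric steps) is an assumption, since the summary does not specify these choices. The role of the Hungarian algorithm in building the mapping is not reproduced.

```python
import math

# Hypothetical 3-dimensional vectors, for illustration only. The paper maps
# each amino acid to an 11-dimensional vector; those values are not given here.
TOY_VECTORS = {
    "A": (0.1, 0.5, 0.2),
    "G": (0.0, 0.4, 0.1),
    "L": (0.9, 0.3, 0.7),
    "K": (0.4, 0.9, 0.6),
}


def to_series(sequence, table=TOY_VECTORS):
    """Turn a protein string into a list of vectors, one per residue."""
    return [table[residue] for residue in sequence]


def dtw_distance(xs, ys):
    """Textbook O(len(xs) * len(ys)) DTW with a Euclidean local cost."""
    n, m = len(xs), len(ys)
    # cost[i][j] = minimal accumulated cost of aligning xs[:i] with ys[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(xs[i - 1], ys[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance only in xs
                cost[i][j - 1],      # advance only in ys
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


if __name__ == "__main__":
    a = to_series("AGLK")
    b = to_series("AAGLK")  # same residues, one repeated
    c = to_series("KLGA")
    print(dtw_distance(a, b))
    print(dtw_distance(a, c))
```

Because DTW allows one point to be matched against several consecutive points of the other series, sequences of different lengths can be compared directly without a prior alignment step, which is what makes it a natural distance for this kind of representation.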

MSC:

92D20 Protein sequences, DNA sequences
92D15 Problems related to evolution

Software:

ClustalW; dtw

References:

[1] Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J., Basic local alignment search tool, J. Mol. Biol., 215, 3, 403-410 (1990)
[2] Thompson, J. D.; Higgins, D. G.; Gibson, T. J., CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res., 22, 22, 4673-4680 (1994)
[3] Zielezinski, A.; Vinga, S.; Almeida, J.; Karlowski, W. M., Alignment-free sequence comparison: benefits, applications, and tools, Genome Biol., 18, 186 (2017)
[4] Hamori, E.; Ruskin, J., H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences, J. Biol. Chem., 258, 2, 1318-1327 (1983)
[5] Gates, M. A., A simple way to look at DNA, J. Theor. Biol., 119, 3, 319-328 (1986)
[6] Jeffrey, H. J., Chaos game representation of gene structure, Nucleic Acids Res., 18, 8, 2163-2170 (1990)
[7] Nandy, A., A new graphical representation and analysis of DNA sequence structure: I. Methodology and application to globin genes, Curr. Sci., 66, 309-314 (1994)
[8] Leong, P. M.; Morgenthaler, S., Random walk and gap plots of DNA sequences, Bioinformatics, 11, 5, 503-507 (1995)
[9] Hoang, T.; Yin, C.; Yau, S.-T., Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison, Genomics, 108, 3-4, 134-142 (2016)
[10] Jin, X.; Jiang, Q.; Chen, Y., Similarity/dissimilarity calculation methods of DNA sequences: a survey, J. Mol. Graph. Model., 76, 342-355 (2017)
[11] Yao, Y.-H.; Dai, Q.; Li, C.; He, P.-A.; Nan, X.-Y.; Zhang, Y.-Z., Analysis of similarity/dissimilarity of protein sequences, Proteins, 73, 4, 864-871 (2008)
[12] Ma, T.; Liu, Y.; Dai, Q.; Yao, Y.; He, P. A., A graphical representation of protein based on a novel iterated function system, Phys. A Statist. Mech. its Appl., 403, 21-28 (2014)
[13] Hu, H.; Li, Z.; Dong, H.; Zhou, T., Graphical representation and similarity analysis of protein sequences based on fractal interpolation, IEEE/ACM Trans. Comput. Biol. Bioinf., 14, 1, 182-192 (2017)
[14] Yao, Y. H.; Yan, S.; Han, J.; Dai, Q.; He, P. A., A novel descriptor of protein sequences and its application, J. Theor. Biol., 347, 109-117 (2014) · Zbl 1412.92251
[15] He, P.-A.; Xu, S.; Dai, Q.; Yao, Y., A generalization of CGR representation for analyzing and comparing protein sequences, Int. J. Quantum Chem., 116, 6, 476-482 (2016)
[16] Zhang, Y.; Ruan, J.; He, P. A., Analyzes of the similarities of protein sequences based on the pseudo amino acid composition, Chem. Phys. Lett., 590, 239-244 (2013)
[17] Li, C.; Li, X.; Lin, Y. X., Numerical characterization of protein sequences based on the generalized Chou’s pseudo amino acid composition, Appl. Sci., 6, 406 (2016)
[18] Wu, C.; Gao, R.; De Marinis, Y.; Zhang, Y., A novel model for protein sequence similarity analysis based on spectral radius, J. Theor. Biol., 446, 61-70 (2018) · Zbl 1397.92545
[19] Mervat, M. A.; Marwa, A. A.; Moheb, I. A.; Jiangke, Y., Measuring similarity among protein sequences using a new descriptor, Biomed Res. Int., 22, 2796971 (2019)
[20] Abd Elwahaab, M. A.; Abo-Elkhier, M. M.; Abo el Maaty, M. I., A statistical similarity/dissimilarity analysis of protein sequences based on a novel group representative vector, Biomed Res. Int., 2019, 1-9 (2019)
[21] Lochel, H. F.; Eger, D.; Sperlea, T.; Heider, D., Deep learning on chaos game representation for proteins, Bioinformatics, 36, 272-279 (2020), doi:10.1093/bioinformatics/btz493
[22] Mu, Z.; Yu, T.; Liu, X.; Zheng, H.; Wei, L.; Liu, J., FEGS: a novel feature extraction model for protein sequences and its applications, BMC Bioinf., 22, 1 (2021)
[23] Chopra, S.; Notarstefano, G.; Rice, M.; Egerstedt, M., A distributed version of the Hungarian method for multirobot assignment, IEEE Trans. Rob., 33, 4, 932-947 (2017)
[24] Talkin, D., Fundamentals of speech synthesis and speech recognition, Lang. Speech, 39, 1, 91-94 (1996)
[25] Giorgino, T., Computing and visualizing dynamic time warping alignments in R: the DTW Package, J. Stat. Softw., 31, 1-24 (2009)
[26] Li, K. B., ClustalW-MPI: ClustalW analysis using distributed and parallel computing, Bioinformatics, 19, 1585-1586 (2003), doi:10.1093/bioinformatics/btg192
[27] Zhou, P.; Yang, X. L.; Wang, X. G., A pneumonia outbreak associated with a new coronavirus of probable bat origin, Nature, 579, 270-273 (2020)
[28] Wrapp, D.; Wang, N.; Corbett, K. S.; Goldsmith, J. A.; Hsieh, C.-L.; Abiona, O.; Graham, B. S.; McLellan, J. S., Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation, Science, 367, 6483, 1260-1263 (2020)
[29] Lam, T. T.; Jia, N.; Zhang, Y. W., Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins, Nature, 583, 282-285 (2020)
[30] Lopes, L. R.; de Mattos Cardillo, G.; Paiva, P. B., Molecular evolution and phylogenetic analysis of SARS-CoV-2 and hosts ACE2 protein suggest Malayan pangolin as intermediary host, Braz. J. Microbiol., 51, 4, 1593-1599 (2020)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.