zbMATH — the first resource for mathematics

Alignment free comparison: similarity distribution between the DNA primary sequences based on the shortest absent word. (English) Zbl 1336.92030
Summary: This work proposes an alignment-free comparison model for DNA primary sequences. We treat both strands of the DNA rather than a single strand. We define the shortest absent word of the double strands between DNA sequences and study some of its properties to speed up the search for it. We present a novel comparison model in which a similarity distribution is introduced to describe the similarity between sequences. A distance measure is derived from the Shannon entropy and used in phylogenetic analysis. Experiments show that the model performs well in sequence analysis.
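
As a rough illustration of the two ingredients named in the summary, the following Python sketch finds the shortest words over {A, C, G, T} that occur on neither strand of a single sequence, and computes the Shannon entropy of a discrete distribution. It is an assumption-laden sketch, not the paper's algorithm: the summary does not give the exact definition of the shortest absent word between two sequences, the similarity distribution, or the resulting distance, and the function names `reverse_complement`, `shortest_absent_words` and `shannon_entropy` are hypothetical.

```python
from itertools import product
from math import log2

# Watson-Crick complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in the 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def shortest_absent_words(seq: str) -> list[str]:
    """Return all shortest words absent from BOTH strands of seq.

    Assumption: "absent" means the word occurs neither in seq nor in
    its reverse complement. This is a brute-force search by increasing
    word length, not the accelerated search described in the paper.
    """
    strands = (seq, reverse_complement(seq))
    k = 1
    while True:
        # Collect every length-k substring seen on either strand.
        present = {s[i:i + k] for s in strands for i in range(len(s) - k + 1)}
        absent = [
            "".join(w) for w in product("ACGT", repeat=k)
            if "".join(w) not in present
        ]
        if absent:
            return absent
        k += 1


def shannon_entropy(probs) -> float:
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    words = shortest_absent_words("AC")
    print(len(words[0]), len(words))
    print(shannon_entropy([0.5, 0.5]))
```

For the sequence `AC`, the single strand lacks `G` and `T`, but the reverse complement `GT` supplies them, so the shortest absent words on the double strand have length 2. This shows why considering both strands can change the result.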

92C40 Biochemistry, molecular biology
92D20 Protein sequences, DNA sequences
[1] Blaisdell, B.E., A measure of the similarity of sets of sequences not requiring sequence alignment, Proc. natl. acad. sci. USA, 83, 5155, (1986) · Zbl 0592.92011
[2] Cao, Y.; Janke, A.; Waddell, P.J.; Westerman, M.; Takenaka, O.; Murata, S.; Okada, N.; Paabo, S.; Hasegawa, M., Conflict among individual mitochondrial proteins in resolving the phylogeny of Eutherian orders, J. mol. evol., 47, 307-322, (1998)
[3] Chang, G.S.; Wang, T.M., Phylogenetic analysis of protein sequences based on distribution of length about common substring, Protein J., 30, 167-172, (2011)
[4] Chou, K.C., Insights from modeling three-dimensional structures of the human potassium and sodium channels, J. proteome res., 3, 856-861, (2004)
[5] Chou, K.C., Some remarks on protein attribute prediction and pseudo amino acid composition, J. theor. biol., 273, 236-247, (2011) · Zbl 1405.92212
[6] Chou, K.C.; Cai, Y.D., Predicting protein-protein interactions from sequences in a hybridization space, J. proteome res., 5, 316-322, (2006)
[7] Chou, K.C.; Shen, H.B., Memtype-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through pse-PSSM, Biochem. biophys. res. commun., 360, 339-345, (2007)
[8] Chou, K.C.; Shen, H.B., Protident: a web server for identifying proteases and their types by fusing functional domain and sequential evolution information, Biochem. biophys. res. commun., 376, 321-325, (2008)
[9] Chou, K.C.; Shen, H.B., Recent advances in developing web-servers for predicting protein attributes, Nat. sci., 1, 63-92, (2009)
[10] Chou, K.C.; Wu, Z.C.; Xiao, X., Iloc-euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins, Plos one, 6, (2011)
[11] Chou, K.C.; Liu, W.M.; Maggiora, G.M.; Zhang, C.T., Prediction and classification of domain structural classes, Proteins, 31, 97-103, (1998)
[12] Ding, Y.S.; Zhang, T.L.; Gu, Q.; Zhao, P.Y.; Chou, K.C., Using maximum entropy model to predict protein secondary structure with single sequence, Protein pept. lett., 16, 552-560, (2009)
[13] Domazet-Loso, M.; Haubold, B., Alignment-free detection of local similarity among viral and bacterial genomes, Bioinformatics, 27, 1466-1472, (2011)
[14] Du, P.; Cao, S.; Li, Y., Subchlo: predicting protein subchloroplast locations with pseudo-amino acid composition and the evidence-theoretic K-nearest neighbor (ET-KNN) algorithm, J. theor. biol., 261, 330-335, (2009)
[15] Du, P.; He, T.; Li, Y., Prediction of C-to-U RNA editing sites in higher plant mitochondria using only nucleotide sequence features, Biochem. biophys. res. commun., 358, 336-341, (2007)
[16] Du, P.; Jia, L.; Li, Y., CURE-chloroplast: a chloroplast C-to-U RNA editing predictor for seed plants, BMC bioinformatics, 10, 135, (2009)
[17] Du, P.; Li, T.; Wang, X., Recent progress in predicting protein sub-subcellular locations, Expert rev. proteomics, 8, 391-404, (2011)
[18] Du, Q.S.; Huang, R.B.; Chou, K.C., Advances in visual representation of molecular potentials, Expert opin. drug discovery, 5, 513-527, (2010)
[19] Garcia, S.P.; Pinho, A.J.; Rodrigues, J.; Bastos, C.A.C.; Ferreira, P., Minimal absent words in prokaryotic and eukaryotic genomes, Plos one, 6, (2011)
[20] Guyon, F.; Brochier-Armanet, C.; Guenoche, A., Comparison of alignment free string distances for complete genome phylogeny, Adv. data anal. classification, 3, 95-108, (2009) · Zbl 1284.92073
[21] Haubold, B.; Reed, F.A.; Pfaffelhuber, P., Alignment-free estimation of nucleotide diversity, Bioinformatics, 27, 449-455, (2011)
[22] Haubold, B.; Pfaffelhuber, P.; Domazet-Loso, M.; Wiehe, T., Estimating mutation distances from unaligned genomes, J. comput. biol., 16, 1487-1500, (2009)
[23] He, P.A.; Zhang, Y.P.; Yao, Y.H.; Tang, Y.F.; Nan, X.Y., The graphical representation of protein sequences based on the physicochemical properties and its applications, J. comput. chem., 31, 2136-2142, (2010)
[24] Huang, T.; Shi, X.H.; Wang, P.; He, Z.S.; Feng, K.Y.; Hu, L.L.; Kong, X.Y.; Li, Y.X.; Cai, Y.D.; Chou, K.C., Analysis and prediction of the metabolic stability of proteins based on their sequential features, subcellular locations and interaction networks, Plos one, 5, (2010)
[25] Jones, D.T., Protein secondary structure prediction based on position-specific scoring matrices, J. mol. biol., 292, 195-202, (1999)
[26] Jun, S.R.; Sims, G.E.; Wu, G.H.A.; Kim, S.H., Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution, Proc. natl. acad. sci. USA, 107, 133-138, (2010)
[27] Kantorovitz, M.R.; Robinson, G.E.; Sinha, S., A statistical method for alignment-free comparison of regulatory sequences, Bioinformatics, 23, I249-I255, (2007)
[28] Li, X.; Liao, B.; Shu, Y.; Zeng, Q.; Luo, J., Protein functional class prediction using global encoding of amino acid sequence, J. theor. biol., 261, 290-293, (2009)
[29] Li, M.; Badger, J.H.; Chen, X.; Kwong, S.; Kearney, P.; Zhang, H.Y., An information-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17, 149-154, (2001)
[30] Liu, Z.; Liao, B.; Zhu, W.; Huang, G., A 2D graphical representation of DNA sequence based on dual nucleotides and its application, Int. J. quantum chem., 109, 948-958, (2009)
[31] Liao, B.; Shan, X.Z.; Zhu, W.; Li, R.F., Phylogenetic tree construction based on 2D graphical representation, Chem. phys. lett., 422, 282-288, (2006)
[32] Liao, B.; Wang, T.M., 3-D graphical representation of DNA sequences and their numerical characterization, Theochem—J. mol. struct., 681, 209-212, (2004)
[33] Liao, B.; Liao, B.Y.; Sun, X.M.; Zeng, Q.G., A novel method for similarity analysis and protein sub-cellular localization prediction, Bioinformatics, 26, 2678-2683, (2010)
[34] Otu, H.H.; Sayood, K., A new sequence distance measure for phylogenetic tree construction, Bioinformatics, 19, 2122-2130, (2003)
[35] Pham, T.D.; Zuegg, J., A probabilistic measure for alignment-free sequence comparison, Bioinformatics, 20, 3455-3461, (2004)
[36] Randic, M.; Vracko, M.; Lers, N.; Plavsic, D., Novel 2-D graphical representation of DNA sequences and their numerical characterization, Chem. phys. lett., 368, 1-6, (2003)
[37] Randic, M.; Zupan, J.; Balaban, A.T.; Vikic-Topic, D.; Plavsic, D., Graphical representation of proteins, Chem. rev., 111, 790-862, (2011)
[38] Reinert, G.; Chew, D.; Sun, F.Z.; Waterman, M.S., Alignment-free sequence comparison (I): statistics and power, J. comput. biol., 16, 1615-1634, (2009)
[39] Shen, H.B.; Chou, K.C., Signal-3L: a 3-layer approach for predicting signal peptides, Biochem. biophys. res. commun., 363, 297-303, (2007)
[40] Shen, H.B.; Chou, K.C., HIVcleave: a web-server for predicting human immunodeficiency virus protease cleavage sites in proteins, Anal. biochem., 375, 388-390, (2008)
[41] Shen, H.B.; Chou, K.C., Identification of proteases and their types, Anal. biochem., 385, 153-160, (2009)
[42] Shen, H.B.; Chou, K.C., Predicting protein fold pattern with functional domain and sequential evolution information, J. theor. biol., 256, 441-446, (2009)
[43] Shen, H.B.; Yi, D.L.; Yao, L.X.; Yang, J.; Chou, K.C., Knowledge-based computational intelligence development for predicting protein secondary structures from sequences, Expert rev. proteomics, 5, 653-662, (2008)
[44] Sims, G.E.; Kim, S.H., Whole-genome phylogeny of Escherichia coli/shigella group by feature frequency profiles (FFPs), Proc. natl. acad. sci. USA, 108, 8329-8334, (2011)
[45] Sims, G.E.; Jun, S.R.; Wu, G.A.; Kim, S.H., Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions, Proc. natl. acad. sci. USA, 106, 2677-2682, (2009)
[46] Ulitsky, I.; Burstein, D.; Tuller, T.; Chor, B., The average common substring approach to phylogenomic reconstruction, J. comput. biol., 13, 336-350, (2006)
[47] Vinga, S.; Almeida, J., Alignment-free sequence comparison—a review, Bioinformatics, 19, 513-523, (2003)
[48] Wan, L.; Reinert, G.; Sun, F.Z.; Waterman, M.S., Alignment-free sequence comparison (II): theoretical power of comparison statistics, J. comput. biol., 17, 1467-1490, (2010)
[49] Wang, J.F.; Wei, D.Q.; Chou, K.C., Insights from investigating the interactions of adamantane-based drugs with the M2 proton channel from the H1N1 swine virus, Biochem. biophys. res. commun., 388, 413-417, (2009)
[50] Wang, T.; Yang, J.; Shen, H.B.; Chou, K.C., Predicting membrane protein types by the LLDA algorithm, Protein pept. lett., 15, 915-921, (2008)
[51] Wu, T.J.; Hsieh, Y.C.; Li, L.A., Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition, Biometrics, 57, 441-448, (2001) · Zbl 1209.62339
[52] Xiao, X.; Wang, P.; Chou, K.C., Predicting the quaternary structure attribute of a protein by hybridizing functional domain composition and pseudo amino acid composition, J. appl. crystallogr., 42, 169-173, (2009)
[53] Xiao, X.; Wu, Z.C.; Chou, K.C., A multi-label classifier for predicting the subcellular localization of Gram-negative bacterial proteins with both single and multiple sites, Plos one, 6, (2011)
[54] Xiao, X.; Wang, P.; Chou, K.C., GPCR-2L: predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions, Mol. biosyst., 7, 911-919, (2011)
[55] Yao, Y.H.; Nan, X.Y.; Wang, T.M., A class of 2D graphical representations of RNA secondary structures and the analysis of similarity based on them, J. comput. chem., 26, 1339-1346, (2005)
[56] Yao, Y.H.; Dai, Q.; Li, L.; Nan, X.Y.; He, P.A.; Zhang, Y.Z., Similarity/dissimilarity studies of protein sequences based on a new 2D graphical representation, J. comput. chem., 31, 1045-1052, (2010)
[57] Zhu, W.; Liao, B.; Li, R., A method for constructing phylogenetic tree based on a dissimilarity matrix, Match—commun. math. comput. chem., 63, 483-492, (2010)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.