Spectral concepts in genome informational analysis. (English) Zbl 07424423

Summary: The concept of \(k\)-spectrum for genomes is here investigated as a basic tool to analyze genomes. Related spectral notions based on \(k\)-mers are introduced with some related mathematical properties which are relevant for informational analysis of genomes. Procedures to generate spectral segmentations of genomes are provided and are tested (under several values of length \(k\) for \(k\)-mers) on cases of real genomes, such as some human chromosomes and Saccharomyces cerevisiae.


68Qxx Theory of computing


Full Text: DOI arXiv


[1] Abouelhoda, M. I.; Kurtz, S.; Ohlebusch, E., Replacing suffix trees with enhanced suffix arrays, J. Discret. Algorithms, 2, 53-86 (2004) · Zbl 1115.92303
[2] Acharya, J.; Das, H.; Milenkovic, O.; Orlitsky, A.; Pan, S., On reconstructing a string from its substring compositions, (IEEE International Symposium on Information Theory (2010)), 1238-1242
[3] Acharya, J.; Das, H.; Milenkovic, O.; Orlitsky, A.; Pan, S., String reconstruction from substring compositions, SIAM J. Discrete Math., 29, 3, 1340-1371 (2015) · Zbl 1320.68228
[4] Berstel, J.; Karhumaki, J., Combinatorics on words - a tutorial (2003) · Zbl 1169.68560
[5] Bonnici, V.; Manca, V., Infogenomics tools: a computational suite for informational analysis of genomes, J. Bioinform. Proteomics Rev., 1, 8-14 (2015)
[6] Bonnici, V.; Manca, V., Recurrence distance distributions in computational genomics, Am. J. Bioinform. Comput. Biol., 3, 5-23 (2015)
[7] Bonnici, V.; Manca, V., Informational laws of genome structures, Sci. Rep., 6, Article 28840 pp. (2016)
[8] Bonnici, V.; Manca, V., An informational test for random finite strings, Entropy, 20, 12, 934 (2018)
[9] Brijder, R.; Daley, M.; Harju, T.; Jonoska, N.; Petre, I.; Rozenberg, G., The computational nature of gene assembly in ciliates, (Rozenberg, G.; Bäck, T.; Kok, J. N., Handbook of Natural Computing, vol. 3 (2012), Springer), 1233-1281
[10] Burns, J.; Kukushkin, D.; Lindblad, K.; Chen, X.; Jonoska, N.; Landweber, L. F., A database of ciliate genome rearrangements, Nucleic Acids Res., 44, 703-709 (2016)
[11] Castellini, A.; Franco, G.; Manca, V., A dictionary based informational genome analysis, BMC Genomics, 13, 485 (2012)
[12] Castellini, A.; Franco, G.; Milanese, A., A genome analysis based on repeat sharing gene networks, Nat. Comput., 14, 3, 403-420 (2015)
[13] Compeau, P. E.C.; Pevzner, P. A.; Glenn Tesler, G., Why are de Bruijn graphs useful for genome assembly?, Nat. Biotechnol., 29, 11, 987-991 (2011)
[14] Dudik, M.; Schulman, L., Reconstruction from subsequences, J. Comb. Theory, 103, 2, 337-348 (2003) · Zbl 1039.68098
[15] Franco, G., Perspectives in computational genome analysis, (Jonoska, N.; Saito, M., Discrete and Topological Models in Molecular Biology. Discrete and Topological Models in Molecular Biology, Natural Computing Series (2014), Springer: Springer Berlin, Heidelberg)
[16] Franco, G.; Jonoska, N., Forbidding - enforcing conditions in DNA self-assembly of graphs, (Chen, J.; Jonoska, N.; Rozenberg, G., Nanotechnology, Science and Computation (2006), Springer), 105-118
[17] Franco, G.; Milanese, A., An investigation on genomic repeats, (The Nature of Computation. Logic, Algorithms, Applications. The Nature of Computation. Logic, Algorithms, Applications, Proceedings of 9th Conference on Computability in Europe, CiE 2013, Milano, July 1-5, 2013. The Nature of Computation. Logic, Algorithms, Applications. The Nature of Computation. Logic, Algorithms, Applications, Proceedings of 9th Conference on Computability in Europe, CiE 2013, Milano, July 1-5, 2013, LNCS, vol. 7921 (2013), Springer Verlag: Springer Verlag Germany), 149-160
[18] Franco, G.; Manca, V., Decoding genomic information, (Stepney, S.; Rasmussen, S.; Amos, M., Computational Matter. Computational Matter, Natural Computing Series (2018), Springer: Springer Cham) · Zbl 1444.92066
[19] Gabrys, R.; Milenkovic, O., Unique reconstruction of coded strings from multiset substring spectra, IEEE Trans. Inf. Theory, 65, 12, 7682-7696 (2019), IEEE Press · Zbl 1433.68630
[20] Holden, N.; Pemantle, R.; Peres, Y., Subpolynomial trace reconstruction for random strings and arbitrary deletion probability, (Proceedings of the 31st Conference on Learning Theory. Proceedings of the 31st Conference on Learning Theory, PMLR, vol. 75 (2018)), 1799-1840
[21] Kiah, H. M.; Puleo, G. J.; Milenkovic, O., Codes for DNA sequence profiles, IEEE Trans. Inf. Theory, 62, 6, 3125-3146 (2016) · Zbl 1359.94706
[22] Levenshtein, V., Efficient reconstruction of sequences from their subsequences or supersequences, J. Comb. Theory, Ser. A, 93, 2, 310-332 (2001) · Zbl 0992.68155
[23] Lan, M.; Tan, C. L.; Su, J.; Lu, Y., Supervised and traditional term weighting methods for automatic text categorization, IEEE Trans. Pattern Anal. Mach. Intell., 31, 4, 721-735 (2009)
[24] Manca, V., Infobiotics. Information in Biotic Systems (2013), Springer · Zbl 1278.68017
[25] Manca, V., Infogenomics, genomes as information sources, ((2016), Elsevier, Morgan Kauffman), 317-324, Chapter 21
[26] Manca, V., Information theory in genome analysis, (Membrane Computing. Membrane Computing, LNCS, vol. 9504 (2016), Springer), 3-18 · Zbl 1475.92113
[27] Manca, V., The principles of informational genomics, Theor. Comput. Sci., 701, 190-202 (2017) · Zbl 1383.92027
[28] Manca, V., A marvelous accident: the birth of life, J. Proteomics Bioinform., 11, 135-137 (2018)
[29] Manuch, J., Characterization of a word by its subwords, (Proc. of DLT (2000), World Scientific), 210-219 · Zbl 1013.68151
[30] Margaritis, D.; Skiena, S., Reconstructing strings from substrings in rounds, foundations of computer science, (Proceedings of IEEE 36th Annual Foundations of Computer Science. Proceedings of IEEE 36th Annual Foundations of Computer Science, Milwaukee, WI, USA (1995)), 613-620 · Zbl 0938.68601
[31] Marcovich, S.; Yaakobi, E., Reconstruction of strings from their substrings spectrum, (2020 IEEE International Symposium on Information Theory (ISIT). 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA (2020)), 658-663
[32] Motahari, A. S.; Bresler, G.; Tse, D., Information theory of DNA shotgun sequencing, IEEE Trans. Inf. Theory, 59, 10, 6273-6289 (2013) · Zbl 1364.94215
[33] Scott, A., Reconstructing sequences, Discrete Math., 175, 1-3, 231-238 (1997) · Zbl 0890.05003
[34] Shannon, C. E., A mathematical theory of communication, Bell Syst. Tech. J., 27, 623-656 (1948) · Zbl 1154.94303
[35] Sievers, A.; Wenz, F.; Hausmann, M.; Hildenbrand, G., Conservation of k-mer composition and correlation contribution between introns and intergenic regions of animalia genomes, Genes, 9, 10, 482 (2018)
[36] Simões, R.; Wolf, I.; Correa, B.; Valente, G., Uncovering patterns of the evolution of genomic sequence entropy and complexity, Mol. Gen. Genet. (October 2020)
[37] Yang, Z.; Li, H.; Jia, Y., Intrinsic laws of k-mer spectra of genome sequences and evolution mechanism of genomes, BMC Evol. Biol., 20, 157 (2020)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.