Decoding genomic information. (English) Zbl 1444.92066

Stepney, Susan (ed.) et al., Computational matter. Cham: Springer. Nat. Comput. Ser., 129-149 (2018).
Summary: Genomes carry the main information that generates the life of organisms and drives their evolution. In nature they work as a marvellous operating system of molecular rules (reading, writing and signal transmission), orchestrating all cell functions and the transmission of information to daughter cells. As long polymers of nucleotides, they may be seen as a special book whose sequence records all the developments it has passed through during evolution. Fragments that were mutated, duplicated, assembled or silenced are still present in the genomic sequence to some extent, and together they form genomic dictionaries.
Here we outline some research trends that analyse and interpret (i.e., decode) genomic information, treating the genome as a book encrypted in an unknown language that has yet to be deciphered, even though it directly affects the structure and interaction of all cellular and multicellular components. We focus on an informational analysis of real genomes, which may be framed within a new trend of computational genomics lying across bioinformatics and natural computing. This analysis is performed by alignment-free sequence methods based on information-theoretic concepts, in order to convert genomic information into a comprehensible mathematical form and to understand its complexity.
After a brief overview of the state of the art and of approaches in the area, we present our viewpoint and results on genome-wide studies, obtained by means of mathematical distributions and dictionary-based analysis inspired by information theory, where the normalized multiplicities of genomic words are frequencies that define the discrete probability distributions of interest. The definition, computation and analysis of a few informational indexes have highlighted some properties of genomic regularity and specificity, which may provide a basis for understanding evolutionary and functional aspects of genomes.
For the entire collection see [Zbl 1443.68021].
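
As an illustration of the dictionary-based view described in the summary, the following minimal sketch turns the multiplicities of genomic words (k-mers) into a discrete probability distribution and computes its Shannon entropy as one possible informational index. The function names, the toy sequence and the choice of entropy as the index are assumptions made for this example; they are not taken from the chapter and do not reproduce the authors' indexes.

```python
from collections import Counter
from math import log2


def kmer_distribution(genome: str, k: int) -> dict[str, float]:
    """Normalized multiplicities (frequencies) of all words of length k.

    Hypothetical helper for illustration: every overlapping k-mer is
    counted, and each count is divided by the total number of k-mers.
    """
    if k < 1 or k > len(genome):
        raise ValueError("k must be between 1 and the sequence length")
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def shannon_entropy(distribution: dict[str, float]) -> float:
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * log2(p) for p in distribution.values() if p > 0)


# Toy sequence (assumption): real analyses would read a full genome.
toy_genome = "ACGTACGTAC"
dist = kmer_distribution(toy_genome, 2)
entropy = shannon_entropy(dist)
```

For k = 2 the entropy is bounded above by log2(16) = 4 bits, the value reached only when all sixteen dinucleotides are equally frequent, so comparing the computed value with this bound gives a rough measure of how far a sequence is from uniform word usage.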


92D10 Genetics and epigenetics
92-08 Computational methods for problems pertaining to biology


genome decoding



