×

zbMATH — the first resource for mathematics

Graph algorithms for DNA sequencing – origins, current models and the future. (English) Zbl 1375.92041
Summary: With the ubiquitous presence of next-generation sequencing in modern biological, genetic, pharmaceutical and medical research, not everyone pays attention to the underlying computational methods. Even fewer researchers know what were the origins of the current models for DNA assembly. We present original graph models used in DNA sequencing by hybridization, discuss their properties and connections between them. We also explain how these graph models evolved to adapt to the characteristics of next-generation sequencing. Moreover, we present a practical comparison of state-of-the-art DNA de novo assembly tools representing these transformed models, i.e. overlap and decomposition-based graphs. Even though the competition is tough, some assemblers perform better and certainly large differences may be observed in hardware resources utilization. Finally, we outline the most important trends in the sequencing field, and try to predict their impact on the computational models in the future.
MSC:
92D20 Protein sequences, DNA sequences
92C40 Biochemistry, molecular biology
05C85 Graph algorithms (graph-theoretic aspects)
68R10 Graph theory (including graph drawing) in computer science
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Albertin, C.; Simakov, O.; Mitros, T.; Wang, Z.; Pungor, J.; Edsinger-Gonzales, E., The octopus genome and the evolution of cephalopod neural and morphological novelties, Nature, 7564, 524, 220-224, (2015)
[2] Apollonio, N.; Franciosa, P., A characterization of partial directed line graphs, Discrete Mathematics, 307, 2598-2614, (2007) · Zbl 1125.05087
[3] Bang-Jensen, J.; Gutin, G., Digraphs, theory, algorithms and applications, (2007), Springer-Verlag, Berlin
[4] Berge, C., Graphs and hypergraphs, (1973), North-Holland Publishing Company, London · Zbl 0483.05029
[5] Blazewicz, J.; Burke, E.; Kendall, G.; Mruczkiewicz, W.; Oguz, C.; Swiercz, A., A hyper-heuristic approach to sequencing by hybridization of DNA sequences, Annals of Operations Research, 207, 27-41, (2013) · Zbl 1272.90117
[6] Blazewicz, J.; Formanowicz, P.; Kasprzak, M.; Kobler, D., On the recognition of de Bruijn graphs and their induced subgraphs, Discrete Mathematics, 245, 81-92, (2002) · Zbl 0996.05061
[7] Blazewicz, J.; Formanowicz, P.; Kasprzak, M.; Markiewicz, W., Sequencing by hybridization with isothermic oligonucleotide libraries, Discrete Applied Mathematics, 145, 40-51, (2004) · Zbl 1056.92019
[8] Blazewicz, J.; Formanowicz, P.; Kasprzak, M.; Markiewicz, W.; Weglarz, J., DNA sequencing with positive and negative errors, Journal of Computational Biology, 6, 113-123, (1999)
[9] Blazewicz, J.; Frohmberg, W.; Gawron, P.; Kasprzak, M.; Kierzynka, M.; Swiercz, A., DNA sequence assembly involving an acyclic graph model, Foundations of Computing and Decision Sciences, 38, 25-34, (2013) · Zbl 1328.92052
[10] Blazewicz, J.; Frohmberg, W.; Kierzynka, M.; Pesch, E.; Wojciechowski, P., Protein alignment algorithms with an efficient backtracking routine on multiple gpus, BMC Bioinformatics, 12, 1, 1-17, (2011)
[11] Blazewicz, J.; Frohmberg, W.; Kierzynka, M.; Wojciechowski, P., G-MSA - A GPU-based, fast and accurate algorithm for multiple sequence alignment, Journal of Parallel and Distributed Computing, 73, 32-41, (2013)
[12] Blazewicz, J.; Hertz, A.; Kobler, D.; de Werra, D., On some properties of DNA graphs, Discrete Applied Mathematics, 98, 1-19, (1999) · Zbl 0939.05078
[13] Blazewicz, J.; Kasprzak, M., Complexity of DNA sequencing by hybridization, Theoretical Computer Science, 290, 1459-1473, (2003) · Zbl 1044.68065
[14] Blazewicz, J.; Kasprzak, M., Computational complexity of isothermic DNA sequencing by hybridization, Discrete Applied Mathematics, 154, 718-729, (2006) · Zbl 1092.68044
[15] Blazewicz, J.; Kasprzak, M., Complexity issues in computational biology, Fundamenta Informaticae, 118, 385-401, (2012) · Zbl 1244.92001
[16] Blazewicz, J.; Kasprzak, M., Reduced-by-matching graphs: toward simplifying Hamiltonian circuit problem, Fundamenta Informaticae, 118, 225-244, (2012) · Zbl 1244.05173
[17] Blazewicz, J.; Kasprzak, M.; Leroy-Beaulieu, B.; de Werra, D., Finding Hamiltonian circuits in quasi-adjoint graphs, Discrete Applied Mathematics, 156, 2573-2580, (2008) · Zbl 1165.05012
[18] Blazewicz, J.; Oguz, C.; Swiercz, A.; Weglarz, J., DNA sequencing by hybridization via genetic search, Operations Research, 54, 1185-1192, (2006) · Zbl 1167.68468
[19] Blazewicz, M.; Hinder, I.; Koppelman, D.; Brandt, S.; Ciznicki, M.; Kierzynka, M., From physics model to results: an optimizing framework for cross-architecture code generation, Scientific Programming, 21, 1-16, (2013)
[20] Blum, C.; Lozano, J.; Davidson, P., Mathematical programming strategies for solving the minimum common string partition problem, European Journal of Operational Research, 242, 769-777, (2015) · Zbl 1341.90107
[21] de Bruijn, N., A combinatorial problem, Proceedings of the Koninklijke Nederlandse Akademie van Wetenschappen, 49, 758-764, (1946) · Zbl 0060.02701
[22] Butler, J.; MacCallum, I.; Kleber, M.; Shlyakhter, I.; Belmonte, M.; Lander, E., ALLPATHS: de novo assembly of whole-genome shotgun microreads, Genome Research, 18, 810-820, (2008)
[23] Błżewicz, J.; Formanowicz, P.; Guinand, F.; Kasprzak, M., A heuristic managing errors for DNA sequencing, Bioinformatics, 18, 652-660, (2002)
[24] Ciznicki, M.; Kierzynka, M.; Kurowski, K.; Ludwiczak, B.; Napierala, K.; Palczynski, J., Efficient isosurface extraction using marching tetrahedra and histogram pyramids on multiple gpus, Lecture Notes in Computer Science, 7204, 343-352, (2012)
[25] Ferragina, P.; Manzini, G., Opportunistic data structures with applications, Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 390-398, (2000)
[26] Frohmberg, W.; Kierzynka, M.; Blazewicz, J.; Gawron, P.; Wojciechowski, P., G-DNA - a highly efficient multi-GPU/MPI tool for aligning nucleotide reads, Bulletin of the Polish Academy of Sciences: Technical Sciences, 61, 989-992, (2013)
[27] Frohmberg, W.; Kierzynka, M.; Blazewicz, J.; Wojciechowski, P., G-PAS 2.0 - an improved version of protein alignment tool with an efficient backtracking routine on multiple gpus, Bulletin of the Polish Academy of Sciences: Technical Sciences, 60, 491-494, (2012)
[28] Garey, M.; Johnson, D., Computers and intractability: A guide to the theory of NP-completeness, A series of books in the mathematical sciences, (1979), W.H. Freeman San Francisco
[29] Hao, J., The adjoints of DNA graphs, Journal of Mathematical Chemistry, 37, 333-346, (2005) · Zbl 1083.92013
[30] Kajitani, R.; Toshimoto, K.; Noguchi, H.; Toyoda, A.; Ogura, Y.; Okuno, M., Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads, Genome Research, 24, 1384-1395, (2014)
[31] Kapun, E.; Tsarev, F., De Bruijn superwalk with multiplicities problem is NP-hard, BMC Bioinformatics, 14, 5, 1-4, (2013)
[32] Kierzynka, M.; Kosmann, L.; vor dem Berge, M.; Krupop, S.; Hagemeyer, J.; Griessl, R., Energy efficiency of sequence alignment tools - software and hardware perspectives, Future Generation Computer Systems, (2016)
[33] Koren, S.; Phillippy, A., One chromosome, one contig: complete microbial genomes from long-Read sequencing and assembly, Current Opinion in Microbiology, 23, 110-120, (2015)
[34] Langmead, B.; Salzberg, S., Fast gapped-Read alignment with bowtie 2, Nature Methods, 9, 357-359, (2012)
[35] Li, R.; Zhu, H.; Ruan, J.; Qian, W.; Fang, X.; Shi, Z., De novo assembly of human genomes with massively parallel short Read sequencing, Genome Research, 20, 265-272, (2010)
[36] Li, X.; Zhang, H., Characterizations for some types of DNA graphs, Journal of Mathematical Chemistry, 42, 65-79, (2007) · Zbl 1119.92074
[37] Li, X.; Zhang, H., Embedding on alphabet overlap digraphs, Journal of Mathematical Chemistry, 47, 62-71, (2010) · Zbl 1379.92076
[38] Liu, Y.; Wirawan, A.; Schmidt, B., CUDASW++ 3.0: accelerating Smith-waterman protein database search by coupling CPU and GPU SIMD instructions, BMC Bioinformatics, 14, 1, 1-10, (2013)
[39] Luo, R.; Liu, B.; Xie, Y.; Li, Z.; Huang, W.; Yuan, J., Soapdenovo2: an empirically improved memory-efficient short-Read de novo assembler, Giga Science, 18, 1, 1-6, (2012)
[40] Lysov, Y.; Florent’ev, V.; Khorlin, A.; Khrapko, K.; Shik, V.; Mirzabekov, A., Determination of the nucleotide sequence of DNA using hybridization with oligonucleotides. A new method, Doklady Akademii Nauk SSSR, 303, 1508-1511, (1988)
[41] Mahmoudi, S.; Kierzynka, M.; Manneback, P.; Kurowski, K., Real-time motion tracking using optical flow on multiple gpus, Bulletin of the Polish Academy of Sciences: Technical Sciences, 62, 139-150, (2014)
[42] Margulies, M.; Egholm, M.; Altman, W., Genome sequencing in open microfabricated high density picoliter reactors, Nature, 7057, 437, 376-380, (2005)
[43] Medvedev, P.; Georgiou, K.; Myers, G.; Brudno, M., Computability of models for sequence assembly, Lecture Notes in Computer Science, 4645, 289-301, (2007)
[44] Medvedev, P.; Pham, S.; Chaisson, M.; Tesler, G.; Pevzner, P., Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers, Lecture Notes in Computer Science, 6577, 238-251, (2011)
[45] Myers, E.; Sutton, G.; Delcher, A., A whole-genome assembly of drosophila, Science, 287, 5461, 2196-2204, (2000)
[46] Pendavingh, R.; Schuurman, P.; Woeginger, G., Recognizing DNA graphs is difficult, Discrete Applied Mathematics, 127, 85-94, (2003) · Zbl 1037.92018
[47] Pevzner, P., L-tuple DNA sequencing: computer analysis, Journal of Biomolecular Structure and Dynamics, 7, 63-73, (1989)
[48] Pevzner, P.; Tang, H.; Waterman, M., An Eulerian path approach to DNA fragment assembly, Proceedings of the National Academy of Sciences, 98, 9748-9753, (2001) · Zbl 0993.92018
[49] Phan, V.; Skiena, S., Dealing with errors in interactive sequencing by hybridization, Bioinformatics, 17, 862-870, (2001)
[50] Preparata, F.; Oliver, J., DNA sequencing by hybridization using semi-degenerate bases, Journal of Computational Biology, 11, 753-765, (2004)
[51] Preparata, F.; Upfal, E., Sequencing-by-hybridization at the information-theory bound: an optimal algorithm, Journal of Computational Biology, 7, 621-630, (2000)
[52] Rognes, T., Faster Smith-waterman database searches with inter-sequence SIMD parallelisation, BMC Bioinformatics, 12, 1, 1-11, (2011)
[53] Ruiqiang, L.; Wei, F.; Geng, T., The sequence and de novo assembly of the giant panda genome, Nature, 7279, 463, 311-317, (2010)
[54] Salzberg, S.; Phillippy, A.; Zimin, A., GAGE: A critical evaluation of genome assemblies and assembly algorithms, Genome Research, 22, 557-567, (2012)
[55] Sanger, F.; Nicklen, S.; Coulson, A., DNA sequencing with chain-terminating inhibitors, Proceedings of the National Academy of Sciences, U.S.A., 74, 5463-5467, (1977)
[56] Simpson, J.; Durbin, R., Efficient de novo assembly of large genomes using compressed data structures, Genome Research, 22, 549-556, (2012)
[57] Southern, E., Analyzing polynucleotide sequences, International Patent Application, (1988)
[58] Vinga, S.; Almeida, J., Alignment-free sequence comparison - a review, Bioinformatics, 19, 4, 513-523, (2003)
[59] Zerbino, D.; Birney, E., Velvet: algorithms for de novo short Read assembly using de Bruijn graphs, Genome Research, 18, 821-829, (2008)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.