Polevikov, Evgeny; Kolmogorov, Mikhail Synteny paths for assembly graphs comparison. (English) Zbl 1495.92040 Huber, Katharina T. (ed.) et al., 19th international workshop on algorithms in bioinformatics, WABI 2019, Niagara Falls, NY, USA, September 8–10, 2019. Proceedings. Wadern: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. LIPIcs – Leibniz Int. Proc. Inform. 143, Article 24, 14 p. (2019). Summary: Despite the recent developments of long-read sequencing technologies, it is still difficult to produce complete assemblies of eukaryotic genomes in an automated fashion. Genome assembly software typically output assembled fragments (contigs) along with assembly graphs, that encode all possible layouts of these contigs. Graph representation of the assembled genome can be useful for gene discovery, haplotyping, structural variations analysis and other applications. To facilitate the development of new graph-based approaches, it is important to develop algorithms for comparison and evaluation of assembly graphs produced by different software. In this work, we introduce synteny paths: maximal paths of homologous sequence between the compared assembly graphs. We describe Asgan – an algorithm for efficient synteny paths decomposition, and use it to evaluate assembly graphs of various bacterial assemblies produced by different approaches. We then apply Asgan to discover structural variations between the assemblies of 15 Drosophila genomes, and show that synteny paths are robust to contig fragmentation. The Asgan tool is freely available at: https://github.com/epolevikov/Asgan.For the entire collection see [Zbl 1423.68029]. MSC: 92D10 Genetics and epigenetics 92-08 Computational methods for problems pertaining to biology Keywords:assembly graphs; genome assembly; synteny blocks; comparative genomics Software:HINGE; hybridSPAdes; CYNTENATOR; i-ADHoRe; PathRacer; CAMSA; GTED; Canu; Bandage; Unicycler; AGB; Asgan; Minimap2; Sibelia; SPAdes; dipSPAdes PDF BibTeX XML Cite \textit{E. Polevikov} and \textit{M. Kolmogorov}, LIPIcs -- Leibniz Int. Proc. Inform. 143, Article 24, 14 p. (2019; Zbl 1495.92040) Full Text: DOI References: [1] Sergey S Aganezov and Max A Alekseyev. CAMSA: a tool for comparative analysis and merging of scaffold assemblies. BMC bioinformatics, 18(15):496, 2017. [2] Dmitry Antipov, Anton Korobeynikov, Jeffrey S McLean, and Pavel A Pevzner. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics, 32(7):1009-1015, 2015. 24:13 [3] Pavel Avdeyev, Shuai Jiang, Sergey Aganezov, Fei Hu, and Max A Alekseyev. Reconstruction of ancestral genomes in presence of gene gain and loss. Journal of Computational Biology, 23(3):150-164, 2016. [4] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5):455-477, 2012. [5] Ali Ebrahimpour Boroojeny, Akash Shrestha, Ali Sharifi-Zarchi, Suzanne Renick Gallagher, S Cenk Sahinalp, and Hamidreza Chitsaz. GTED: Graph traversal edit distance. In Research in Computational Molecular Biology, pages 37-53. Springer, 2018. · Zbl 1510.92087 [6] Drosophila 12 Genomes Consortium et al. Evolution of genes and genomes on the Drosophila phylogeny. Nature, 450(7167):203, 2007. [7] Jack Edmonds. Paths, trees, and flowers. Canadian Journal of mathematics, 17:449-467, 1965. · Zbl 0132.20903 [8] Cristina G Ghiurcuta and Bernard ME Moret. Evaluating synteny for improved comparative studies. Bioinformatics, 30(12):i9-i18, 2014. [9] Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek, and Gil McVean. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature genetics, 44(2):226, 2012. [10] Govinda M Kamath, Ilan Shomorony, Fei Xia, Thomas A Courtade, and N Tse David. HINGE: long-read assembly achieves optimal repeat resolution. Genome research, 27(5):747-756, 2017. [11] W James Kent, Robert Baertsch, Angie Hinrichs, Webb Miller, and David Haussler. Evolu-tion’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proceedings of the National Academy of Sciences, 100(20):11484-11489, 2003. [12] Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin, and Pavel A Pevzner. Assembly of long, error-prone reads using repeat graphs. Nature biotechnology, 37(5):540, 2019. [13] Sergey Koren, Brian P Walenz, Konstantin Berlin, Jason R Miller, Nicholas H Bergman, and Adam M Phillippy. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome research, 27(5):722-736, 2017. [14] Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10 (8), pages 707-710, 1966. [15] Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094-3100, 2018. [16] Yu Lin, Sergey Nurk, and Pavel A Pevzner. What is the difference between the breakpoint graph and the de Bruijn graph? BMC genomics, 15(6):S6, 2014. [17] Dang Liu, Martin Hunt, and Isheng J Tsai. Inferring synteny between genome assemblies: a systematic evaluation. BMC bioinformatics, 19(1):26, 2018. [18] Serghei Mangul and David Koslicki. Reference-free comparison of microbial communities via de Bruijn graphs. In Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 68-77. ACM, 2016. [19] Alla Mikheenko and Mikhail Kolmogorov. Assembly Graph Browser: interactive visualization of assembly graphs. Bioinformatics, 2019. [20] Danny E Miller, Cynthia Staber, Julia Zeitlinger, and R Scott Hawley. Highly contiguous genome assemblies of 15 Drosophila species generated using nanopore sequencing. G3: Genes, Genomes, Genetics, 8(10):3131-3141, 2018. [21] Ilya Minkin, Anand Patel, Mikhail Kolmogorov, Nikolay Vyahhi, and Son Pham. Sibelia: a scalable and comprehensive synteny block generation tool for closely related microbial genomes. In International Workshop on Algorithms in Bioinformatics, pages 215-229. Springer, 2013. [22] Supratim Mukherjee, Dimitri Stamatis, Jon Bertsch, Galina Ovchinnikova, Hema Y Katta, Alejandro Mojica, I-Min A Chen, Nikos C Kyrpides, and TBK Reddy. Genomes OnLine database (GOLD) v. 7: updates and new features. Nucleic acids research, 47(D1):D649-D659, 2018. [23] Synteny Paths for Assembly Graphs Comparison [24] Eugene W Myers, Granger G Sutton, Art L Delcher, Ian M Dew, Dan P Fasulo, Michael J Flanigan, Saul A Kravitz, Clark M Mobarry, Knut HJ Reinert, Karin A Remington, et al. A whole-genome assembly of Drosophila. Science, 287(5461):2196-2204, 2000. [25] Pavel Pevzner and Glenn Tesler. Transforming men into mice: the Nadeau-Taylor chromosomal breakage model revisited. In Proceedings of the seventh annual international conference on Research in computational molecular biology, pages 247-256. ACM, 2003. [26] Pavel A Pevzner, Haixu Tang, and Glenn Tesler. De novo repeat classification and fragment assembly. Genome research, 14(9):1786-1796, 2004. [27] Pavel A Pevzner, Haixu Tang, and Michael S Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the national academy of sciences, 98(17):9748-9753, 2001. · Zbl 0993.92018 [28] Adam M Phillippy. New advances in sequence assembly. Genome Research, 27:xi-xiii, 2017. [29] Sebastian Proost, Jan Fostier, Dieter De Witte, Bart Dhoedt, Piet Demeester, Yves Van de Peer, and Klaas Vandepoele. i-ADHoRe 3.0-fast and sensitive detection of genomic homology in extremely large data sets. Nucleic acids research, 40(2):e11-e11, 2011. [30] Christian Rödelsperger and Christoph Dieterich. CYNTENATOR: progressive gene order alignment of 17 vertebrate genomes. PloS one, 5(1):e8861, 2010. [31] Yana Safonova, Anton Bankevich, and Pavel A Pevzner. dipSPAdes: assembler for highly polymorphic diploid genomes. Journal of Computational Biology, 22(6):528-545, 2015. [32] Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new method for recon-structing phylogenetic trees. Molecular biology and evolution, 4(4):406-425, 1987. [33] Michael C Schatz, Arthur L Delcher, and Steven L Salzberg. Assembly of large genomes using second-generation sequencing. Genome research, 20(9):1165-1173, 2010. [34] Alex Shlemov and Anton Korobeynikov. PathRacer: racing profile HMM paths on assembly graph. BioRxiv, page 562579, 2019. · Zbl 1416.92118 [35] Mitchell R Vollger, Philip C Dishuck, Melanie Sorensen, AnneMarie E Welch, Vy Dang, Max L Dougherty, Tina A Graves-Lindsay, Richard K Wilson, Mark JP Chaisson, and Evan E Eichler. Long-read sequence and assembly of segmental duplications. Nature methods, 16(1):88, 2019. [36] Ryan R Wick, Louise M Judd, Claire L Gorrie, and Kathryn E Holt. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS computational biology, 13(6):e1005595, 2017. [37] Ryan R Wick, Mark B Schultz, Justin Zobel, and Kathryn E Holt. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics, 31(20):3350-3352, 2015. [38] Aleksey V Zimin, Douglas R Smith, Granger Sutton, and James A Yorke. Assembly reconcili-ation. Bioinformatics, 24(1):42-45, 2007. This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.