×

Linear-time superbubble identification algorithm for genome assembly. (English) Zbl 1331.92091

Summary: DNA sequencing is the process of determining the exact order of the nucleotide bases of an individual’s genome in order to catalogue sequence variation and understand its biological implications. Whole-genome sequencing techniques produce masses of data in the form of short sequences known as reads. Assembling these reads into a whole genome constitutes a major algorithmic challenge. Most assembly algorithms utilise de Bruijn graphs constructed from reads for this purpose. A critical step of these algorithms is to detect typical motif structures in the graph caused by sequencing errors and genome repeats, and filter them out; one such complex subgraph class is a so-called superbubble. In this paper, we propose an \(\mathcal{O}(n + m)\)-time algorithm to detect all superbubbles in a directed acyclic graph with \(n\) vertices and \(m\) (directed) edges, improving the best-known \(\mathcal{O}(m \log m)\)-time algorithm by Sung et al.

MSC:

92D10 Genetics and epigenetics
05C85 Graph algorithms (graph-theoretic aspects)
68Q25 Analysis of algorithms and problem complexity

Software:

ALLPATHS; Velvet
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] Lander, E. S.; Linton, L. M.; Birren, B.; Nusbaum, C.; Zody, M. C.; Baldwin, J.; Devon, K.; Dewar, K.; Doyle, M.; FitzHugh, W., Initial sequencing and analysis of the human genome, Nature, 409, 6822, 860-921 (2001)
[2] Venter, J. C.; Adams, M. D.; Myers, E. W.; Li, P. W.; Mural, R. J.; Sutton, G. G.; Smith, H. O.; Yandell, M.; Evans, C. A.; Holt, R. A., The sequence of the human genome, Science, 291, 5507, 1304-1351 (2001)
[4] Batzoglou, S., Algorithmic challenges in mammalian genome sequence assembly, (Encyclopedia of Genomics, Proteomics and Bioinformatics (2005), John Wiley and Sons: John Wiley and Sons Hoboken, New Jersey)
[5] Butler, J.; MacCallum, I.; Kleber, M.; Shlyakhter, I. A.; Belmonte, M. K.; Lander, E. S.; Nusbaum, C.; Jaffe, D. B., ALLPATHS: de novo assembly of whole-genome shotgun microreads, Genome Res., 18, 5, 810-820 (2008)
[6] Pevzner, P. A.; Tang, H.; Waterman, M. S., An Eulerian path approach to DNA fragment assembly, Proc. Natl. Acad. Sci. USA, 98, 17, 9748-9753 (2001) · Zbl 0993.92018
[7] de Bruijn, N. G., A combinatorial problem, K. Ned. Akad. Wet., 49, 758-764 (1946) · Zbl 0060.02701
[8] Zerbino, D. R.; Birney, E., Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Res., 18, 5, 821-829 (2008)
[9] Onodera, T.; Sadakane, K.; Shibuya, T., Detecting superbubbles in assembly graphs, (WABI (2013)), 338-348
[10] Sung, W.; Sadakane, K.; Shibuya, T.; Belorkar, A.; Pyrogova, I., An O(m log m)-time algorithm for detecting superbubbles, IEEE/ACM Trans. Comput. Biology Bioinform., 12, 4, 770-777 (2015)
[11] Thomas, R. L.R.; Cormen, H.; Leiserson, Charles E.; Stein, C., Introduction to Algorithms (2001), MIT Press: MIT Press Cambridge, MA · Zbl 1047.68161
[12] Tarjan, R., Edge-disjoint spanning trees and depth-first search, Acta Inform., 6, 2, 171-185 (1976) · Zbl 0307.05104
[13] Harel, D.; Tarjan, R., Fast algorithms for finding nearest common ancestors, SIAM J. Comput., 13, 2, 338-355 (1984) · Zbl 0535.68022
[14] Fischer, J.; Heun, V., Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE, (Lewenstein, M.; Valiente, G., Combinatorial Pattern Matching, Proceedings of the 17th Annual Symposium. Combinatorial Pattern Matching, Proceedings of the 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006. Combinatorial Pattern Matching, Proceedings of the 17th Annual Symposium. Combinatorial Pattern Matching, Proceedings of the 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Lecture Notes in Computer Science, vol. 4009 (2006), Springer), 36-48 · Zbl 1196.68068
[15] Durocher, S., A simple linear-space data structure for constant-time range minimum query, (Brodnik, A.; López-Ortiz, A.; Raman, V.; Viola, A., Space-Efficient Data Structures, Streams, and Algorithms. Space-Efficient Data Structures, Streams, and Algorithms, Lecture Notes in Computer Science, vol. 8066 (2013), Springer Berlin Heidelberg), 48-60 · Zbl 1394.68093
[16] Nurk, S.; Bankevich, A.; Antipov, D.; Gurevich, A. A.; Korobeynikov, A.; Lapidus, A.; Prjibelski, A. D.; Pyshkin, A.; Sirotkin, A.; Sirotkin, Y.; Stepanauskas, R.; Clingenpeel, S. R.; Woyke, T.; McLean, J. S.; Lasken, R.; Tesler, G.; Alekseyev, M. A.; Pevzner, P. A., Assembling single-cell genomes and mini-metagenomes from chimeric MDA products, J. Comput. Biol., 20, 10, 714-737 (2013)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.