×

zbMATH — the first resource for mathematics

Safe and complete contig assembly via omnitigs. (English) Zbl 1355.92070
Singh, Mona (ed.), Research in computational molecular biology. 20th annual conference, RECOMB 2016, Santa Monica, CA, USA, April 17–21, 2016. Proceedings. Cham: Springer (ISBN 978-3-319-31956-8/pbk; 978-3-319-31957-5/ebook). Lecture Notes in Computer Science 9649. Lecture Notes in Bioinformatics, 152-163 (2016).
Summary: Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs – a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph \(G\) (e.g. a de Bruijn, or a string graph), what are all the strings that can be safely reported from \(G\) as contigs? In this paper we finally answer this question, and also give a polynomial time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66 % to 82 % longer on average than the popular unitigs, and 29 % of dbSNP locations have more neighbors in omnitigs than in unitigs.
For the entire collection see [Zbl 1334.92007].

MSC:
92D10 Genetics and epigenetics
92-08 Computational methods for problems pertaining to biology
05C90 Applications of graph theory
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Bankevich, A., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comp. Biol. 19(5), 455–477 (2012) · doi:10.1089/cmb.2012.0021
[2] Boetzer, M., et al.: Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27(4), 578–579 (2011) · Zbl 05891138 · doi:10.1093/bioinformatics/btq683
[3] Boetzer, M., Pirovano, W.: Toward almost closed genomes with gapfiller. Genome Biol. 13(6), 1–9 (2012) · doi:10.1186/gb-2012-13-6-r56
[4] Bresler, G., et al.: Optimal assembly for high throughput shotgun sequencing. BMC Bioinform. 14(Suppl 5), S18 (2013)
[5] Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de Bruijn graphs. In: Sharan, R. (ed.) RECOMB 2014. LNCS, vol. 8394, pp. 35–55. Springer, Heidelberg (2014) · Zbl 06294050 · doi:10.1007/978-3-319-05269-4_4
[6] Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a bloom filter. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 236–248. Springer, Heidelberg (2012) · Zbl 06158299 · doi:10.1007/978-3-642-33122-0_19
[7] Guénoche, A.: Can we recover a sequence, just knowing all its subsequences of given length? Comput. Appl. Biosci. 8(6), 569–574 (1992)
[8] Haussler, D., et al.: Genome 10 K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. J. Hered. 100(6), 659–674 (2008)
[9] Idury, R.M., Waterman, M.S.: A new algorithm for DNA sequence assembly. J. Comp. Biol. 2(2), 291–306 (1995) · doi:10.1089/cmb.1995.2.291
[10] Jackson, B.G.: Parallel methods for short read assembly. Ph.D. thesis, Iowa State University (2009)
[11] Kapun, E., Tsarev, F.: De Bruijn superwalk with multiplicities problem is NP-hard. BMC Bioinform. 14(Suppl 5), S7 (2013)
[12] Kapun, E., Tsarev, F.: On NP-hardness of the paired de Bruijn sound cycle problem. In: Darling, A., Stoye, J. (eds.) WABI 2013. LNCS, vol. 8126, pp. 59–69. Springer, Heidelberg (2013) · Zbl 06294230 · doi:10.1007/978-3-642-40453-5_6
[13] Kececioglu, J.D., Myers, E.W.: Combinatiorial algorithms for DNA sequence assembly. Algorithmica 13(1/2), 7–51 (1995) · Zbl 0831.92013 · doi:10.1007/BF01188580
[14] Kececioglu, J.D.: Exact and approximation algorithms for DNA sequence reconstruction. Ph.D. thesis, University of Arizona, Tucson, AZ, USA (1992)
[15] Kingsford, C., et al.: Assembly complexity of prokaryotic genomes using short reads. BMC Bioinform. 11(1), 21 (2010) · Zbl 05766148 · doi:10.1186/1471-2105-11-21
[16] Lam, K., et al.: Near-optimal assembly for shotgun sequencing with noisy reads. BMC Bioinform. 15(S–9), S4 (2014) · doi:10.1186/1471-2105-15-S9-S4
[17] Lander, E.S., Waterman, M.S.: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2(3), 231–239 (1988) · doi:10.1016/0888-7543(88)90007-9
[18] Luo, R., et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1(1), 18 (2012) · doi:10.1186/2047-217X-1-18
[19] Lysov, I., et al.: Determination of the nucleotide sequence of DNA using hybridization with oligonucleotides. a new method. Dokl Akad Nauk SSSR 303(6), 1508–1511 (1988)
[20] Medvedev, P., Brudno, M.: Maximum likelihood genome assembly. J. Comp. Biol. 16(8), 1101–1116 (2009) · doi:10.1089/cmb.2009.0047
[21] Medvedev, P., Georgiou, K., Myers, G., Brudno, M.: Computability of models for sequence assembly. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS (LNBI), vol. 4645, pp. 289–301. Springer, Heidelberg (2007) · doi:10.1007/978-3-540-74126-8_27
[22] Medvedev, P., et al.: Paired de Bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. J. Comp. Biol. 18(11), 1625–1634 (2011) · doi:10.1089/cmb.2011.0151
[23] Miller, J.R., et al.: Assembly algorithms for next-generation sequencing data. Genomics 95(6), 315–327 (2010) · doi:10.1016/j.ygeno.2010.03.001
[24] Motahari, A.S., et al.: Information theory of DNA shotgun sequencing. IEEE Trans. Inf. Theory 59(10), 6273–6289 (2013) · Zbl 1364.94215 · doi:10.1109/TIT.2013.2270273
[25] Myers, E.W.: The fragment assembly string graph. In: ECCB/JBI, p. 85 (2005)
[26] Nagarajan, N., Pop, M.: Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J. Comp. Biol. 16(7), 897–908 (2009) · doi:10.1089/cmb.2009.0005
[27] Nagarajan, N., Pop, M.: Sequence assembly demystified. Nat. Rev. Genet. 14(3), 157–167 (2013) · doi:10.1038/nrg3367
[28] Narzisi, G., Mishra, B., Schatz, M.C.: On algorithmic complexity of biomolecular sequence assembly problem. In: Dediu, A.-H., Martín-Vide, C., Truthe, B. (eds.) AlCoB 2014. LNCS, vol. 8542, pp. 183–195. Springer, Heidelberg (2014) · Zbl 06349974 · doi:10.1007/978-3-319-07953-0_15
[29] Peltola, H., et al.: Algorithms for some string matching problems arising in molecular genetics. In: IFIP Congress, 59–64 (1983)
[30] Pevzner, P.A.: L-Tuple DNA sequencing: computer analysis. J. Biomol. Struct. Dyn. 7(1), 63–73 (1989)
[31] Pevzner, P.A., et al.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. 98(17), 9748–9753 (2001) · Zbl 0993.92018 · doi:10.1073/pnas.171285098
[32] Rubinov, A.R., Gelfand, M.S.: Reconstruction of a string from substring precedence data. J. Comp. Biol. 2(2), 371–381 (1995) · doi:10.1089/cmb.1995.2.371
[33] Sahlin, K., et al.: BESST-efficient scaffolding of large fragmented assemblies. BMC Bioinform. 15(1), 281 (2014) · doi:10.1186/1471-2105-15-281
[34] Salmela, L., Sahlin, K., Mäkinen, V., Tomescu, A.I.: Gap filling as exact path length problem. In: Przytycka, T.M. (ed.) RECOMB 2015. LNCS, vol. 9029, pp. 281–292. Springer, Heidelberg (2015) · Zbl 1355.92033 · doi:10.1007/978-3-319-16706-0_29
[35] Salzberg, S.L., et al.: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012) · doi:10.1101/gr.131383.111
[36] Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010) · doi:10.1093/bioinformatics/btq217
[37] Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549–556 (2012) · doi:10.1101/gr.126953.111
[38] Tomescu, A.I., Medvedev, P.: Safe and complete contig assembly via omnitigs (2016). http://arxiv.org/abs/1601.02932 · Zbl 1355.92070
[39] Uricaru, R., et al.: Reference-free detection of isolated SNPs. Nucleic Acids Res. 43(2), e11 (2015) · doi:10.1093/nar/gku1187
[40] Vyahhi, N., Pyshkin, A., Pham, S., Pevzner, P.A.: From de Bruijn graphs to rectangle graphs for genome assembly. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 249–261. Springer, Heidelberg (2012) · Zbl 06158300 · doi:10.1007/978-3-642-33122-0_20
[41] Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008) · doi:10.1101/gr.074492.107
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.