Weese, David; Siragusa, Enrico Full-text indexes for high-throughput sequencing. (English) Zbl 1457.68076 Elloumi, Mourad (ed.), Algorithms for next-generation sequencing data. Techniques, approaches, and applications. Cham: Springer. 41-75 (2017). Summary: Recent advances in High-Throughput Sequencing demand for novel algorithms working on efficient data structures specifically designed for the analysis of large volumes of sequence data. This chapter describes such data structures, called full-text indexes, to represent all substrings (or substrings up to a certain length) contained in a given text (or text collection).For the entire collection see [Zbl 1383.68005]. MSC: 68P05 Data structures 68W32 Algorithms on strings Software:MicroRazerS; STELLAR; PATRICIA; BWA; Samtools; Fiona; SeqAn PDF BibTeX XML Cite \textit{D. Weese} and \textit{E. Siragusa}, in: Algorithms for next-generation sequencing data. Techniques, approaches, and applications. Cham: Springer. 41--75 (2017; Zbl 1457.68076) Full Text: DOI References: [1] Abouelhoda, M., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2, 53-86 (2004) · Zbl 1115.92303 [2] Adjeroh, D., Bell, T., Mukherjee, A.: The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer Science & Business Media, Berlin (2008) [3] Arlazarov, V., Dinic, E., Kronrod, M., Faradzev, I.: On economical construction of the transitive closure of a directed graph. Dokl. Akad. Nauk 11, 194 (1970) · Zbl 0214.23601 [4] Bauer, M.J., Cox, A.J., Rosone, G., Sciortino, M.: Lightweight LCP construction for next-generation sequencing datasets. In: Algorithms in Bioinformatics, pp. 326-337. Springer, Berlin (2012) · Zbl 1370.68338 [5] Bauer, M.J., Cox, A.J., Rosone, G.: Lightweight algorithms for constructing and inverting the bwt of string collections. Theor. Comput. Sci. 483, 134-148 (2013) · Zbl 1292.68176 [6] Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. Fund. Inform. 56(1,2), 51-70 (2003) · Zbl 1031.68092 [7] Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.P., Rivals, E., Vingron, M.: q-gram based database searching using suffix arrays. In: Proceedings of the 3rd Annual International Conference on Computational Molecular Biology (RECOMB-99), pp. 77-83 (1999) [8] Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical Report 124, Digital SRC Research Report (1994) [9] Cazaux, B., Lecroq, T., Rivals, E.: From indexing data structures to de Bruijn graphs. In: Combinatorial Pattern Matching, pp. 89-99. Springer, Berlin (2014) · Zbl 1407.68107 [10] Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT, Cambridge, MA (2001) · Zbl 1047.68161 [11] Crochemore, M., Grossi, R., Kärkkäinen, J., Landau, G.M.: A constant-space comparison-based algorithm for computing the Burrows-Wheeler transform. In: Combinatorial Pattern Matching, pp. 74-82. Springer, Berlin (2013) · Zbl 1381.68313 [12] Döring, A., Weese, D., Rausch, T., Reinert, K.: SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinf. 9, 11 (2008) [13] Emde, A.K., Grunert, M., Weese, D., Reinert, K., Sperling, S.R.: MicroRazerS: rapid alignment of small RNA reads. Bioinformatics 26(1), 123-124 (2010) [14] Emde, A.K., Schulz, M.H., Weese, D., Sun, R., Vingron, M., Kalscheuer, V.M., Haas, S.A., Reinert, K.: Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using splazers. Bioinformatics 28(5), 619-627 (2012) [15] Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. ACM 47(6), 987-1011 (2000) · Zbl 1094.68694 [16] Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv. 45(2), 13 (2013) · Zbl 1293.68314 [17] Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552-581 (2005) · Zbl 1323.68261 [18] Ferragina, P., Gagie, T., Manzini, G.: Lightweight data indexing and compression in external memory. Algorithmica 63(3), 707-730 (2012) · Zbl 1241.68062 [19] Galil, Z., Giancarlo, R.: Data structures and algorithms for approximate string matching. J. Complexity 4(1), 33-72 (1988) · Zbl 0646.68078 [20] Giegerich, R., Kurtz, S.: A comparison of imperative and purely functional suffix tree constructions. Sci. Comput. Program. 25, 187-218 (1995) · Zbl 0853.68100 [21] Giegerich, R., Kurtz, S., Stoye, J.: Efficient implementation of lazy suffix trees. Softw. Pract. Exp. 33(11), 1035-1049 (2003) [22] Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Experimental Algorithms, pp. 326-337. Springer, Berlin (2014) [23] Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378-407 (2005) · Zbl 1092.68115 [24] Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, pp. 841-850. Society for Industrial and Applied Mathematics, Philadelphia, PA (2003) · Zbl 1092.68584 [25] Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York (1997) · Zbl 0934.68103 [26] Hamming, R.W.: Error detecting and error correcting codes. Syst. Tech. J. 29, 147-160 (1950) · Zbl 1402.94084 [27] Hon, W.K., Lam, T.W., Sadakane, K., Sung, W.K., Yiu, S.M.: A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica 48(1), 23-36 (2007) · Zbl 1123.68137 [28] Intel: Intel^®; 64 and IA-32 Architectures Optimization Reference Manual. Intel Corporation, Santa Clara, CA (2011) [29] Jacobson, G.: Space-efficient static trees and graphs. In: 30th Annual Symposium on Foundations of Computer Science, 1989, pp. 549-554. IEEE, New York (1989) [30] Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, CPM ’01, pp. 181-192. Springer, Berlin (2001) · Zbl 0990.68639 [31] Kehr, B., Weese, D., Reinert, K.: Stellar: fast and exact local alignments. BMC Bioinf. 12(Suppl. 9), S15 (2011) [32] Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3), R25 (2009) [33] Li, H., Durbin, R.: Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics 25(14), 1754-1760 (2009) [34] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., 1000 Genome Project Data Processing Subgroup: The sequence alignment/map format and SAMtools. Bioinformatics 25(16), 2078-2079 (2009) [35] Louza, F.A., Telles, G.P., Ciferri, C.D.D.A.: External memory generalized suffix and LCP arrays construction. In: Combinatorial Pattern Matching, pp. 201-210. Springer, Berlin (2013) · Zbl 1381.68072 [36] Manber, U., Myers, E.: Suffix arrays: a new method for on-line string searches. In: SODA ’90, pp. 319-327. SIAM, Philadelphia (1990) · Zbl 0800.68364 [37] Manber, U., Myers, E.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935-948 (1993) · Zbl 0784.68027 [38] Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the Burrows-Wheeler transform. Theor. Comput. Sci. 387(3), 298-312 (2007) · Zbl 1144.68024 [39] Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407-430 (2001) · Zbl 1323.68262 [40] McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262-272 (1976) · Zbl 0329.68042 [41] Morrison, D.R.: Patricia – practical algorithm to retrieve information coded in alphanumeric. J. ACM 15(4), 514-534 (1968) [42] Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39(1), 2:1-2:61 (2007) · Zbl 1321.68263 [43] Ohlebusch, E.: Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch, Bremen (2013) · Zbl 1295.92011 [44] Puglisi, S., Smyth, W., Turpin, A.: A taxonomy of suffix array construction algorithms. In: Holub, J. (ed.) Proceedings of Prague Stringology Conference ’05, Prague, pp. 1-30 (2005) [45] Rausch, T., Emde, A.K., Weese, D., Döring, A., Notredame, C., Reinert, K.: Segment-based multiple sequence alignment. Bioinformatics 24(16), i187-192 (2008) [46] Rausch, T., Koren, S., Denisov, G., Weese, D., Emde, A.K., Döring, A., Reinert, K.: A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads. Bioinformatics 25(9), 1118-1124 (2009) [47] Schulz, M.H., Weese, D., Rausch, T., Döring, A., Reinert, K., Vingron, M.: Fast and adaptive variable order Markov chain construction. In: Crandall, K., Lagergren, J. (eds.) Algorithms in Bioinformatics. Lecture Notes in Computer Science, vol. 5251, pp. 306-317. Springer, Berlin (2008) [48] Schulz, M.H., Weese, D., Holtgrewe, M., Dimitrova, V., Niu, S., Reinert, K., Richard, H.: Fiona: a parallel and automatic strategy for read error correction. Bioinformatics 30(17), i356-i363 (2014) [49] Shi, F.: Suffix arrays for multiple strings: a method for on-line multiple string searches. In: Jaffar, J., Yap, R. (eds.) Concurrency and Parallelism, Programming, Networking, and Security. Lecture Notes in Computer Science, vol. 1179, pp. 11-22. Springer, Berlin (1996). DOI 10.1007/BFb0027775. http://dx.doi.org/10.1007/BFb0027775 [50] Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22(3), 549-556 (2012) [51] Siragusa, E.: Approximate string matching for high-throughput sequencing. Ph.D. thesis, Freie Universität Berlin (2015) [52] Siragusa, E., Weese, D., Reinert, K.: Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 41(7), e78 (2013) [53] Siragusa, E., Weese, D., Reinert, K.: Scalable string similarity search/join with approximate seeds and multiple backtracking. In: Proceedings of the Joint EDBT/ICDT 2013 Workshops, pp. 370-374. ACM, New York (2013) [54] Ukkonen, E.: Approximate string-matching over suffix trees. In: Combinatorial Pattern Matching, pp. 228-242. Springer, Berlin (1993) [55] Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249-260 (1995) · Zbl 0831.68027 [56] Weese, D.: Indices and applications in high-throughput sequencing. Ph.D. thesis, Freie Universität Berlin (2013) [57] Weese, D., Schulz, M.H.: Efficient string mining under constraints via the deferred frequency index. In: Proceedings of the 8th Industrial Conference on Data Mining (ICDM ’08). LNAI, vol. 5077, pp. 374-388. Springer, Berlin (2008) [58] Weiner, P. This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.