×

Fixed block compression boosting in FM-indexes: theory and practice. (English) Zbl 1422.68047

Summary: The FM index [P. Ferragina and G. Manzini, J. ACM 52, No. 4, 552–581 (2005; Zbl 1323.68261)] is a widely-used compressed data structure that stores a string \(T\) in a compressed form and also supports fast pattern matching queries. In this paper, we describe new FM-index variants that combine nice theoretical properties, simple implementation and improved practical performance. Our main theoretical result is a new technique called fixed block compression boosting, which is a simpler and faster alternative to optimal compression boosting and implicit compression boosting used in previous FM-indexes. We also describe several new techniques for implementing fixed-block boosting efficiently, including a new, fast, and space-efficient implementation of wavelet trees. Our extensive experiments show the new indexes to be consistently fast and small relative to the state-of-the-art, and thus they make a good “off-the-shelf” choice for many applications.

MSC:

68P05 Data structures
68P30 Coding and information theory (compaction, compression, models of communication, encoding schemes, etc.) (aspects in computer science)

Citations:

Zbl 1323.68261

Software:

LZ77; Soap
PDFBibTeX XMLCite
Full Text: DOI Link

References:

[1] Belazzougui, D., Gagie, T., Mäkinen, V., Previtali, M., Puglisi, S.J.: Bidirectional variable-order de Bruijn graphs. In: 12th Latin American Theoretical Informatics Symposium (LATIN 2016), Ensenada, 11-15 Apr 2016. LNCS, vol. 9644, pp. 164-178. Springer (2016) · Zbl 1415.68063
[2] Boucher, C., Bowe, A., Gagie, T., Puglisi, S.J., Sadakane, K.: Variable-order de Bruijn graphs. In: 2015 Data Compression Conference (DCC 2015), Snowbird, 7-9 Apr 2015, pp. 383-392. IEEE (2015)
[3] Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn graphs. In: 12th Workshop on Algorithms in Bioinformatics (WABI 2012), Ljubljana, 10-12 Sept 2012. LNCS, vol. 7534, pp. 225-235. Springer (2012) · Zbl 1414.68020
[4] Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation, Palo Alto, California (1994)
[5] Claude, F., Navarro, G., Pereira, A.O.: The wavelet matrix: an efficient wavelet tree for large alphabets. Inf. Syst. 47, 15-32 (2015) · doi:10.1016/j.is.2014.06.002
[6] Ferragina, P., Giancarlo, R., Manzini, G., Sciortino, M.: Boosting textual compression in optimal linear time. J. ACM 52(4), 688-713 (2005) · Zbl 1323.68260 · doi:10.1145/1082036.1082043
[7] Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: from theory to practice. ACM J. Exp. Algorithmics 13, 1.12-1.31 (2009) · Zbl 1284.68255
[8] Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552-581 (2005) · Zbl 1323.68261 · doi:10.1145/1082036.1082039
[9] Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: An alphabet-friendly FM-index. In: 11th International Symposium on String Processing and Information Retrieval (SPIRE 2004), Padova, 5-8 Oct 2004. LNCS, vol. 3246, pp. 150-160. Springer (2004) · Zbl 1111.68429
[10] Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), Article 20 (2007) · Zbl 1321.68263
[11] Ferragina, P., Nitto, I., Venturini, R.: On optimally partitioning a text to improve its compression. Algorithmica 61(1), 51-74 (2011) · Zbl 1221.68302 · doi:10.1007/s00453-010-9437-6
[12] Flicek, P., Birney, E.: Sense from sequence reads: methods for alignment and assembly. Nat. Methods 6(11 Suppl), S6-S12 (2009) · doi:10.1038/nmeth.1376
[13] Gog, S., Kärkkäinen, J., Kempa, D., Petri, M., Puglisi, S.J.: Faster, minuter. In: 2016 Data Compression Conference (DCC 2016), Snowbird, 30 Mar-1 Apr, 2016, pp. 53-62. IEEE (2016)
[14] Gog, S., Petri, M.: Optimized succinct data structures for massive data. Softw. Pract. Exp. 44(11), 1287-1314 (2014) · doi:10.1002/spe.2198
[15] Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: 14th Annual Symposium on Discrete Algorithms (SODA 2003), Baltimore, 12-14 Jan 2003, pp. 841-850. SIAM (2003) · Zbl 1092.68584
[16] Grossi, R., Gupta, A., Vitter, J.S.: When indexing equals compression: experiments with compressing suffix arrays and applications. In: 15th Annual Symposium on Discrete Algorithms (SODA 2004), New Orleans, 11-14 Jan 2004, pp. 636-645. SIAM (2004) · Zbl 1318.68079
[17] Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Hybrid compression of bitvectors for the FM-index. In: 2014 Data Compression Conference (DCC 2014), Snowbird, 26-28 Mar 2014, pp. 302-311. IEEE (2014)
[18] Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Lempel-Ziv parsing in external memory. In: 2014 Data Compression Conference (DCC 2014), Snowbird, 26-28 Mar 2014, pp. 153-162. IEEE (2014)
[19] Kärkkäinen, J., Puglisi, S.J.: Fixed block compression boosting in FM-indexes. In: 18th International Symposium on String Processing and Information Retrieval (SPIRE 2011), Pisa, 17-21 Oct 2011. LNCS, vol. 7024, pp. 174-184. Springer (2011)
[20] Kempa, D., Puglisi, S.J.: Lempel-Ziv factorization: simple, fast, practical. In: 15th Meeting on Algorithm Engineering and Experiments (ALENEX 2013), New Orleans, 7 Jan 2013, pp. 103-112. SIAM (2013) · Zbl 1430.68462
[21] Li, R., Yu, C., Li, Y., Lam, T.W., Yiu, S.M., Kristiansen, K., Wang, J.: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15), 1966-1967 (2009) · doi:10.1093/bioinformatics/btp336
[22] Mäkinen, V., Belazzougui, D., Cunial, F., Tomescu, A.I.: Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, Cambridge (2015) · doi:10.1017/CBO9781139940023
[23] Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nord. J. Comput. 12(1), 40-66 (2005) · Zbl 1085.68031
[24] Mäkinen, V., Navarro, G.: Implicit compression boosting with applications to self-indexing. In: 14th International Symposium on String Processing and Information Retrieval (SPIRE 2007), Santiago, 29-31 Oct 2007. LNCS, vol. 4726, pp. 229-241. Springer (2007)
[25] Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281-308 (2010) · doi:10.1089/cmb.2009.0169
[26] Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407-430 (2001) · Zbl 1323.68262 · doi:10.1145/382780.382782
[27] Navarro, G., Providel, E.: Fast, small, simple rank/select on bitmaps. In: 11th International Symposium on Experimental Algorithms (SEA 2012), Bordeaux, 7-9 June 2012. LNCS, vol. 7276, pp. 295-306. Springer (2012)
[28] Okanohara, D., Sadakane, K.: Practical entropy-compressed rank/select dictionary. In: 9th Meeting on Algorithm Engineering and Experiments (ALENEX 2007), New Orleans, 6 Jan 2007, pp. 60-70. SIAM (2007) · Zbl 1428.68134
[29] Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms 3(4), 43 (2007) · Zbl 1446.68046 · doi:10.1145/1290672.1290680
[30] Shareghi, E., Petri, M., Haffari, G., Cohn, T.: Compact, efficient and unlimited capacity: language modeling with compressed suffix trees. In: 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, 17-21 Sept 2015, pp. 2409-2418. The Association for Computational Linguistics (2015)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.