×

Compressed spaced suffix arrays. (English) Zbl 1378.68032

Summary: As a first step in designing relatively-compressed data structures – i.e., such that storing an instance for one dataset helps us store instances for similar datasets – we consider how to compress spaced suffix arrays relative to normal suffix arrays and still support fast access to them. This problem is of practical interest when performing similarity search with spaced seeds because using several seeds in parallel significantly improves their performance, but with existing approaches we keep a separate linear-space hash table or spaced suffix array for each seed. We first prove a theoretical upper bound on the space needed to store a spaced suffix array when we already have the suffix array. We then present experiments indicating that our approach works even better in practice.

MSC:

68P05 Data structures
PDFBibTeX XMLCite
Full Text: DOI arXiv Link

References:

[1] Barbay, J., Claude, F., Gagie, T., Navarro, G., Nekrich, Y.: Efficient fully-compressed sequence representations. Algorithmica 69, 232-268 (2014) · Zbl 1307.68029 · doi:10.1007/s00453-012-9726-3
[2] Battaglia, G., Cangelosi, D., Grossi, R., Pisanti, N.: Masking patterns in sequences: a new class of motif discovery with don’t cares. Theor. Comput. Sci. 410, 4327-4340 (2009) · Zbl 1187.68288 · doi:10.1016/j.tcs.2009.07.014
[3] Belazzougui, D., Gagie, T., Gog, S., Manzini, G., Sirén, J.: Relative FM-indexes. In: Proceedings of the 21st Symposium on String Processing and Information Retrieval (SPIRE), pp. 52-64 (2014) · Zbl 1187.68288
[4] Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithms 11(4) (2015) · Zbl 1325.68307
[5] Boucher, C., Bowe, A., Gagie, T., Manzini, G., Sirén, J.: Relative select. In: Proceedings of the 22nd Symposium on String Processing and Information Retrieval (SPIRE), pp. 149-155 (2015)
[6] Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn graphs. In: Proceedings of the 12th Workshop on Algorithms in Bioinformatics (WABI), pp. 225-235 (2012) · Zbl 1414.68020
[7] Brown, DG; Mǎndoiu, I. (ed.); Zelikovsky, A. (ed.), A survey of seeding for sequence alignment, 126-152 (2008), Hoboken
[8] Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. Fundamenta Informicae 56, 51-70 (2003) · Zbl 1031.68092
[9] Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994) · Zbl 0586.68034
[10] Crochemore, M., Tischler, G.: The gapped suffix array: a new index structure for fast approximate matching. In: Proceedings of the 17th Symposium on String Processing and Information Retrieval (SPIRE), pp. 359-364 (2010)
[11] David, M., Dzamba, M., Lister, D., Ilie, L., Brudno, M.: SHRiMP2: sensitive yet practical short read mapping. Bioinformatics 27, 1011-1012 (2011) · doi:10.1093/bioinformatics/btr046
[12] Egidi, L., Manzini, G.: Better spaced seeds using quadratic residues. J. Comput. Syst. Sci. 79, 1144-1155 (2013) · Zbl 1311.68204 · doi:10.1016/j.jcss.2013.03.002
[13] Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52, 552-581 (2005) · Zbl 1323.68261 · doi:10.1145/1082036.1082039
[14] Gagie, T., Manzini, G., Valenzuela, D.: Compressed spaced suffix arrays. In: Proceedings of the 2nd International Conference on Algorithms for Big Data (ICABD), pp. 37-45 (2014) · Zbl 1378.68032
[15] Gagie, T., Navarro, G., Puglisi, S.J., Sirén, J.: Relative compressed suffix trees. Technical Report. arXiv:1508.02550 (2015) · Zbl 1165.94312
[16] Homer, N., Merriman, B., Nelson, S.F.: BFAST: an alignment tool for large scale genome resequencing. PLOS One 4, e7767 (2009) · doi:10.1371/journal.pone.0007767
[17] Ilie, L., Ilie, S., Khoshraftar, S., Mansouri Bigvand, A.: Seeds for effective oligonucleotide design. BMC Genomics 12, 280 (2011) · doi:10.1186/1471-2164-12-280
[18] Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., McVean, G.: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 44, 226-232 (2012) · doi:10.1038/ng.1028
[19] Kiełbasa, S.M., Wan, R., Sato, K., Horton, P., Frith, M.C.: Adaptive seeds tame genomic sequence comparison. Genome Res. 21, 487-493 (2011) · doi:10.1101/gr.113985.110
[20] Langmeand, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357-359 (2012) · doi:10.1038/nmeth.1923
[21] Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440-445 (2002) · doi:10.1093/bioinformatics/18.3.440
[22] Peterlongo, P., Pisanti, N., Boyer, F., Pereira do Lago, A., Sagot, M.: Lossless filter for multiple repetitions with Hamming distance. J. Discrete Algorithms 6(3), 497-509 (2008) · Zbl 1165.94312 · doi:10.1016/j.jda.2007.03.003
[23] Russo, L.M.S., Tischler, G.: Succinct gapped suffix arrays. In: Proceedings of the 17th Symposium on String Processing and Information Retrieval (SPIRE), pp. 290-294 (2011)
[24] Sun, Y., Buhler, J.: Designing multiple simultaneous seeds for DNA similarity search. J. Comput. Biol. 12, 847-861 (2005) · doi:10.1089/cmb.2005.12.847
[25] Supowit, K.J.: Decomposing a set of points into chains, with applications to permutation and circle graphs. Inform. Process. Lett. 21, 249-252 (1985) · Zbl 0586.68034 · doi:10.1016/0020-0190(85)90093-6
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.