Approximate search of short patterns with high error rates using the \(01^\ast 0\) lossless seeds. (English) Zbl 1362.68306

Summary: Approximate pattern matching is an important computational problem that has a wide range of applications in computational biology and in information retrieval. However, searching a short pattern in a text with high error rates (10–20%) under the Levenshtein distance is a task for which few efficient solutions exist. Here we address this problem by introducing a new type of seeds: the \(01^\ast 0\) seeds. These seeds are made of two exact parts separated by parts with exactly one error. We show that those seeds are lossless, and we apply them to two filtration algorithms for two popular applications, one where a compressed index is built on the text and another one where the patterns are indexed. We also demonstrate experimentally the advantages of our approach compared to alternative methods implementing other types of seeds. This work opens the way to the design of more efficient and more sensitive text algorithms.


68W32 Algorithms on strings
92D10 Genetics and epigenetics
Full Text: DOI


[1] Altschul, S.; Gish, W.; Miller, W.; Myers, E.; Lipman, D., Basic local alignment search tool, J. Mol. Biol., 215, 3, 403-410, (1990)
[2] Baeza-Yates, R.; Perleberg, C., Fast and practical approximate string matching, Inf. Process. Lett., 59, 1, 21-27, (1996) · Zbl 1046.68514
[3] Belazzougui, D., Improved space-time tradeoffs for approximate full-text indexing with one edit error, Algorithmica, 1-27, (2014)
[4] Belazzougui, D.; Cunial, F.; Kärkkäinen, J.; Mäkinen, V., Versatile succinct representations of the bidirectional Burrows-Wheeler transform, (Algorithms, ESA, Lect. Notes Comput. Sci., vol. 8125, (2013), Springer), 133-144 · Zbl 1394.68449
[5] Brown, D. G.; Li, M.; Ma, B., A tutorial of recent developments in the seeding of local alignment, J. Bioinform. Comput. Biol., 2, 4, 819-842, (2004)
[6] Burkhardt, S.; Kärkkäinen, J., One-gapped q-Gram filters for levenshtein distance, (Combinatorial Pattern Matching, Lect. Notes Comput. Sci., vol. 2373, (2002)), 225-234 · Zbl 1077.68945
[7] Chan, H.-L.; Lam, T.-W.; Sung, W.-K.; Tam, S.-L.; Wong, S.-S., A linear size index for approximate pattern matching, J. Discret. Algorithms, 9, 4, 358-364, (2011) · Zbl 1230.68223
[8] Chávez, E.; Navarro, G., A metric index for approximate string matching, (Rajsbaum, S., LATIN: Theoretical Informatics, Lect. Notes Comput. Sci., vol. 2286, (2002), Springer), 181-195 · Zbl 1059.68637
[9] Chorostecki, U.; Crosa, V.; Lodeyro, A.; Bologna, N.; Martin, A.; Carrillo, N.; Schommer, C.; Palatnik, J., Identification of new microrna-regulated genes by conserved targeting in plant species, Nucleic Acids Res., 40, 18, 8893-8904, (2012)
[10] Döring, A.; Weese, D.; Rausch, T.; Reinert, K., Seqan an efficient, generic C++ library for sequence analysis, BMC Bioinform., 9, 1, 11-19, (2008)
[11] Ferragina, P.; Manzini, G., Indexing compressed text, J. ACM, 52, 4, 552-581, (2005) · Zbl 1323.68261
[12] Ferragina, P.; Manzini, G.; Mäkinen, V.; Navarro, G., Compressed representations of sequences and full-text indexes, ACM Trans. Algorithms, 3, 2, (2007) · Zbl 1321.68263
[13] Ferragina, P.; González, R.; Navarro, G.; Venturini, R., Compressed text indexes: from theory to practice, J. Exp. Algorithmics, 13, 12, (2009) · Zbl 1284.68255
[14] Hyyrö, H., A bit-vector algorithm for computing levenshtein and damerau edit distances, Nord. J. Comput., 10, 1, 29-39, (2003) · Zbl 1065.68057
[15] Kärkkäinen, J.; Na, J. C., Faster filters for approximate string matching, (ALENEX, (2007), SIAM), 84-90
[16] Keich, U.; Li, M. L.; Ma, B.; Tromp, J., On spaced seeds for similarity search, Discrete Appl. Math., 138, 3, 253-263, (2004) · Zbl 1043.92009
[17] Kozomara, A.; Griffiths-Jones, S., Mirbase: annotating high confidence micrornas using deep sequencing data, Nucleic Acids Res., 42, D68-D73, (2014)
[18] Kucherov, G.; Noé, L.; Roytberg, M., A unifying framework for seed sensitivity and its application to subset seeds, J. Bioinform. Comput. Biol., 4, 2, 553-569, (2006)
[19] Kucherov, G.; Noé, L.; Roytberg, M., Subset seed automaton, (International Conference on Implementation and Application of Automata, CIAA, Lect. Notes Comput. Sci., vol. 4783, (2007)), 180-191 · Zbl 1139.68369
[20] Kucherov, G.; Salikhov, K.; Tsur, D., Approximate string matching using a bidirectional index, (Combinatorial Pattern Matching, Lect. Notes Comput. Sci., (2014), Springer), 222-231 · Zbl 1409.68356
[21] Langmead, B.; Salzberg, S., Fast gapped-Read alignment with bowtie 2, Nat. Methods, 9, 4, 357-359, (2012)
[22] Li, H.; Durbin, R., Fast and accurate short Read alignment with Burrows-Wheeler transform, Bioinformatics, 25, 14, 1754-1760, (2009)
[23] Ma, B.; Tromp, J.; Li, M., Patternhunter - faster and more sensitive homology search, Bioinformatics, 18, 3, 440-445, (2002)
[24] Maaß, M. G.; Nowak, J., Text indexing with errors, J. Discret. Algorithms, 5, 4, 662-681, (2007) · Zbl 1158.68382
[25] Mäkinen, V.; Välimäki, N.; Laaksonen, A.; Katainen, R., Unified view of backward backtracking in short Read mapping, (Algorithms and Applications, (2010), Springer), 182-195 · Zbl 1284.92075
[26] Navarro, G.; Baeza-Yates, R., A hybrid indexing method for approximate string matching, J. Discret. Algorithms, 1, 19-27, (2001)
[27] Petri, M.; Culpepper, J. S., Efficient indexing algorithms for approximate pattern matching in text, (Proceedings of the Seventeenth Australasian Document Computing Symposium, ADCS’12, (2012), ACM New York, NY, USA), 9-16
[28] Russo, L.; Navarro, G.; Oliveira, A. L.; Morales, P., Approximate string matching with compressed indexes, Algorithms, 2, 3, 1105-1136, (2009) · Zbl 1461.68271
[29] Schbath, S.; Martin, V.; Zytnicki, M.; Fayolle, J.; Loux, V.; Gibrat, J.-F., Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis, J. Comput. Biol., 19, 6, 796-813, (2012)
[30] Schnattinger, T.; Ohlebusch, E.; Gog, S., Bidirectional search in a string with wavelet trees, (Combinatorial Pattern Matching, Lect. Notes Comput. Sci., vol. 6129, (2010), Springer), 40-50 · Zbl 1286.68533
[31] Slater, G.; Birney, E., Automated generation of heuristics for biological sequence comparison, BMC Bioinform., 6, 1-11, (2005)
[32] Thompson, J.; Milos, P., The properties and applications of single-molecule DNA sequencing, Genome Biol., 12, 2, 217-226, (2011)
[33] Vroland, C.; Salson, M.; Touzet, H., Lossless seeds for searching short patterns with high error rates, (IWOCA (International Workshop On Combinatorial Algorithms), Lect. Notes Comput. Sci., vol. 8986, (2014)), 364-375 · Zbl 1401.68259
[34] Wagner, R.; Fischer, M., The string-to-string correction problem, J. ACM, 21, 1, 168-173, (1974) · Zbl 0278.68032
[35] Weese, D.; Holtgrewe, M.; Reinert, K., Razers 3: faster, fully sensitive Read mapping, Bioinformatics, 28, 20, 2592-2599, (2012)
[36] Wu, S.; Manber, U., Fast text searching: allowing errors, Commun. ACM, 35, 10, 83-91, (1992)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.