×

The exact multiple pattern matching problem solved by a reference tree approach. (English) Zbl 1517.68439

Summary: Given a text \(T\) and a set of \(r\) patterns \(P_1, P_2, \ldots, P_r\), the exact multiple pattern matching problem reports the ending positions of all occurrences of \(P_i\) in \(T\) for \(1 \leq i \leq r\). By transforming all substrings with a fixed length of \(T\) into a reference tree such that each internal node stores a reference string, the exact multiple pattern matching problem can be efficiently solved by searching patterns in the tree via the guidance of the reference strings. We design elegant algorithms to construct the reference tree (the preprocessing phase) and to search patterns in the tree (the searching phase) using bitwise operations. The experiments involving problem instances from the DNA sequence and the English language are conducted to compare the performance of our approach against those of the suffix tree and suffix array algorithms. The computational results demonstrate the advantage of our approach over these algorithms. In spite of the simplicity, our approach is quite efficient, flexible and robust.

MSC:

68W32 Algorithms on strings
68U15 Computing methodologies for text processing; mathematical typography
92D20 Protein sequences, DNA sequences
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Abouelhoda, Mohamed I.; Kurtz, Stefan; Ohlebusch, Enno, Replacing suffix trees with enhanced suffix arrays, J. Discret. Algorithms, 2, 1, 53-86 (2004) · Zbl 1115.92303
[2] Agarwal, Rachit; Khandelwal, Anurag; Stoica, Ion, Succinct: enabling queries on compressed data, (12th USENIX Symposium on Networked Systems Design and Implementation. 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 15, Oakland, CA, USA, May 4-6, 2015 (2015)), 337-350
[3] Assefa, Samuel A.; Keane, Thomas M.; Otto, Thomas D.; Newbold, Chris; Berriman, Matthew, ABACAS: algorithm-based automatic contiguation of assembled sequences, Bioinformatics, 25, 15, 1968-1969 (2009)
[4] Bankevich, Anton; Nurk, Sergey; Antipov, Dmitry; Gurevich, Alexey A.; Dvorkin, Mikhail; Kulikov, Alexander S.; Lesin, Valery M.; Nikolenko, Sergey I.; Pham, Son K.; Prjibelski, Andrey D.; Pyshkin, Alex; Sirotkin, Alexander; Vyahhi, Nikolay; Tesler, Glenn; Alekseyev, Max A.; Pevzner, Pavel A., Spades: a new genome assembly algorithm and its applications to single-cell sequencing, J. Comput. Biol., 19, 5, 455-477 (2012)
[5] Chain, P. S.G.; Grafham, D. V.; Fulton, R. S.; FitzGerald, M. G.; Hostetler, J.; Muzny, D.; Ali, J.; Birren, B.; Bruce, D. C.; Buhay, C.; Cole, J. R.; Ding, Y.; Dugan, S.; Field, D.; Garrity, G. M.; Gibbs, R.; Graves, T.; Han, C. S.; Harrison, S. H.; Highlander, S.; Hugenholtz, P.; Khouri, H. M.; Kodira, C. D.; Kolker, E.; Kyrpides, N. C.; Lang, D.; Lapidus, A.; Malfatti, S. A.; Markowitz, V.; Metha, T.; Nelson, K. E.; Parkhill, J.; Pitluck, S.; Qin, X.; Read, T. D.; Schmutz, J.; Sozhamannan, S.; Sterk, P.; Strausberg, R. L.; Sutton, G.; Thomson, N. R.; Tiedje, J. M.; Weinstock, G.; Wollam, A.; Detter, J. C., Genome project standards in a new era of sequencing, Science, 326, 5950, 236-237 (2009)
[6] Claude, Francisco; Konow, Roberto; Navarro, Gonzalo, Efficient indexing and representation of web access logs, (String Processing and Information Retrieval - Proceedings of the 21st International Symposium. String Processing and Information Retrieval - Proceedings of the 21st International Symposium, SPIRE 2014, Ouro Preto, Brazil, October 20-22, 2014 (2014)), 65-76
[7] Crochemore, Maxime; Hancart, Christophe; Lecroq, Thierry, Algorithms on Strings (2007), Cambridge University Press · Zbl 1137.68060
[8] Alves da Louza, Felipe; Gog, Simon; Zanotto, Leandro; Araujo, Guido; Telles, Guilherme P., Parallel computation for the all-pairs suffix-prefix problem, (String Processing and Information Retrieval - Proceedings of the 23rd International Symposium. String Processing and Information Retrieval - Proceedings of the 23rd International Symposium, SPIRE 2016, Beppu, Japan, October 18-20, 2016 (2016)), 122-132 · Zbl 1397.68243
[9] Ferragina, Paolo; Manzini, Giovanni; Mäkinen, Veli; Navarro, Gonzalo, Compressed representations of sequences and full-text indexes, ACM Trans. Algorithms, 3, 2, 20 (2007) · Zbl 1321.68263
[10] Galardini, Marco; Biondi, Emanuele G.; Bazzicalupo, Marco; Mengoni, Alessio, Contiguator: a bacterial genomes finishing tool for structural insights on draft genomes, Source Code Biol. Med., 6, 11 (2011)
[11] Gog, Simon; Ohlebusch, Enno, Fast and lightweight lcp-array construction algorithms, (Proceedings of the Thirteenth Workshop on Algorithm Engineering and Experiments. Proceedings of the Thirteenth Workshop on Algorithm Engineering and Experiments, ALENEX 2011, Holiday Inn San Francisco Golden Gateway, San Francisco, California, USA, January 22, 2011 (2011)), 25-34 · Zbl 1430.68041
[12] Gog, Simon; Beller, Timo; Moffat, Alistair; Petri, Matthias, From theory to practice: plug and play with succinct data structures, (Experimental Algorithms - Proceedings of the 13th International Symposium. Experimental Algorithms - Proceedings of the 13th International Symposium, SEA 2014, Copenhagen, Denmark, June 29 - July 1, 2014 (2014)), 326-337
[13] Grossi, Roberto; Gupta, Ankur; Vitter, Jeffrey Scott, High-order entropy-compressed text indexes, (Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 12-14, 2003, Baltimore, Maryland, USA (2003)), 841-850 · Zbl 1092.68584
[14] Hamming, Richard W., Error detecting and error correcting codes, Bell Syst. Tech. J., 29, 2, 147-160 (April 1950) · Zbl 1402.94084
[15] Warren, Henry S., Hacker’s Delight (2013), Pearson Education
[16] (Kao, Ming-Yang, Encyclopedia of Algorithms - 2016 Edition (2016), Springer) · Zbl 1341.68001
[17] Kasai, Toru; Lee, Gunho; Arimura, Hiroki; Arikawa, Setsuo; Park, Kunsoo, Linear-time longest-common-prefix computation in suffix arrays and its applications, (Combinatorial Pattern Matching, Proceedings of the 12th Annual Symposium. Combinatorial Pattern Matching, Proceedings of the 12th Annual Symposium, CPM 2001, Jerusalem, Israel, July 1-4, 2001 (2001)), 181-192 · Zbl 0990.68639
[18] Li, Zhize; Li, Jian; Huo, Hongwei, Optimal in-place suffix sorting, (2018 Data Compression Conference. 2018 Data Compression Conference, DCC 2018, Snowbird, UT, USA, March 27-30, 2018 (2018)), 422
[19] Liu, Linda; Li, Yinhu; Li, Shiliang; Hu, Ni; He, Yimin; Pong, Ray; Lin, Danni; Lu, Lihua; Law, Maggie, Comparison of next-generation sequencing systems, BioMed Res. Int., 2012, Article 251364 pp. (2012)
[20] Luo, Ruibang; Liu, Binghang; Xie, Yinlong; Li, Zhenyu; Huang, Weihua; Yuan, Jianying; He, Guangzhu; Chen, Yanxiang; Pan, Qi; Liu, Yunjie; Tang, Jingbo; Wu, Gengxiong; Zhang, Hao; Shi, Yujian; Liu, Yong; Yu, Chang; Wang, Bo; Lu, Yao; Han, Changlei; Cheung, David W.; Yiu, Siu-Ming; Peng, Shaoliang; Xiaoqian, Zhu; Liu, Guangming; Liao, Xiangke; Li, Yingrui; Yang, Huanming; Wang, Jian; Lam, Tak-Wah; Wang, Jun, SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler, GigaScience, 1, 1 (2012)
[21] Manber, Udi; Myers, Eugene W., Suffix arrays: a new method for on-line string searches, SIAM J. Comput., 22, 5, 935-948 (1993) · Zbl 0784.68027
[22] Mariano, Diego C.; Pereira, Felipe L.; Ghosh, Preetam; Barh, Debmalya; Figueiredo, Henrique C.; Silva, Artur; Ramos, Rommel T.; Azevedo, Vasco A., Maprepeat: an approach for effective assembly of repetitive regions in prokaryotic genomes, Bioinformation, 11, 6, 276-279 (2015)
[23] Muggli, Martin D.; Puglisi, Simon J.; Boucher, Christina, Efficient indexed alignment of contigs to optical maps, (Algorithms in Bioinformatics - Proceedings of the 14th International Workshop. Algorithms in Bioinformatics - Proceedings of the 14th International Workshop, WABI 2014, Wroclaw, Poland, September 8-10, 2014 (2014)), 68-81
[24] Navarro, Gonzalo; Raffinot, Mathieu, Flexible Pattern Matching in Strings - Practical On-line Search Algorithms for Texts and Biological Sequences (2002), Cambridge University Press · Zbl 0992.92029
[25] Ohlebusch, Enno; Fischer, Johannes; Gog, Simon, CST++, (String Processing and Information Retrieval - Proceedings of the 17th International Symposium. String Processing and Information Retrieval - Proceedings of the 17th International Symposium, SPIRE 2010, Los Cabos, Mexico, October 11-13, 2010 (2010)), 322-333
[26] Ohlebusch, Enno; Stauß, Stefan; Baier, Uwe, Trickier XBWT tricks, (String Processing and Information Retrieval - Proceedings of the 25th International Symposium. String Processing and Information Retrieval - Proceedings of the 25th International Symposium, SPIRE 2018, Lima, Peru, October 9-11, 2018 (2018)), 325-333 · Zbl 1518.68080
[27] Paulino, Daniel; Warren, René L.; Vandervalk, Benjamin P.; Raymond, Anthony; Jackman, Shaun D.; Birol, Inanç, Sealer: a scalable gap-closing application for finishing draft genomes, BMC Bioinform., 16, 230 (2015)
[28] Piro, Vitor; Faoro, Helisson; Weiss, Vinícius A.; Steffens, Maria; Pedrosa, Fabio; Souza, Emanuel; Raittz, Roberto, Fgap: an automated gap closing tool, BMC Res. Notes, 7, Article 371 pp. (2014)
[29] Poyias, Andreas; Raman, Rajeev, Improved practical compact dynamic tries, (String Processing and Information Retrieval - Proceedings of the 22nd International Symposium. String Processing and Information Retrieval - Proceedings of the 22nd International Symposium, SPIRE 2015, London, UK, September 1-4, 2015 (2015)), 324-336 · Zbl 1320.68019
[30] Rissman, Anna I.; Mau, Bob; Biehl, Bryan S.; Darling, Aaron E.; Glasner, Jeremy D.; Perna, Nicole T., Reordering contigs of draft genomes using the mauve aligner, Bioinformatics, 25, 16, 2071-2073 (2009)
[31] Russo, Luís M. S.; Navarro, Gonzalo; Oliveira, Arlindo L., Fully compressed suffix trees, ACM Trans. Algorithms, 7, 4, Article 53 pp. (2011) · Zbl 1295.68103
[32] Sadakane, Kunihiko, New text indexing functionalities of the compressed suffix arrays, J. Algorithms, 48, 2, 294-313 (2003) · Zbl 1100.68563
[33] Sadakane, Kunihiko, Compressed suffix trees with full functionality, Theory Comput. Syst., 41, 4, 589-607 (2007) · Zbl 1148.68015
[34] Fuentes Sepúlveda, José; Elejalde, Erick; Ferres, Leo; Seco, Diego, Efficient wavelet tree construction and querying for multicore architectures, (Experimental Algorithms - Proceedings of the 13th International Symposium. Experimental Algorithms - Proceedings of the 13th International Symposium, SEA 2014, Copenhagen, Denmark, June 29 - July 1, 2014 (2014)), 150-161
[35] Ukkonen, Esko, On-line construction of suffix trees, Algorithmica, 14, 3, 249-260 (1995) · Zbl 0831.68027
[36] Weiner, Peter, Linear pattern matching algorithms, (14th Annual Symposium on Switching and Automata Theory. 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973 (1973)), 1-11
[37] Zerbino, Daniel R.; Birney, Ewan, Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Res., 18, 5, 821-829 (2008)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.