×

zbMATH — the first resource for mathematics

Efficient pattern matching for RNA secondary structures. (English) Zbl 1331.92117
Summary: We propose efficient methods to address key pattern matching problems in RNA secondary structures using the notion of structural strings. A structural string (\(s\)-string) is composed of constant symbols and parameter symbols from the alphabets \(\Sigma\) and \(\Pi\), respectively. An individual symbol in the \(\Pi\) alphabet may be considered a complement of another unique symbol in \(\Pi\). The notion of matching constants, parameters, and complements is referred to as the structural matching (\(s\)-match) problem, which is helpful in matching RNA and previously, was solved by the structural suffix tree (sST). Other approaches to RNA matching that do not openly consider the \(s\)-match include the use of affix data structures. In this paper, we provide new data structures and algorithms to address the \(s\)-match problem. Specifically, we introduce the structural suffix array and structural longest common prefix array and then identify how to \(s\)-match with these data structures. Our new \(s\)-matching solution is then used as the framework to answer various combinatorial queries encountered in matching RNA secondary structures.

MSC:
92D20 Protein sequences, DNA sequences
68P05 Data structures
68W32 Algorithms on strings
Software:
CONTRAfold
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Adjeroh, D.; Bell, T.; Mukherjee, A., The Burrows-Wheeler transform: data compression, suffix arrays and pattern matching, (2008), Springer New York
[2] Adjeroh, D.; Nan, F., Suffix sorting via Shannon-Fano-elias codes, Algorithms, 3, 2, 145-167, (2010) · Zbl 06920517
[3] Allali, J.; Sagot, M., A new distance for high level RNA secondary structure comparison, EEE/ACM Trans. Comput. Biol. Bioinform., 2, 1, 3-14, (2005)
[4] Amir, A.; Farach, M.; Muthukrishnan, S., Alphabet dependence in parameterized matching, Inform. Process. Lett., 49, 111-115, (1994) · Zbl 0795.68077
[5] Backofen, R.; Siebert, S., Fast detection of common sequence structure patterns in rnas, J. Discrete Algorithms, 5, 2, 212-228, (2007) · Zbl 1121.92028
[6] Bafna, V.; Muthukrishnan, S.; Ravi, R., Computing similarity between RNA strings, (CPM, (1995)), 1-16
[7] Baker, B., Parameterized pattern matching by boyer-Moore-type algorithms, (SODA, (1995)), 541-550 · Zbl 0960.68587
[8] Baker, B., A theory of parameterized pattern matching: algorithms and applications, (Proceedings of STOC ’93, (1993), ACM New York), 71-80 · Zbl 1310.68098
[9] Batey, R.; Rambo, R.; Doudna, J., Tertiary motifs in RNA structure and folding, Angew. Chem. Int., 38, 2326-2343, (1999)
[10] Beal, R.; Adjeroh, D., Compressed parameterized pattern matching, (DCC’13, (2013)), 461-470
[11] Beal, R.; Adjeroh, D., P-suffix sorting as arithmetic coding, J. Discrete Algorithms, 16, 151-169, (2012) · Zbl 1257.68118
[12] Beal, R.; Adjeroh, D., Parameterized longest previous factor, Theoret. Comput. Sci., 437, 21-34, (2012) · Zbl 1247.68331
[13] Chen, H.; Condon, A.; Jababri, H., An O(\(n^5\)) algorithm for MFE prediction of kissing hairpins and 4-chains in nucleic acids, J. Comput. Biol., 16, 6, 803-815, (2009)
[14] Cole, R.; Hariharan, R., Faster suffix tree construction with missing suffix links, (STOC’00, (2000), ACM New York), 407-415 · Zbl 1296.68032
[15] Deguchi, S.; Higashijima, F.; Bannai, H.; Inenaga, S.; Takeda, M., Parameterized suffix arrays for binary strings, (PSC, Czech Republic, (2008)), 84-94
[16] Devroye, L.; Szpankowski, W.; Rais, B., A note on the height of suffix trees, SIAM J. Comput., 21, 48-53, (1992) · Zbl 0743.68073
[17] Do, C.; Woods, D.; Batzoglou, S., Contrafold: RNA secondary structure prediction without physics-based models, Bioinformatics, 22, 14, 90-98, (2006), (ISMB supplement)
[18] Gramm, J.; Guo, J.; Niedermeier, R., Pattern matching for arc-annotated sequences, ACM Trans. Algorithms, 2, 1, 44-65, (2006) · Zbl 1321.68552
[19] Gusfield, D., Algorithms on strings, trees and sequences: computer science and computational biology, (1997), Cambridge University Press Cambridge, UK · Zbl 0934.68103
[20] Heyne, S.; Will, S.; Beckstette, M.; Backofen, R., Lightweight comparison of RNAs based on exact sequence-structure matches, Bioinformatics, 25, 16, 2095-2102, (2009)
[21] Hofacker, I.; Schuster, P.; Stadler, P., Combinatorics of RNA secondary structures, Discrete Appl. Math., 88, 1-3, 207-237, (1998) · Zbl 0918.05004
[22] I, T.; Deguchi, S.; Bannai, H.; Inenaga, S.; Takeda, M., Lightweight parameterized suffix array construction, (IWOCA’09, LNCS, vol. 5874, (2009), Springer Heidelberg), 312-323 · Zbl 1267.68330
[23] Jiang, T.; Lin, G.; Ma, B.; Zhang, K., The longest common subsequence problem for arc-annotated sequences, J. Discrete Algorithms, 2, 2, 257-270, (2004) · Zbl 1118.68756
[24] Karlin, S.; Ghandour, G.; Ost, F.; Tavare, S.; Korn, L., New approaches for computer analysis of nucleic acid sequences, Proc. Natl. Acad. Sci. USA, 80, 18, 5660-5664, (1983) · Zbl 0517.92013
[25] Kasai, T.; Lee, G.; Arimura, H.; Arikawa, S.; Park, K., Linear-time longest-common-prefix computation in suffix arrays and its applications, (CPM’01, LNCS, vol. 2089, (2001)), 181-192 · Zbl 0990.68639
[26] Kosaraju, S., Faster algorithms for the construction of parameterized suffix trees, (FOCS’95, (1995), ACM Washington, DC), 631-637 · Zbl 0938.68918
[27] Lee, T.; Na, J.; Park, K., On-line construction of parameterized suffix trees, (SPIRE’09, (2009), Springer Heidelberg), 31-38
[28] Lee, T.; Na, J.; Park, K., On-line construction of parameterized suffix trees for large alphabets, Inform. Process. Lett., 111, 5, 201-207, (2011) · Zbl 1260.68459
[29] Maass, M., Linear bidirectional on-line construction of affix trees, Algorithmica, 37, 320-334, (2003) · Zbl 0964.68512
[30] Manber, U.; Myers, G., Suffix arrays: a new method for on-line string searches, SIAM J. Comput., 22, 935-948, (1993) · Zbl 0784.68027
[31] Mauri, G.; Pavesi, G., Algorithms for pattern matching and discovery in RNA secondary structure, Theoret. Comput. Sci., 335, 1, 29-51, (2005) · Zbl 1080.68098
[32] Pedersen, J.; Bejerano, G.; Siepel, A.; Rosenbloom, K.; Lindblad-Toh, K.; Lander, E.; Kent, J.; Miller, W.; Haussler, D., Identification and classification of conserved RNA secondary structures in the human genome, PLoS Comput. Biol., 2, 4, 0251-0262, (2006)
[33] Penner, R.; Waterman, M., Spaces of RNA secondary structures, Adv. Math., 101, 1, 31-49, (1993) · Zbl 0810.92011
[34] Puglisi, S.; Smyth, W.; Turpin, A., A taxonomy of suffix array construction algorithms, ACM Comput. Surv., 39, 2, 1-31, (2007)
[35] Rabani, M.; Kertesz, M.; Segal, E., Computational prediction of RNA structural motifs involved in posttranscriptional regulatory processes, Proc. Natl. Acad. Sci. USA, 105, 39, 14885-14890, (2008)
[36] Rinn, J.; Chang, H., Genome regulation by long noncoding rnas, Annu. Rev. Biochem., 81, 145-166, (2012)
[37] Shibuya, T., Generalization of a suffix tree for RNA structural pattern matching, Algorithmica, 39, 1, 1-19, (2004) · Zbl 1089.68031
[38] Strothmann, D., The affix array data structure and its applications to RNA secondary structure analysis, Theoret. Comput. Sci., 389, 1-2, 278-294, (2007) · Zbl 1146.68358
[39] Svoboda, P.; Di Cara, A.; Hairpin, RNA: a secondary structure of primary importance, Cell. Mol. Life Sci., 63, 901-918, (2006)
[40] Wan, Y.; Kertesz, M.; Spitale, R.; Segal, E.; Chang, H., Understanding the transcriptome through RNA structure, Nat. Rev., 12, 641-655, (2011)
[41] Wang, Z.; Zhang, K., Multiple RNA structure alignment, J. Bioinform. Comput. Biol., 3, 3, 609-626, (2005)
[42] Xu, Y.; Wang, L.; Zhao, H.; Li, J., Exact matching of RNA secondary structure patterns, Theoret. Comput. Sci., 335, 1, 53-66, (2005) · Zbl 1080.68023
[43] Zhang, K.; Wang, L.; Ma, B., Computing similarity between RNA structures, (CPM, (1999)), 281-293 · Zbl 1063.68625
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.