Large-scale frequent stem pattern mining in RNA families. (English) Zbl 1406.92451

Summary: Functionally similar non-coding RNAs are expected to be similar in certain regions of their secondary structures. These similar regions are called common structure motifs, and they are structurally conserved throughout evolution to maintain their functional roles. Identifying common structure motifs is one of the critical tasks in RNA secondary structure analysis. Nevertheless, current approaches suffer from several limitations and/or do not scale with both structure size and the number of input secondary structures. In this work, we present a method that transforms conserved base-pair stems into transaction items and applies frequent itemset mining to identify common structure motifs present in a majority of the input structures. Our experimental results on telomerase and ribosomal RNA secondary structures report frequent stem patterns that are of biological significance. Moreover, the algorithms used in our method are scalable, so frequent stem patterns can be identified efficiently among many large structures.
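The idea of treating stems as transaction items can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each input structure has already been reduced to a set of hypothetical stem labels ("S1", "S2", ...), and it uses a plain Apriori-style level-wise search (in the spirit of the association-rule mining of Agrawal et al.) rather than whichever mining algorithm the paper employs. The function name `frequent_itemsets` and the example data are invented for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least
    min_support (a fraction in (0, 1]) of the transactions."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    # Level 1 candidates: every single item seen in any transaction.
    candidates = {frozenset([i]) for t in transactions for i in t}
    result = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)

        # Build (k+1)-item candidates from pairs of frequent k-itemsets,
        # keeping only those whose k-item subsets are all frequent.
        k += 1
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in frequent for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return result


# Hypothetical input: one transaction of stem labels per RNA structure.
structures = {
    "rna_a": {"S1", "S2", "S3"},
    "rna_b": {"S1", "S2", "S4"},
    "rna_c": {"S1", "S2", "S3", "S5"},
    "rna_d": {"S2", "S3"},
}

# Stem patterns shared by at least 75% of the structures.
patterns = frequent_itemsets(structures.values(), min_support=0.75)
for itemset, count in sorted(patterns.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), count)
```

In this toy input, the pairs {S1, S2} and {S2, S3} each occur in three of the four structures and are reported as frequent, while {S1, S3} occurs in only two and is discarded. How stems are actually matched across structures and mapped to shared item labels is the substantive step of the method and is not modelled here.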


92D20 Protein sequences, DNA sequences

