
Interval-based distance function for identifying RNA structure candidates. (English) Zbl 1307.92306

Summary: Many clustering approaches have been developed for biological data analysis; however, applying traditional clustering algorithms to RNA structure data remains challenging because of the complex secondary structures involved. One of the most critical issues in cluster analysis is the development of appropriate distance measures in high-dimensional space. Traditional distance measures focus on scale issues but ignore the correlation between two values. This article develops a novel interval-based (Hausdorff) distance measure for computing the similarity between characterized structures. Three relationships between intervals are considered: perfect match, partial overlap, and non-overlap. Finally, we demonstrate the method by analyzing a data set of RNA secondary structures from the Rfam database.
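
As a rough illustration of the idea, and not the authors' actual formulation, the sketch below computes the standard Hausdorff distance between two closed intervals on the real line, which reduces to max(|a1 - b1|, |a2 - b2|), and labels a pair with one of the three relationships named in the summary. The function names, the (lo, hi) tuple representation, and the choice to treat containment as partial overlap are assumptions made for this example only.

```python
def hausdorff_interval(a, b):
    """Hausdorff distance between closed intervals a = (lo, hi) and b = (lo, hi).

    For closed intervals on the real line this is the larger of the two
    endpoint differences. Hypothetical helper, not taken from the paper.
    """
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    if a_lo > a_hi or b_lo > b_hi:
        raise ValueError("interval endpoints must satisfy lo <= hi")
    return max(abs(a_lo - b_lo), abs(a_hi - b_hi))


def relationship(a, b):
    """Classify two intervals as perfect match, partially overlapped, or non-overlapped.

    Assumption: an interval contained in another counts as partially overlapped.
    """
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    if (a_lo, a_hi) == (b_lo, b_hi):
        return "perfect match"
    if a_hi < b_lo or b_hi < a_lo:
        return "non-overlapped"
    return "partially overlapped"


if __name__ == "__main__":
    # Illustrative intervals only; these are not data from Rfam.
    pairs = [((1, 5), (1, 5)), ((1, 5), (3, 8)), ((1, 2), (5, 9))]
    for a, b in pairs:
        print(a, b, relationship(a, b), hausdorff_interval(a, b))
```

In this sketch a perfect match has distance 0, and the distance grows as the endpoints drift apart, so the same function covers all three relationships without special cases.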

MSC:

92D20 Protein sequences, DNA sequences

Software:

Rfam

References:

[1] Agrawal, R.; Imielinski, T.; Swami, A., Mining association rules between sets of items in large databases, (Proceedings of the ACM SIGMOD Conference on Management of Data (1993)), 207-216
[2] Arkhangel’skii, A. V.; Pontryagin, L. S., General topology I: basic concepts and constructions dimension theory, (Encyclopedia of Mathematical Sciences (1990), Springer)
[3] Chan, K.; Fu, W., Efficient time series matching by wavelets, (Proceedings of the 15th IEEE International Conference on Data Engineering (1999)), 126-133
[4] Ester, M.; Kriegel, H. P.; Sander, J.; Xu, X., A density-based algorithm for discovering clusters in large spatial databases with noise, (Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996), AAAI Press), 226-231
[5] Faloutsos, C.; Barber, R.; Flickner, M.; Hafner, J.; Niblack, W.; Petkovic, D.; Equitz, W., Efficient and effective querying by image content, Journal of Intelligent Information Systems, 3, 3-4, 231-262 (1994)
[6] Francois, D.; Christian, H.; Kay, J. W., A growth model for RNA secondary structures, Journal of Statistical Mechanics: Theory and Experiment, 04008 (2008)
[7] Gardner, P. P.; Daub, J.; Tate, J. G.; Nawrocki, E. P.; Kolbe, D. L.; Lindgreen, S.; Wilkinson, A. C.; Finn, R. D.; Griffiths-Jones, S.; Eddy, S. R.; Bateman, A., Rfam: updates to the RNA families database, Nucleic Acids Research, 37, 136-140 (2009)
[8] Han, J. W.; Kamber, M., Data Mining: Concepts and Techniques (2006), Morgan Kaufmann Publishers · Zbl 1445.68004
[9] Heyer, L. J.; Kruglyak, S.; Yooseph, S., Exploring expression data: identification and analysis of coexpressed genes, Genome Research, 9, 1106-1115 (1999)
[10] ⟨http://toolkit.tuebingen.mpg.de/blastclust⟩
[11] Janssen, S.; Reeder, J.; Giegerich, R., Shape based indexing for faster search of RNA family databases, BMC Bioinformatics, 9, 131 (2008)
[12] Lawrence, H. (Ed.), Artificial Intelligence and Molecular Biology (1993), MIT Press
[13] Lu, Y.; Lu, S. Y.; Fotouhi, F.; Deng, Y. P.; Brown, S., Incremental genetic K-means algorithm and its application in gene expression data analysis, BMC Bioinformatics, 5, 172 (2004)
[14] Mattick, J. S., RNA regulation: a new genetics, Nature Reviews Genetics, 5, 4, 316-323 (2004)
[15] Napthine, S.; Liphardt, J.; Bloys, A.; Routledge, S.; Brierley, I., The role of RNA pseudoknot stem 1 length in the promotion of efficient −1 ribosomal frameshifting, Journal of Molecular Biology, 288, 305-320 (1999)
[16] Ng, R. T.; Han, J., Efficient and effective clustering methods for spatial data mining, (Proceedings of the 20th VLDB Conference (1994)), 144-155
[17] Pang, K. C.; Stephen, S.; Dinger, M. E.; Engström, P. G.; Lenhard, B.; Mattick, J. S., RNAdb 2.0—an expanded database of mammalian non-coding RNAs, Nucleic Acids Research, 35, D178-D182 (2007)
[18] Panchenko, A. R.; Madej, T., Structural similarity of loops in protein families: toward the understanding of protein evolution, BMC Evolutionary Biology, 5, 1, 10 (2005)
[19] Searls, D. B., The language of genes, Nature, 420, 211-217 (2002)
[20] van Batenburg, F. H.; Gultyaev, A. P.; Pleij, C. W., PseudoBase: structural information on RNA pseudoknots, Nucleic Acids Research, 29, 1, 194-195 (2001)
[21] van Batenburg, F. H.; Gultyaev, A. P.; Pleij, C. W.; Ng, J.; Oliehoek, J., PseudoBase: a database with RNA pseudoknots, Nucleic Acids Research, 28, 1, 201-204 (2000)
[22] Xu, X.; Ji, Y.; Stormo, G. D., RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment, Bioinformatics, 23, 15, 1883-1891 (2007)
[23] Zakai, M., General distance criteria, IEEE Transactions on Information Theory, 10, 1, 94-95 (1964) · Zbl 0116.37605
[24] Zhang, S. J.; Haas, B.; Eskin, E.; Bafna, V., Searching genomes for noncoding RNA using FastR, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2, 4, 366-379 (2005)
[25] Zhang, T.; Ramakrishnan, R.; Livny, M., BIRCH: An efficient data clustering method for very large databases, (Proceedings of ACM SIGMOD Conference (1996)), 103-114
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.