×

Mining approximate repeating patterns from sequence data with gap constraints. (English) Zbl 1235.68218

Summary: The rapid increase of available DNA, protein, and other biological sequences has made the problem of discovering meaningful patterns from sequences an important task for bioinformatics research. Among all types of patterns defined in the literature, the most challenging one is to find repeating patterns with gap constraints. In this article, we identify a new research problem for mining approximate repeating patterns (ARPs) with gap constraints, where the appearance of a pattern is subject to an approximate match, which is very common in biological sequences. To solve the problem, we propose an ArpGap (ARP mining with Gap constraints) algorithm with three major components for ARP mining: (1) a data-driven pattern generation approach to avoid generating unnecessary candidates for validation; (2) a back-tracking pattern search process to discover approximate occurrences of a pattern under user specified gap constraints; and (3) an Apriori-like deterministic pruning approach to progressively prune patterns and cease the search process if necessary. Experimental results on synthetic and real-world protein sequences assert that ArpGap is efficient in terms of memory consumption and computational cost. The results further suggest that the proposed method is practical for discovering approximate patterns for protein sequences where the sequence length is usually several hundreds to one thousand and the pattern length is relatively short.

MSC:

68T10 Pattern recognition, speech recognition
68W32 Algorithms on strings
92D20 Protein sequences, DNA sequences

Software:

REPuter
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Adebiyi, An efficient algorithm for finding short approximate non-tandem repeats, Bioinformatics 17 (Suppl 1) pp S5– (2001)
[2] Agrawal , R. R. Srikant 1994 Fast algorithms for mining association rules In Proceedings of the 20th International Conference Very Large Data Bases, Santiago de Chile, Chile, VLDB 1215 487 499
[3] Altschul, Basic local alignment search tool, Journal of Moledular Biology 215 (3) pp 403– (1990)
[4] Arimura, A polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence, International Symposium Algorithms and Computation pp 724– (2005) · Zbl 1175.68537
[5] Boeva, Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression, Bioinformatics 22 (6) pp 676– (2006)
[6] Chen, Efficient string matching with wildcards and length constraints, Knowledge and Information Systems 10 (4) pp 399– (2006)
[7] Chen, Mining frequent patterns in protein structures: a study of protease families, Bioinformatics 20 (Suppl 1) pp i77– (2004)
[8] Cuadrado, Evolutionary trends of different repetitive DNA sequences during speciation in the genus Secale, Journal of Heredity 93 (5) pp 339– (2002)
[9] Ding , B. D. Lo J. Han S. Khoo 2009 Efficient mining of closed repetitive gapped subsequences from a sequence database In Proceeding of the 25th IEEE International Conference on Data Engineering, Shanghai, China 1024 1035
[10] Eskin, Finding composite regulatory patterns in DNA sequences, Bioinformatics 18 (Suppl 1) pp S354– (2002)
[11] Exarchos, An optimized sequential pattern matching methodology for sequence classification, Knowledge and Information Systems 19 (2) pp 249– (2009) · Zbl 05685373
[12] Ferreira, Knowledge Discovery in Databases: PKDD 2005 Vol. 3721 pp 96– (2005) · Zbl 05342533
[13] GenBank. yeast (saccharomyces cerevisiae) 2010 http://www.ncbi.nlm.nih.gov/genbank
[14] Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (1997) · Zbl 0934.68103
[15] He , D. 2006 Using suffix tree to discover complex repetitive patterns in DNA sequences In Proceedings of the 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, New York 1 3474 3477
[16] He , D. X. Wu 2006 An efficient algorithm for finding approximate complex repetitive patterns In Proceedings of the IASTED International Conference on Computational and Systems Biology Acta Press Inc.
[17] He , D. X. Zhu X. Wu 2009 Approximate repeating pattern mining with gap requirements In Proceedings of the 2009 21st IEEE International Conference on Tools with Artificial Intelligence IEEE 17 24
[18] Herzel , H. O. Weiss E. Trifonov 1999 10-11 bp periodicities in complete genomes reflect protein structure and DNA folding. (1999). Bioinformatics 15 3 187 193
[19] Jiang, vEye: behavioral footprinting for self-propagating worm detection and profiling, Knowledge and Information Systems 18 (2) pp 231– (2009) · Zbl 05536492
[20] Katti, Amino acid repeat patterns in protein sequences: their diversity and structural-functional implications, PRS 9 (06) pp 1203– (2000)
[21] Kurtz, REPuter: fast computation of maximal repeats in complete genomes, Bioinformatics 15 (5) pp 426– (1999)
[22] Kurtz , S. E. Ohlebusch C. Schleiermacher J. Stoye R. Giegerich 2000 Computation and visualization of degenerate repeats in complete genomes In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, La Jolla, CA 8 228 238
[23] Kurtz, REPuter: the manifold applications of repeat analysis on a genomic scale. (2001), Nucleic Acids Research 29 (22) pp 4633– (2001)
[24] Leslie, Mismatch string kernels for SVM protein classification, Advances in Neural Information Processing Systems 15 pp 1441– (2003)
[25] Mitasiunaite , I. J. Boulicaut 2006 Looking for monotonicity properties of a similarity constraint on sequences In Proceedings of the 2006 ACM Symposium on Applied Computing 546 552
[26] Nature Reviews Microbiology Article Dataset 2006 http://www.psort.org/dataset/datasetnrm2006.html
[27] NCBI Basic Local Alignment Search Tool 2010 http://blast.ncbi.nlm.nih.gov/blast.cgi
[28] PAM250 Amino Acid Scoring Matrix 2010 http://prowl.rockefeller.edu/aainfo/pam250.htm
[29] Parida, Pattern Discovery in Bioinformatics: Theory & Algorithms (2008)
[30] Peter , A. 2003 The history of mutation pattern in human: a statistical analysis of repetitive sequences In American Physical Society, Annual APS March Meeting 2003, March 3-7, abstract #N10.003
[31] Price, De novo identification of repeat families in large genomes. (2005), Bioinformatics 21 (Suppl 1) pp i351– (2005)
[32] Prosite 2010 http://www.expasy.ch/prosite
[33] Research Collaboratory for Structural Bioinformatics (RCSB): Protein Data Bank 2010 http://www.rcsb.org/pdb/home/home.do
[34] Sagot , M. V. Escalier A. Viari H. Soldano 1998 Searching for repeated words in a text allowing for mismatches and gaps In Proceedings of the Second South American Workshop on String Processing, Viñas del Mar, Chile 87 100
[35] SVM 2010 http://www.support-vector-machines.org/
[36] Zhang , M. B. Kao D. Cheung K. Yip 2005 Mining periodic patterns with gap requirement from sequences In Proceedings of the Special Interest Group on Management of Data (SIGMOD), Baltimore, MD 623 633
[37] Zheng , J. S. Lonardi 2005 Discovery of repetitive patterns in DNA with accurate boundaries In Proceedings of the Fifth IEEE Symposium on Bioinformatics and Bioengineering, Minneapolis, MN 105 112
[38] Zhu , X. X. Wu 2007 Mining complex patterns across sequences with gap requirements In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07) 2934 2940
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.