zbMATH — the first resource for mathematics

Filtering degenerate patterns with application to protein sequence analysis. (English) Zbl 1461.92072
Summary: In biology, the notion of degenerate pattern plays a central role for describing various phenomena. For example, protein active site patterns, like those contained in the PROSITE database, e.g., \([FY]DPC[LIM][ASG]C[ASG]\), are, in general, represented by degenerate patterns with character classes. Researchers have developed several approaches over the years to discover degenerate patterns. Although these methods have been exhaustively and successfully tested on genomes and proteins, their outcomes often far exceed the size of the original input, making the output hard to be managed and to be interpreted by refined analysis requiring manual inspection. In this paper, we discuss a characterization of degenerate patterns with character classes, without gaps, and we introduce the concept of pattern priority for comparing and ranking different patterns. We define the class of underlying patterns for filtering any set of degenerate patterns into a new set that is linear in the size of the input sequence. We present some preliminary results on the detection of subtle signals in protein families. Results show that our approach drastically reduces the number of patterns in output for a tool for protein analysis, while retaining the representative patterns.

92D20 Protein sequences, DNA sequences
Full Text: DOI
[1] Hulo, N.; Bairoch, A.; Bulliard, V.; Cerutti, L.; Cuche, B.; de Castro, E.; Lachaize, C.; Langendijk-Genevaux, P.; Sigrist, C.; The 20 years of PROSITE; Nucleic Acids Res.: 2008; Volume 36 ,D245-D249.
[2] Parida, L.; ; Pattern Discovery in Bioinformatics: Theory and Algorithms: Boca Raton, FL, USA 2007; . · Zbl 1302.92006
[3] Jensen, K.L.; Styczynski, M.P.; Rigoutsos, I.; Stephanopoulos, G.N.; A generic motif discovery algorithm for sequential data; Bioinformatics: 2006; Volume 22 ,21-28.
[4] Abrahamson, K.; Generalized string matching; SIAM J. Comput.: 1987; Volume 16 ,1039-1051. · Zbl 0646.68079
[5] Navarro, G.; Raffinot, M.; Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching; J. Comput. Biol.: 2003; Volume 10 ,903-923.
[6] Fredriksson, K.; Grabowski, S.; Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance; Inf. Retr.: 2008; Volume 11 ,335-357.
[7] Wu, S.; Manber, U.; Fast text searching: Allowing errors; Commun. ACM: 1992; Volume 35 ,83-91.
[8] Soldano, H.; Viari, A.; Champesme, M.; Searching for flexible repeated patterns using a non-transitive similarity relation; Pattern Recognit. Lett.: 1995; Volume 16 ,233-246.
[9] Pisanti, N.; Soldano, H.; Carpentier, M.; Incremental inference of relational motifs with a degenerate Alphabet; Lect. Notes Comput. Sci.: 2005; Volume 3537 ,229-240. · Zbl 1131.68496
[10] Frith, M.C.; Saunders, N.F.W.; Kobe, B.; Bailey, T.L.; Discovering sequence motifs with arbitrary insertions and deletions; PLoS Comput. Biol.: 2008; Volume 4 .
[11] Sinha, S.; Tompa, M.; Discovery of novel transcription factor binding sites by statistical overrepresentation; Nucleic Acids Res.: 2002; Volume 30 ,5549-5560.
[12] Apostolico, A.; Comin, M.; Parida, L.; VARUN: Discovering extensible motifs under saturation constraints; IEEE/ACM Trans. Comput. Biol. Bioinforma.: 2010; Volume 7 ,752-762.
[13] Pisanti, N.; Crochemore, M.; Grossi, R.; Sagot, M.F.; Bases of motifs for generating repeated patterns with wild cards; IEEE/ACM Trans. Comput. Biol. Bioinforma.: 2005; Volume 2 ,40-50.
[14] Apostolico, A.; Comin, M.; Parida, L.; Bridging lossy and lossless compression by motif pattern discovery; Lect. Notes Comput. Sci.: 2006; Volume 4123 ,793-813. · Zbl 1158.68389
[15] Apostolico, A.; Comin, M.; Parida, L.; Motifs in Ziv-Lempel-Welch Clef; Proceedings of IEEE DCC Data Compression Conference: ; ,72-81.
[16] Apostolico, A.; Comin, M.; Parida, L.; Mining, compressing and classifying with extensible motifs; Algorithms Mol. Biol.: 2006; Volume 1 . · Zbl 1158.68389
[17] Comin, M.; Verzotto, D.; The Irredundant Class method for remote homology detection of protein sequences; J. Comput. Biol.: 2011; Volume 18 ,1819-1829.
[18] Comin, M.; Verzotto, D.; Classification of protein sequences by means of irredundant patterns; BMC Bioinforma.: 2010; Volume 11 . · Zbl 1461.92072
[19] Comin, M.; Verzotto, D.; Alignment-Free phylogeny of whole genomes using underlying subwords; BMC Algorithms Mol. Biol.: 2012; Volume 7 .
[20] Comin, M.; Parida, L.; Detection of subtle variations as consensus motifs; Theory Comput. Sci.: 2008; Volume 395 ,158-170. · Zbl 1142.68063
[21] Comin, M.; Parida, L.; Subtle Motif Discovery for Detection of Dna Regulatory Sites; Proceedings of the 5th Asia-Pacific Bioinformatics Conference, APBC: ; Volume Volume 5 ,27-36.
[22] Jensen, K.L.; Styczynski, M.P.; Rigoutsos, I.; Stephanopoulos, G.N.; A generic motif discovery algorithm for sequential data; Bioinformatics: 2006; Volume 22 ,21-28.
[23] Leslie, C.S.; Eskin, E.; Cohen, A.; Weston, J.; Noble, W.S.; Mismatch string kernels for discriminative protein classification; Bioinformatics: 2004; Volume 20 ,467-476.
[24] Dipartimento Di Ingegneria Dell’Informazione; ; .
[25] Apostolico, A.; Comin, M.; Parida, L.; Conservative extraction of over-represented extensible motifs; Bioinformatics: 2005; Volume 21 ,9-18.
[26] Mendes, N.D.; Casimiro, A.C.; Santos, P.M.; Sá-Correia, I.; Oliveira, A.L.; Freitas, A.T.; MUSA: A parameter free algorithm for the identification of biologically significant motifs; Bioinformatics: 2006; Volume 22 ,2996-3002.
[27] Peng, C.H.; Hsu, J.T.; Chung, Y.S.; Lin, Y.J.; Chow, W.Y.; Hsu, D.F.; Tang, C.Y.; Identification of degenerate motifs using position restricted selection and hybrid ranking combination; Nucleic Acids Res.: 2006; Volume 34 ,6379-6391.
[28] Vishnevsky, O.V.; Kolchanov, N.A.; ARGO: A web system for the detection of degenerate motifs and large-scale recognition of eukaryotic promoters; Nucleic Acids Res.: 2005; Volume 33 ,W417-W422.
[29] Chakravarty, A.; Carlson, J.M.; Khetani, R.S.; DeZiel, C.E.; Gross, R.H.; SPACER: Identification of cis-regulatory elements with non-contiguous critical residues; Bioinformatics: 2007; Volume 23 ,1029-1031.
[30] Wu, R.; Chaivorapol, C.; Zheng, J.; Li, H.; Liang, S.; fREDUCE: Detection of degenerate regulatory elements using correlation with expression; BMC Bioinforma.: 2007; Volume 8 .
[31] Wang, G.; Yu, T.; Zhang, W.; WordSpy: Identifying transcription factor binding motifs by building a dictionary and learning a grammar; Nucleic Acids Res.: 2005; Volume 33 ,W412-W416.
[32] Ukkonen, E.; Maximal and minimal representations of gapped and non-gapped motifs of a string; Theoret. Comput. Sci.: 2009; Volume 410 ,4341-4349. · Zbl 1187.68187
[33] Romer, K.; Kayombya, G.R.; Fraenkel, E.; WebMOTIFS: Automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches; Nucleic Acids Res.: 2007; Volume 35 ,W217-W220.
[34] Zhang, S.; Su, W.; Yang, J.; ARCS-Motif: Discovering correlated motifs from unaligned biological sequences; Bioinformatics: 2009; Volume 25 ,183-189.
[35] Coatney, M.; Parthasarathy, S.; MotifMiner: A General Toolkit for Efficiently Identifying Common Substructures in Molecules; Proceedings of the 3rd IEEE BIBE: ; ,336-340.
[36] Wijaya, E.; Yiu, S.M.; Son, N.T.; Kanagasabai, R.; Sung, W.K.; MotifVoter: A novel ensemble method for fine-grained integration of generic motif finders; Bioinformatics: 2008; Volume 24 ,2288-2295.
[37] Tompa, M.; Li, N.; Bailey, T.L.; Church, G.M.; Church, G.M.; Moor, B.D.; Eskin, E.; Favorov, A.V.; Frith, M.C.; Fu, Y.; Kent, W.J.; Assessing computational tools for the discovery of transcription factor binding sites; Nat. Biotechnol.: 2005; Volume 23 ,137-144.
[38] Edwards, R.J.; Davey, N.E.; Shields, D.C.; CompariMotif: Quick and easy comparisons of sequence motifs; Bioinformatics: 2008; Volume 24 ,1307-1309.
[39] Jiang, H.; Zhao, Y.; Chen, W.; Zheng, W.; Searching Maximal Degenerate Motifs Guided by a Compact Suffix Tree; Advances in Computational Biology: Berlin/Heidelberg, Germany 2010; Volume Volume 680 ,19-26.
[40] Edelman, G.M.; Gally, J.A.; Degeneracy and complexity in biological systems; Proc. Natl. Acad. Sci. USA: 2001; Volume 98 ,13763-13768.
[41] Shinozaki, D.; Akutsu, T.; Maruyama, O.; Finding optimal degenerate patterns in DNA sequences; Bioinformatics: 2003; Volume 19 ,206-214.
[42] Bailey, T.L.; Williams, N.; Misleh, C.; Li, W.W.; MEME: Discovering and analyzing DNA and protein sequence motifs; Nucleic Acids Res.: 2006; Volume 34 ,369-373.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.