×

An efficient algorithm to identify DNA motifs. (English) Zbl 1319.68261

Summary: We consider the problem of identifying motifs that abstracts the task of finding short conserved sites in genomic DNA. The planted \((l, d)\)-motif problem, PMP, is the mathematical abstraction of this problem, which consists of finding a substring of length \(l\) that occurs in each \(s_{i}\) in a set of input sequences \(S~=~\{s_{1}\), \(s_{2}, \dots, s_{t}\}\) with at most \(d\) substitutions. Our propose algorithm combines the voting algorithm and pattern matching algorithm to find exact motifs. The combined algorithm is achieved by running the voting algorithm on \(t'\) sequences, \(t'<t\). After that we use the pattern matching on the output of the voting algorithm and the reminder sequences, \(t - t'\). Two values of \(t'\) are calculated. The first value of \(t'\) makes the running time of our proposed algorithm less than the running time of voting algorithm. The second value of \(t'\) makes the running time of our proposed algorithm is minimal. We show that our proposed algorithm is faster than the voting algorithm by testing both algorithms on simulated data from \((9, d \leq 2)\) to \((19, d \leq 7)\). Finally, we test the performance of the combined algorithm on realistic biological data.

MSC:

68W32 Algorithms on strings
92D20 Protein sequences, DNA sequences

Software:

PMS5; TEIRESIAS; TRANSFAC
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Abbas M., Abouelhoda M., Bahig H.: A hybrid method for the exact planted (l, d)-motif finding problem and its parallelization. BMC Bioinforma. 13(Suppl. 17), S10 (2012) · doi:10.1186/1471-2105-13-S17-S10
[2] Bailey T., Elkan C.: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Mach. Learn. 21, 51–80 (1995)
[3] Bandyopadhyay, S., Sahni, S., Rajasekaran, S.: PMS6: A faster algorithm for motif discovery. In: Proc. ICCABS 2012, pp. 1–6 (2012)
[4] Blanchette, M.: Algorithms for phylogenetic footprinting. In: Proc. RECOMB’01, pp. 49–58 (2001)
[5] Blanchette M., Tompa M.: Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res. 12(5), 739–748 (2002) · doi:10.1101/gr.6902
[6] Brazma A., Jonassen I., Vilo J., Ukkonen E.: Predicting gene regulatory elements in silico on a genomic scale. Genome Res. 15, 1202–1215 (1998)
[7] Buhler J., Tompa M.: Finding motifs using random projections. J. Comput. Biol. 9(2), 225–242 (2002) · doi:10.1089/10665270252935430
[8] Chin, F., Leung, H.: Voting algorithms for discovering long motifs. In: Proc. APBC 2005, pp. 261–271 (2005)
[9] Davila, J., Balla, S., Rajasekaran, S.: Space and time efficient algorithms for planted motif search. In: Proc. IWBRA 2006, LNCS 3992, pp. 822–829 (2006) · Zbl 1155.92313
[10] Davila J., Balla S., Rajasekaran S.: Fast and practical algorithms for planted (l, d)-motif search. IEEE/ACM Trans. Comput. Biol. Bioinforma. 4(4), 544–552 (2007) · Zbl 05335801 · doi:10.1109/TCBB.2007.70241
[11] Dinh H., Rajasekaran S., Kundeti V.: PMS5: an efficient exact algorithm for the (l, d)-motif finding problem. BMC Bioinforma. 12, 410–420 (2011) · doi:10.1186/1471-2105-12-410
[12] Galas D., Eggert M., Waterman M.: Rigorous pattern-recognition methods for DNA sequences: analysis of promoter sequences from Escherichia coli. J. Mol. Biol. 186(1), 117–128 (1985) · doi:10.1016/0022-2836(85)90262-1
[13] Hertz G., Stormo G.: Identification of consensus patterns in unaligned DNA and protein sequences: a large-deviation statistical basis for penalizing gaps. In: Lim, H., Cantor, C. (eds.) Bioinformatics and Genome Research, pp. 201–216. World Scientific, Singapore (1995)
[14] Lawrence C., Altschul S., Boguski M., Liu J., Neuwald A., Wootton J.: Detecting subtule sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208–214 (1993) · doi:10.1126/science.8211139
[15] Leung, H., Chin, F.: Finding exact optimal motif in matrix representation by partitioning. Bioinformatics 21(Supp. 2), ii-86–ii92 (2005)
[16] Ono, H., Ng, Y.: Best fiting-length substring patterns for a set of string. In: Proc. COCOON 2005, LNCS 3595, pp. 240–250 (2005) · Zbl 1128.68365
[17] Pevzner, P., Sze, S.H.: Combinatorial approaches to finding subtle signals in DNA sequences. In: Proc. ISMB 2000. The AAAI Press, Menlo Park, pp. 269–278 (2000)
[18] Rajasekaran S.: Algorithms for motif search. In: Aluru, S. (ed.) Handbook of Computational Molecular Biology, pp. 37-1–37-21. Chapman and Hall/CRC, London (2006)
[19] Rajasekaran S., Balla S., Huang C.-H.: Exact algorithms for planted motif problems. J. Comput. Biol. 12(8), 1117–1128 (2005) · doi:10.1089/cmb.2005.12.1117
[20] Rigoutsos I., Floratos A.: Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics 14(1), 55–67 (1998) · doi:10.1093/bioinformatics/14.1.55
[21] Sagot, M.: Spelling approximate repeated or common motifs using a suffix tree. In: Lucchesi, L., Moura, A. (eds.), Proc. Latin’98, LNCS 1380, pp 111–127 (1998)
[22] Sinha, S., Tompa, M.: A statistical method for finding transcription factor binding sites. In: Proc. ISMB 2000. The AAAI Press, Menlo Park, pp. 344–354 (2000)
[23] Tompa, M.: An exact method for finding short motifs in sequences with application to the ribosome binding site problem. In: Proc. ISMB 1999. The AAAI Press, Menlo Park, pp. 262–271 (1999)
[24] Wingender E., Dietze P., Karas H., Knppel R.: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res. 24(1), 238–241 (1996) · Zbl 05437329 · doi:10.1093/nar/24.1.238
[25] Zhu J., Zhang M.: SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics 15(7–8), 607–611 (1999) · doi:10.1093/bioinformatics/15.7.607
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.