
Identifying translation initiation sites in prokaryotes using support vector machine. (English) Zbl 1403.92066

Summary: Motivation: Gene identification in genomes has been a fundamental and long-standing task in bioinformatics and computational biology. Many computational methods have been developed to predict genes in prokaryotic genomes by identifying translation initiation sites (TISs) in transcript data. However, pseudo-TISs at the genome level cause these methods to produce a high number of false positive predictions. In addition, most existing tools use an unsupervised learning framework, whose predictive accuracy may depend on the specific organism chosen.
Results: In this paper, we present a supervised learning method, based on the support vector machine (SVM), to identify translation initiation sites at the genome level. The features are extracted from the sequence data by modeling the sequence segment around predicted TISs as a position-specific weight matrix (PSWM). We train the parameters of our SVM on well-constructed positive and negative TIS datasets. We then apply the method to recognize translation initiation sites in E. coli and B. subtilis, and validate it on two GC-rich bacterial genomes: Pseudomonas aeruginosa and Burkholderia pseudomallei K96243. We show that our method recognizes translation initiation sites accurately at the genome level, irrespective of GC content. Furthermore, we compare our method with four existing methods and demonstrate that it outperforms them in all four organisms.
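
The PSWM feature extraction described in the summary could be sketched roughly as follows. This is an illustration only, not the authors' implementation: the summary does not state the window length, the background model, the pseudocount, or how the matrix is turned into SVM inputs, so the uniform background, the pseudocount of 1, the toy sequences, and the function names `build_pswm` and `pswm_features` are all assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pswm(windows, pseudocount=1.0, background=None):
    """Estimate per-position log-odds weights from equal-length windows.

    Assumption: a uniform background and additive pseudocounts; the
    paper's actual estimation procedure is not given in the summary.
    """
    if not windows:
        raise ValueError("need at least one window")
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    background = background or {b: 0.25 for b in ALPHABET}
    total = len(windows) + pseudocount * len(ALPHABET)
    pswm = []
    for pos in range(length):
        counts = Counter(w[pos] for w in windows)
        pswm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background[b])
            for b in ALPHABET
        })
    return pswm


def pswm_features(window, pswm):
    """Map a window to one weight per position (a fixed-length vector).

    Bases outside ACGT (for example N) contribute a neutral weight of 0.
    """
    return [column.get(base, 0.0) for column, base in zip(pswm, window)]


# Toy windows around hypothetical true start codons (made-up data).
positives = ["AGGAGGTATG", "AGGAGCTATG", "AGGTGGAATG", "AAGAGGTATG"]
pswm = build_pswm(positives)

candidate = pswm_features("AGGAGGTATG", pswm)
decoy = pswm_features("CTTCTCGGTG", pswm)
candidate_score = sum(candidate)
decoy_score = sum(decoy)
```

Vectors such as `candidate` and `decoy`, computed for labeled positive and negative sites, would then serve as training inputs to an SVM; the kernel and its parameters would need to be chosen on validation data.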

MSC:

92C40 Biochemistry, molecular biology
68T05 Learning and adaptive systems in artificial intelligence

Software:

GeneMarkS
