zbMATH — the first resource for mathematics

A classification-based prediction model of messenger RNA polyadenylation sites. (English) Zbl 07020753
Summary: Messenger RNA polyadenylation is one of the essential processing steps during eukaryotic gene expression. The site of polyadenylation [poly(A) site] marks the end of a transcript, which is also the end of a gene. A computational program that is able to recognize poly(A) sites would prove useful not only for genome annotation in finding gene ends, but also for predicting alternative poly(A) sites. Features that define the poly(A) sites can now be extracted from the poly(A) site datasets to build such predictive models. Using methods including \(K\)-gram pattern, \(Z\)-curve, position-specific scoring matrix and first-order inhomogeneous Markov sub-model, numerous features were generated and placed in an original feature space. To select the most useful features, attribute selection algorithms, such as information gain and entropy, were employed. A training model was then built based on the Bayesian network to determine a subset of the optimal features. Test models corresponding to the training models were built to predict poly(A) sites in Arabidopsis and rice. Thus, a prediction model, termed Poly(A) site classifier, or PAC, was constructed. The uniqueness of the model lies in its structure: each sub-model can be replaced or expanded, while feature generation, selection and classification are all independent processes. Its modular design makes it easily adaptable to different species or datasets. The algorithm's high specificity and sensitivity were demonstrated by testing several datasets and, at the best combinations, both reached 95%. The software package may be used for genome annotation and optimizing transgene structure.
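As a rough illustration of three of the techniques named in the summary, the following Python sketch computes \(K\)-gram counts, cumulative \(Z\)-curve coordinates, and the information gain of a discrete feature. It is not taken from the PAC software: the function names (`kgram_counts`, `z_curve`, `information_gain`) are hypothetical, the input is assumed to be an uppercase sequence over A, C, G, T, and the \(Z\)-curve uses the standard cumulative-count definition, which may differ from the exact variant used in the paper.

```python
from collections import Counter
from math import log2


def kgram_counts(seq, k):
    """Count overlapping K-grams (substrings of length k) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def z_curve(seq):
    """Return cumulative Z-curve coordinates (x, y, z) after each base.

    x: purine (A, G) minus pyrimidine (C, T) counts
    y: amino (A, C) minus keto (G, T) counts
    z: weak (A, T) minus strong (C, G) hydrogen-bond counts
    Assumes seq contains only the characters A, C, G, T.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq:
        counts[base] += 1
        a, c, g, t = (counts[b] for b in "ACGT")
        points.append(((a + g) - (c + t), (a + c) - (g + t), (a + t) - (c + g)))
    return points


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((m / n) * log2(m / n) for m in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

In a pipeline of the kind the summary describes, features such as these would be generated for each candidate site, ranked by a selection score such as information gain, and the retained subset passed to a classifier; here each step is a separate function, mirroring the stated independence of feature generation and selection.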

MSC:
92D20 Protein sequences, DNA sequences