zbMATH — the first resource for mathematics

Robust feature generation for protein subchloroplast location prediction with a weighted GO transfer model. (English) Zbl 1412.92187
Summary: Chloroplasts are crucial organelles of green plants and eukaryotic algae because they carry out photosynthesis. Predicting the subchloroplast location of a protein can provide important insights into its biological function. The performance of subchloroplast location prediction algorithms often depends on deriving predictive and succinct features from genomic and proteomic data. In this work, a novel weighted gene ontology (GO) transfer model is proposed to generate discriminating features from sequence data and GO categories. The model has two components. First, the GO terms of a homologous protein are transferred, and the bit-score is assigned as the weight of the resulting GO features. Second, term-selection methods are used to determine weights for the individual GO terms. The model can improve prediction accuracy because it tolerates the noise introduced by homolog knowledge transfer. The proposed weighted GO transfer method based on the bit-score and a logarithmic transformation of the chi-square statistic (WS-LCHI) performs better than the baseline models and also outperforms four off-the-shelf subchloroplast prediction methods.
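The two components described in the summary can be illustrated with a minimal Python sketch. The summary does not give the actual formulas, so everything here is an assumption made for illustration: the chi-square statistic is computed from a 2x2 term/class contingency table, the logarithmic transformation is taken to be log(1 + chi-square) of the maximum over classes, and the final feature value is the product of the bit-score and the term weight. The function names, the term identifiers TERM_A and TERM_B, and the location labels are hypothetical.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 contingency table.

    n11: proteins with the term and in the class
    n10: proteins with the term, not in the class
    n01: proteins without the term, in the class
    n00: proteins without the term, not in the class
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # degenerate table: the term carries no class information
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def term_weights(training):
    """Assumed term-selection step: weight = log(1 + max chi-square over classes).

    training: list of (set_of_terms, label) pairs.
    """
    labels = {label for _, label in training}
    terms = set().union(*(t for t, _ in training))
    weights = {}
    for term in terms:
        best = 0.0
        for label in labels:
            n11 = sum(1 for t, l in training if term in t and l == label)
            n10 = sum(1 for t, l in training if term in t and l != label)
            n01 = sum(1 for t, l in training if term not in t and l == label)
            n00 = sum(1 for t, l in training if term not in t and l != label)
            best = max(best, chi_square(n11, n10, n01, n00))
        weights[term] = math.log1p(best)
    return weights


def weighted_go_features(homolog_terms, bit_score, weights, vocabulary):
    """Assumed transfer step: terms of the best homolog, scaled by the bit-score."""
    return [
        bit_score * weights.get(term, 0.0) if term in homolog_terms else 0.0
        for term in vocabulary
    ]


# Toy data with hypothetical term identifiers and location labels.
training = [
    ({"TERM_A"}, "thylakoid"),
    ({"TERM_A"}, "thylakoid"),
    ({"TERM_B"}, "stroma"),
    ({"TERM_B"}, "stroma"),
]
weights = term_weights(training)
features = weighted_go_features({"TERM_A"}, 100.0, weights, ["TERM_A", "TERM_B"])
```

In this sketch, a term that a homolog transfers by mistake contributes little if its chi-square weight is low, and a weak alignment scales down every transferred term through its small bit-score. This is one plausible reading of how such a model could tolerate noise from homolog transfer, not a statement of the authors' implementation.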
92C80 Plant biology
92C40 Biochemistry, molecular biology