Prediction of Golgi-resident protein types using general form of Chou’s pseudo-amino acid compositions: approaches with minimal redundancy maximal relevance feature selection. (English) Zbl 1343.92378

Summary: Recently, several efforts have been made to predict Golgi-resident proteins. However, identifying the type of a Golgi-resident protein remains a challenging task. Precise prediction of the type of a Golgi-resident protein plays a key role in understanding its molecular functions in various biological processes. In this paper, we propose a mutual-information-based feature selection scheme, combined with the general form of Chou’s pseudo-amino acid compositions, to predict Golgi-resident protein types. Positional specific physicochemical properties are incorporated into Chou’s pseudo-amino acid compositions. With 49 selected features, we achieve 91.24% prediction accuracy in a jackknife test, the best performance among the existing predictors. This result indicates that our computational model can be useful in identifying Golgi-resident protein types.
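
The mutual-information-based selection named in the summary (minimal redundancy maximal relevance, mRMR) can be illustrated with a minimal sketch. This is not the authors' code: it assumes the features have already been discretized, it uses the common "relevance minus mean redundancy" greedy criterion, and the function names `mutual_information` and `mrmr_select` are hypothetical. The number of features `k` is a free parameter here (the summary reports 49).

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedy mRMR sketch: pick the feature maximizing relevance - mean redundancy.

    features: list of columns, each a list of discrete values (one per sample).
    labels:   list of class labels (one per sample).
    Returns the indices of up to k selected columns, in selection order.
    """
    relevance = [mutual_information(col, labels) for col in features]
    selected = []
    remaining = set(range(len(features)))

    def score(j):
        # Relevance to the class label, penalized by the average
        # mutual information with the features already selected.
        if not selected:
            return relevance[j]
        redundancy = sum(
            mutual_information(features[j], features[s]) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up for illustration): the label is determined by columns 0 and 2;
# column 1 duplicates column 0, and column 3 is unrelated to the label.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
noise = [0, 1, 0, 1, 0, 1, 0, 1]
toy_labels = [2 * x + y for x, y in zip(a, b)]
toy_features = [a, list(a), b, noise]
print(mrmr_select(toy_features, toy_labels, 2))
```

On this toy input the duplicate column is skipped in favor of the complementary one, which is the redundancy penalty at work. In a setting like the one summarized above, the selected columns would then be passed to a classifier and evaluated with a jackknife (leave-one-out) test.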


92D20 Protein sequences, DNA sequences
92C40 Biochemistry, molecular biology
92-04 Software, source code, etc. for problems pertaining to biology

