Prediction of S-sulfenylation sites using mRMR feature selection and fuzzy support vector machine algorithm. (English) Zbl 1406.92190

Summary: Cysteine S-sulfenylation is an important protein post-translational modification that plays a crucial role in transcriptional regulation, cell signaling, and protein function. To better elucidate the molecular mechanism of S-sulfenylation, it is important to identify S-sulfenylated substrates and their corresponding S-sulfenylation sites accurately. In this study, a novel bioinformatics tool named Sulf\(_-\)FSVM is proposed to predict S-sulfenylation sites by combining multiple feature extraction with a fuzzy support vector machine algorithm. On the one hand, amino acid factors, binary encoding, and the composition of k-spaced amino acid pairs are incorporated to encode S-sulfenylation sites, and the maximum relevance minimum redundancy (mRMR) method is adopted to remove redundant features. On the other hand, a fuzzy support vector machine algorithm is used to handle the class imbalance and noise problems in the S-sulfenylation site training dataset. In 10-fold cross-validation, Sulf\(_-\)FSVM achieves satisfactory performance, with a sensitivity of 73.26%, a specificity of 70.78%, an accuracy of 71.07% and a Matthews correlation coefficient of 0.2971. Independent tests also show that Sulf\(_-\)FSVM significantly outperforms existing S-sulfenylation site predictors. Therefore, Sulf\(_-\)FSVM can be a useful tool for accurate prediction of protein S-sulfenylation sites.
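
As an illustration of one of the encodings named in the summary, the composition of k-spaced amino acid pairs can be sketched as follows. This is a minimal sketch, not the authors' implementation: the summary does not state the fragment length, the range of k, or the alphabet, so the 21-symbol alphabet (20 amino acids plus "X" for padding), the default `k_max=4`, and the function name `cksaap` are all assumptions made for the example.

```python
from itertools import product

# Assumed alphabet: 20 standard amino acids plus "X" as a padding symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
# All 441 ordered residue pairs, in a fixed order that defines the feature layout.
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]


def cksaap(fragment, k_max=4):
    """Encode a peptide fragment as k-spaced pair frequencies for k = 0..k_max.

    For each k, every pair of residues separated by exactly k positions is
    counted, and the counts are divided by the number of such pairs in the
    fragment. The result has 441 * (k_max + 1) entries.
    """
    features = []
    for k in range(k_max + 1):
        n_windows = len(fragment) - k - 1  # pairs separated by k residues
        if n_windows <= 0:
            raise ValueError(f"fragment too short for k = {k}")
        counts = dict.fromkeys(PAIRS, 0)
        for i in range(n_windows):
            pair = fragment[i] + fragment[i + k + 1]
            if pair not in counts:
                raise ValueError(f"unexpected residue in pair {pair!r}")
            counts[pair] += 1
        features.extend(counts[p] / n_windows for p in PAIRS)
    return features


# Toy fragment, purely illustrative (not taken from the paper's dataset).
vector = cksaap("ACACA", k_max=1)
print(len(vector))
```

Each block of 441 values sums to 1, so blocks for different k are on the same scale. A vector of this kind would then be passed to a feature-selection step such as mRMR before training the classifier.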


92C40 Biochemistry, molecular biology
68T05 Learning and adaptive systems in artificial intelligence

