Predicting deleterious non-synonymous single nucleotide polymorphisms in signal peptides based on hybrid sequence attributes. (English) Zbl 1244.92019

Summary: Signal peptides play a crucial role in various biological processes, such as localization of cell surface receptors, translocation of secreted proteins and cell-cell communication. However, amino acid mutations in signal peptides, which arise from non-synonymous single nucleotide polymorphisms (nsSNPs, also called SAPs), may lead to the loss of their functions. In the present study, a computational method based on random forests (RFs) was proposed for predicting deleterious nsSNPs in signal peptides; it incorporates position-specific scoring matrix (PSSM) profiles, the SignalP score and physicochemical properties. These features were optimized by the maximum relevance minimum redundancy (mRMR) method. A cost matrix was then used to reduce the effect of class imbalance, a problem that commonly occurs in nsSNP prediction. With an optimal subset of 10 features, the method achieved an overall accuracy of 84.5% and an area under the ROC curve (AUC) of 0.822 in a jackknife test. Furthermore, on the same data set, we compared our predictor with two existing methods, the R-score-based method and the D-score-based method, and our method outperformed both. The satisfactory performance suggests that our method is effective in predicting deleterious nsSNPs in signal peptides.
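
The evaluation protocol named in the summary (a jackknife test, a cost matrix for imbalanced classes, overall accuracy and AUC) can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid scorer stands in for the random forest, the two-feature data are synthetic, and the misclassification costs (`cost_fp`, `cost_fn`) are hypothetical values chosen only for illustration.

```python
import random

def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def score(train_X, train_y, x):
    """Stand-in scorer (NOT a random forest): pseudo-probability of class 1
    derived from the distances of x to the two class centroids."""
    c0 = centroid([r for r, lab in zip(train_X, train_y) if lab == 0])
    c1 = centroid([r for r, lab in zip(train_X, train_y) if lab == 1])
    d0, d1 = sq_dist(x, c0), sq_dist(x, c1)
    return d0 / (d0 + d1) if d0 + d1 > 0 else 0.5

def cost_threshold(cost_fp, cost_fn):
    """Decision threshold that minimizes expected cost for a 2x2 cost matrix
    with zero cost on correct decisions: predict 1 when
    p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)

def jackknife_scores(X, y):
    """Jackknife (leave-one-out) test: score each sample with a model
    built from all the other samples."""
    out = []
    for i in range(len(X)):
        out.append(score(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i]))
    return out

def auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs in which the
    positive sample scores higher; ties count as one half."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold):
    """Overall accuracy after thresholding the scores."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == lab for p, lab in zip(preds, labels)) / len(labels)

# Synthetic, imbalanced toy data (assumption): 40 negatives, 10 positives.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
X += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(10)]
y = [0] * 40 + [1] * 10

scores = jackknife_scores(X, y)
# Hypothetical costs: missing a minority-class sample is 4x as costly.
thr = cost_threshold(cost_fp=1.0, cost_fn=4.0)
acc = accuracy(scores, y, thr)
auc_val = auc(scores, y)
print(f"threshold={thr:.2f} accuracy={acc:.3f} AUC={auc_val:.3f}")
```

Raising `cost_fn` relative to `cost_fp` lowers the threshold, so more samples are assigned to the minority class. That is one simple way a cost matrix can counteract class imbalance; the AUC is unaffected because it does not depend on the threshold.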


92C40 Biochemistry, molecular biology
92C37 Cell biology
62P10 Applications of statistics to biology and medical sciences; meta analysis
92-08 Computational methods for problems pertaining to biology


Keywords: jackknife test

