ModEnzA: accurate identification of metabolic enzymes using function specific profile HMMs with optimised discrimination threshold and modified emission probabilities. (English) Zbl 1219.92019
Summary: Various enzyme identification protocols based on homology transfer through sequence-sequence or profile-sequence comparison have been devised; they use Swiss-Prot sequences associated with EC numbers as the training set. A profile HMM constructed for a particular EC number may nevertheless select sequences that perform a different enzymatic function, because certain fold-specific residues are conserved in enzymes sharing a common fold. We describe a protocol, ModEnzA (HMM-ModE Enzyme Annotation), which incorporates information from negative training sequences to generate profile HMMs that are highly specific at the functional level defined by the EC numbers. To increase sensitivity, we enrich the training data set by mining sequences from the NCBI non-redundant database. We compare our method with other enzyme identification methods, both for assigning EC numbers to a genome and for identifying protein sequences associated with an enzymatic activity. We report a sensitivity of 88% and a specificity of 95% in identifying EC numbers and annotating enzymatic sequences from the E. coli genome, a performance higher than that of any other method compared. With next-generation sequencing methods producing huge amounts of sequence data, the development and use of fully automated yet accurate protocols such as ModEnzA is warranted for the rapid annotation of newly sequenced genomes and metagenomic sequences.
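The idea of an optimised discrimination threshold, and the sensitivity and specificity figures quoted above, can be illustrated with a minimal sketch. It assumes that each sequence already has a profile HMM score, that a sequence is accepted when its score is at or above the threshold, and that the threshold is chosen to maximise the Matthews correlation coefficient over positive and negative training scores. That selection criterion, the function names, and the toy scores are assumptions made for illustration; the summary does not state how ModEnzA selects its threshold.

```python
from math import sqrt


def confusion(pos_scores, neg_scores, threshold):
    """Count (tp, fp, tn, fn) when scores >= threshold are accepted."""
    tp = sum(s >= threshold for s in pos_scores)
    fn = len(pos_scores) - tp
    fp = sum(s >= threshold for s in neg_scores)
    tn = len(neg_scores) - fp
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_threshold(pos_scores, neg_scores):
    """Pick the observed score that maximises MCC (assumed criterion)."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    return max(candidates,
               key=lambda t: mcc(*confusion(pos_scores, neg_scores, t)))


def sensitivity(tp, fn):
    """Fraction of true members that are accepted."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-members that are rejected."""
    return tn / (tn + fp)


# Hypothetical scores: members of one EC class versus sequences that
# share the fold but carry a different EC number.
positives = [120.0, 95.5, 80.2, 75.0, 30.0]
negatives = [70.3, 40.0, 35.5, 10.2, 5.0]

threshold = best_threshold(positives, negatives)
tp, fp, tn, fn = confusion(positives, negatives, threshold)
print(threshold, sensitivity(tp, fn), specificity(tn, fp))
```

In this toy data one true member scores below several same-fold negatives, so the chosen threshold gives up that member in order to reject every negative, which is the sensitivity versus specificity trade-off the summary refers to.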
92C40 Biochemistry, molecular biology
92-08 Computational methods for problems pertaining to biology