A non-negative matrix factorization framework for identifying modular patterns in metagenomic profile data. (English) Zbl 1252.92023

Summary: Metagenomic studies sequence DNA directly from environmental samples to explore the structure and function of complex microbial and viral communities. Individual short pieces of sequenced DNA (“reads”) are classified into (putative) taxonomic or metabolic groups, which are analyzed for patterns across samples. Analysis of such read matrices is at the core of using metagenomic data to make inferences about ecosystem structure and function. Non-negative matrix factorization (NMF) is a numerical technique for approximating high-dimensional data points as non-negative linear combinations of non-negative components. It is thus well suited to interpretation of observed samples as combinations of different components.
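As an illustration of the factorization described above, the following is a minimal sketch of NMF using the multiplicative update rules of Lee and Seung for the Frobenius-norm objective. It is not taken from the paper: the function name `nmf`, its parameters, and the toy 6 x 5 matrix are assumptions made only for this example.

```python
import numpy as np


def nmf(V, k, n_iter=2000, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (m x n) as W @ H.

    W (m x k) holds k non-negative components; H (k x n) holds the
    non-negative weights with which each column of V combines them.
    Hypothetical helper written for illustration only.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Strictly positive random start so the updates are well defined.
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy "read matrix": 6 functional groups (rows) x 5 samples (columns),
# constructed as a mixture of two hidden non-negative components.
true_W = np.array([[5, 0], [4, 0], [3, 1], [0, 4], [0, 5], [1, 3]], dtype=float)
true_H = np.array([[1.0, 0.8, 0.5, 0.1, 0.0],
                   [0.0, 0.2, 0.5, 0.9, 1.0]])
V = true_W @ true_H

W, H = nmf(V, k=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("relative reconstruction error:", rel_err)
```

Each column of `H` can then be read as the weights with which a sample combines the components stored in the columns of `W`. The recovered factors are in general only determined up to scaling and permutation of the components.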
We develop, test and apply an NMF-based framework to analyze metagenomic read matrices. In particular, we introduce a method for choosing the NMF degree in the presence of overlap, and apply spectral-reordering techniques to NMF-based similarity matrices to aid visualization. We show that our method can robustly identify the appropriate degree and disentangle overlapping contributions using synthetic data sets. We then examine and discuss the NMF decomposition of a metabolic profile matrix extracted from 39 publicly available metagenomic samples, and identify canonical sample types, including one associated with coral ecosystems, one associated with highly saline ecosystems and others. We also identify specific associations between pathways and canonical environments, and explore how alternative choices of decompositions facilitate analysis of read matrices at a finer scale.
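The summary does not specify which spectral-reordering procedure is used. As one common variant, the sketch below orders the rows and columns of a symmetric similarity matrix by the Fiedler vector (the eigenvector of the graph Laplacian with the second-smallest eigenvalue), so that similar items end up adjacent. The function name `spectral_order` and the toy similarity matrix are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def spectral_order(S):
    """Return an ordering of items from a symmetric similarity matrix S.

    Items are sorted by their entry in the Fiedler vector of the
    unnormalized graph Laplacian L = D - S. Hypothetical helper.
    """
    S = (S + S.T) / 2.0              # enforce symmetry
    L = np.diag(S.sum(axis=1)) - S   # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)      # eigenvalues returned in ascending order
    fiedler = vecs[:, 1]             # eigenvector of the 2nd-smallest eigenvalue
    return np.argsort(fiedler)


# Toy similarity matrix: six samples from two groups, listed interleaved.
labels = np.array([0, 1, 0, 1, 0, 1])
S = np.where(labels[:, None] == labels[None, :], 1.0, 0.1)

order = spectral_order(S)
reordered = S[np.ix_(order, order)]  # block structure now lies along the diagonal
print(labels[order])
```

After reordering, samples from the same group occupy contiguous rows and columns, so the block structure becomes visible when the matrix is drawn as a heat map. Which group comes first depends on the arbitrary sign of the eigenvector.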

MSC:

92C40 Biochemistry, molecular biology
15A23 Factorization of matrices
92D40 Ecology
65C20 Probabilistic models, generic numerical methods in probability and statistics

Software:

ONE; RAST; MEGAN; NMF; bioNMF
