The MM alternative to EM. (English) Zbl 1329.62106

Summary: The EM algorithm is a special case of a more general algorithm called the MM algorithm. Specific MM algorithms often have nothing to do with missing data. The first M step of an MM algorithm creates a surrogate function that is optimized in the second M step. In minimization, MM stands for majorize-minimize; in maximization, it stands for minorize-maximize. This two-step process always drives the objective function in the right direction. Construction of MM algorithms relies on recognizing and manipulating inequalities rather than calculating conditional expectations. This survey walks the reader through the construction of several specific MM algorithms. The potential of the MM algorithm in solving high-dimensional optimization and estimation problems is its most attractive feature. Our applications to random graph models, discriminant analysis and image restoration showcase this ability.


62F10 Point estimation
62-03 History of statistics
01A60 History of mathematics in the 20th century
62-02 Research exposition (monographs, survey articles) pertaining to statistics
Full Text: DOI arXiv Euclid


[1] Anderson, G. D., Vamanamurthy, M. K. and Vuorinen, M. (2007). Generalized convexity and inequalities. J. Math. Anal. Appl. 335 1294-1308. · Zbl 1125.26017
[2] Asuncion, A. and Newman, D. J. (2007). UCI Machine Learning Repository. Available at .
[3] Becker, M. P., Yang, I. and Lange, K. (1997). EM algorithms without missing data. Stat. Methods Med. Res. 6 38-54.
[4] Bergstrom, T. C. and Bagnoli, M. (2005). Log-concave probability and its applications. Econom. Theory 26 445-469. · Zbl 1077.60012
[5] Bijleveld, C. C. J. H. and de Leeuw, J. (1991). Fitting longitudinal reduced-rank regression models by alternating least squares. Psychometrika 56 433-447. · Zbl 0760.62062
[6] Bioucas-Dias, J. M., Figueiredo, M. A. T. and Oliveira, J. P. (2006). Total variation-based image deconvolution: a majorization-minimization approach. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2006 Proceedings 861-864.
[7] Blitzstein, J., Chatterjee, S. and Diaconis, P. (2008). A new algorithm for high dimensional maximum likelihood estimation. Technical report.
[8] Bohning, D. and Lindsay, B. G. (1988). Monotonicity of quadratic approximation algorithms. Ann. Inst. Statist. Math. 40 641-663. · Zbl 0723.65150
[9] Borg, I. and Groenen, P. (1997). Modern Multidimensional Scaling: Theory and Applications . Springer, New York. · Zbl 0862.62052
[10] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization . Cambridge Univ. Press. · Zbl 1058.90049
[11] Boyles, R. A. (1983). On the convergence of the EM algorithm. J. Roy. Statist. Soc. Ser. B 45 47-50. JSTOR: · Zbl 0508.62030
[12] de Leeuw, J. (1977). Applications of convex analysis to multidimensional scaling. In Recent Developments in Statistics (J. R. Barra, F. Brodeau, G. Romie and B. Van Cutsem, eds.) 133-145. North-Holland, Amsterdam. · Zbl 0372.62083
[13] de Leeuw, J. (1994). Block relaxation algorithms in statistics. In Information Systems and Data Analysis (H.-H. Bock, W. Lenski and M. M. Richter, eds.). Springer, Berlin. · Zbl 0829.65144
[14] de Leeuw, J. and Heiser, W. J. (1977). Convergence of correction matrix algorithms for multidimensional scaling. In Geometric Representations of Relational Data (J. C. Lingoes, E. Roskam and I. Borg, eds.). Mathesis Press, Ann Arbor, MI.
[15] de Leeuw, J. and Heiser, W. J. (1980). Multidimensional scaling with restriction on the configuration. In Multivariate Analysis - V: Proceeding of the Fifth International Symposium on Multivariate Analysis (P. R. Krishnaiah ed.) 501-522. North-Holland, Amsterdam. · Zbl 0468.62054
[16] de Leeuw, J. and Lange, K. (2009). Sharp quadratic majorization in one dimension. Comput. Statist. Data Anal. 53 2471-2484. · Zbl 1453.62078
[17] De Pierro, A. R. (1995). A modified expectation maximization algorithm for penalized likelihood estimation in emission tomography. IEEE Trans. Med. Imaging 14 132-137.
[18] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39 1-38. JSTOR: · Zbl 0364.62022
[19] Eldén, L. (2007). Matrix Methods in Data Mining and Pattern Recognition . SIAM, Philadelphia. · Zbl 1120.68092
[20] Friedman, J., Hastie, T. and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Statist. 1 302-332. · Zbl 1378.90064
[21] Groenen, P. J. F., Nalbantov, G. and Bioch, J. C. (2006). Nonlinear support vector machines through iterative majorization and I-splines. Studies in Classification, Data Analysis and Knowledge Organization (H. J. Lenz and R. Decker, eds.) 149-161. Springer, Berlin.
[22] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction . Springer, New York. · Zbl 0973.62007
[23] Heiser, W. J. (1987). Correspondence analysis with least absolute residuals. Comput. Statist. Data Anal. 5 337-356. · Zbl 0624.62052
[24] Heiser, W. J. (1995). Convergent computing by iterative majorization: theory and applications in multidimensional data analysis. In Recent Advances in Descriptive Multivariate Analysis (W. J. Krzanowski, ed.). Clarendon Press, Oxford.
[25] Huber, P. J. (1981). Robust Statistics . Wiley, New York. · Zbl 0536.62025
[26] Hunter, D. R. (2004). MM algorithms for generalized Bradley-Terry models. Ann. Statist. 32 384-406. · Zbl 1105.62359
[27] Hunter, D. R. and Lange, K. (2000). Quantile regression via an MM algorithm. J. Comput. Graph. Statist. 9 60-77. JSTOR:
[28] Hunter, D. R. and Lange, K. (2002). Computing estimates in the proportional odds model. Ann. Inst. Statist. Math. 54 155-168. · Zbl 0991.62080
[29] Hunter, D. R. and Lange, K. (2004). A tutorial on MM algorithms. Amer. Statist. 58 30-37. · Zbl 05680564
[30] Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Ann. Statist. 33 1617-1642. · Zbl 1078.62028
[31] Jamshidian, M. and Jennrich, R. I. (1997). Quasi-Newton acceleration of the EM algorithm. J. Roy. Statist. Soc. Ser. B 59 569-587. JSTOR: · Zbl 0889.62042
[32] Kent, J. T., Tyler, D. E. and Vardi, Y. (1994). A curious likelihood identity for the multivariate t -distribution. Comm. Statist. Simulation Comput. 23 441-453. · Zbl 0825.62035
[33] Kiers, H. A. L. (2002). Setting up alternating least squares and iterative majorization algorithms for solving various matrix optimization problems. Comput. Statist. Data Anal. 41 157-170. · Zbl 1018.65074
[34] Kiers, H. A. L. and Ten Berge, J. M. F. (1992). Minimization of a class of matrix trace functions by means of refined majorization. Psychometrika 57 371-382. · Zbl 0782.62067
[35] Lange, K. (1994). An adaptive barrier method for convex programming. Methods Appl. Anal. 1 392-402. · Zbl 0835.90068
[36] Lange, K. (1995a). A gradient algorithm locally equivalent to the EM algorithm. J. Roy. Statist. Soc. Ser. B 57 425-437. JSTOR: · Zbl 0813.62021
[37] Lange, K. (1995b). A quasi-Newton acceleration of the EM algorithm. Statist. Sinica 5 1-18. · Zbl 0824.62017
[38] Lange, K. (2004). Optimization . Springer, New York. · Zbl 1140.90004
[39] Lange, K. and Fessler, J. A. (1994). Globally convergent algorithms for maximum a posteriori transmission tomography. IEEE Trans. Image Process. 4 1430-1438.
[40] Lange, K., Hunter, D. R. and Yang, I. (2000). Optimization transfer using surrogate objective functions (with discussion). J. Comput. Graph. Statist. 9 1-20. JSTOR:
[41] Lange, K., Little, R. J. A. and Taylor, J. M. G. (1989). Robust statistical modeling using the t distribution. J. Amer. Statist. Assoc. 84 881-896. JSTOR:
[42] Lange, K. and Wu, T. T. (2008). An MM algorithm for multicategory vertex discriminant analysis. J. Comput. Graph. Statist. 17 1-18.
[43] Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401 788-791. · Zbl 1055.81054
[44] Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. Adv. Neural Inform. Process. Syst. 13 556-562.
[45] Liao, W. H., Huang, S. C., Lange, K. and Bergsneider, M. (2002). Use of MM algorithm for regularization of parametric images in dynamic PET. In Brain Imaging Using PET (M. Senda, Y. Kimura, P. Herscovitch and Y. Kimura, eds.). Academic Press, New York.
[46] Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data , 2nd ed. Wiley, New York. · Zbl 1011.62004
[47] Marshall, A. W. and Olkin, I. (1979). Inequalities: Theory of Majorization and Its Applications . Academic Press, San Diego. · Zbl 0437.26007
[48] McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions . Wiley, New York. · Zbl 0882.62012
[49] Meilijson, I. (1989). A fast improvement to the EM algorithm on its own terms. J. Roy. Statist. Soc. B 51 127-138. JSTOR: · Zbl 0674.65118
[50] Meng, X. L. and Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. J. Amer. Statist. Assoc. 86 899-909.
[51] Meng, X. L. and van Dyk, D. (1997). The EM algorithm-an old folk-song sung to a fast new tune. J. Roy. Statist. Soc. Ser. B 59 511-567. JSTOR: · Zbl 1090.62518
[52] Ortega, J. M. (1990). Numerical Analysis: A Second Course . SIAM, Philadelphia. · Zbl 0701.65002
[53] Ortega, J. M. and Rheinboldt, W. C. (1970). Iterative Solutions of Nonlinear Equations in Several Variables . Academic Press, New York. · Zbl 0241.65046
[54] Pauca, V. P., Piper, J. and Plemmons, R. J. (2006). Nonnegative matrix factorization for spectral data analysis. Linear Algebra Appl. 416 29-47. · Zbl 1096.65033
[55] Rao, C. R. (1973). Linear Statistical Inference and Its Applications , 2nd ed. Wiley, New York. · Zbl 0256.62002
[56] Sabatti, C. and Lange, K. (2002). Genomewide motif identification using a dictionary model. Proceedings IEEE 90 1803-1810.
[57] Scholkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond . MIT Press, Cambridge.
[58] Steele, J. M. (2004). The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities . Cambridge Univ. Press and Math. Assoc. Amer., Washington, DC. · Zbl 1060.26023
[59] Takane, Y., Young, F. W. and de Leeuw, J. (1977). Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features. Psychometrika 42 7-67. · Zbl 0354.92048
[60] Vaida, F. (2005). Parameter convergence for EM and MM algorithms. Statist. Sinica 15 831-840. · Zbl 1087.62035
[61] Vapnik, V. (1995). The Nature of Statistical Learning Theory . Springer, New York. · Zbl 0833.62008
[62] Varadhan, R. and Roland, C. (2008). Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scand. J. Statist. 35 335-353. · Zbl 1164.65006
[63] Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Ann. Statist. 11 95-103. · Zbl 0517.62035
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.