CUR matrix decompositions for improved data analysis.

*(English)*Zbl 1202.68480Summary: Principal components analysis and, more generally, the Singular Value Decomposition are fundamental data analysis tools that express a data matrix in terms of a sequence of orthogonal or uncorrelated vectors of decreasing importance. Unfortunately, being linear combinations of up to all the data points, these vectors are notoriously difficult to interpret in terms of the data and processes generating the data. In this article, we develop CUR matrix decompositions for improved data analysis. CUR decompositions are low-rank matrix decompositions that are explicitly expressed in terms of a small number of actual columns and/or actual rows of the data matrix. Because they are constructed from actual data elements, CUR decompositions are interpretable by practitioners of the field from which the data are drawn (to the extent that the original data are). We present an algorithm that preferentially chooses columns and rows that exhibit high “statistical leverage” and, thus, in a very precise statistical sense, exert a disproportionately large “influence” on the best low-rank fit of the data matrix. By selecting columns and rows in this manner, we obtain improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work. In addition, since the construction involves computing quantities with a natural and widely studied statistical interpretation, we can leverage ideas from diagnostic regression analysis to employ these matrix decompositions for exploratory data analysis.

##### MSC:

68W20 | Randomized algorithms |

65F15 | Numerical computation of eigenvalues and eigenvectors of matrices |

68W25 | Approximation algorithms |

##### Keywords:

randomized algorithms; singular value decomposition; principal components analysis; interpretation; statistical leverage##### Software:

Algorithm 844
PDF
BibTeX
Cite

\textit{M. W. Mahoney} and \textit{P. Drineas}, Proc. Natl. Acad. Sci. USA 106, No. 3, 697--702 (2009; Zbl 1202.68480)

Full Text:
DOI

##### References:

[1] | Cho, Molecular cell 2 (1) pp 65– (1998) |

[2] | NUMER MATH 83 pp 313– (1999) · Zbl 0957.65031 |

[3] | LINEAR ALGEBRA AND ITS APPLICATIONS 261 pp 1– (1997) · Zbl 0877.65021 |

[4] | CONTEMP MATH 280 pp 47– (2001) |

[5] | J ACM 51 pp 1025– (2004) · Zbl 1125.65005 |

[6] | SIAM J COMPUT 36 pp 184– (2006) · Zbl 1111.68149 |

[7] | Genome Research 17 (1) pp 96– (2007) |

[8] | SIAM J MATRIX ANAL APPL 30 pp 844– (2008) · Zbl 1183.68738 |

[9] | The American Statistician 32 pp 17– (1978) |

[10] | 1 pp 379– (1986) |

[11] | The American Statistician 35 pp 234– (1981) |

[12] | Journal of the American Society for Information Science 41 pp 391– (1990) |

[13] | PNAS 101 (suppl_1) pp 5214– (2004) |

[14] | Alter, PNAS 97 (18) pp 10101– (2000) |

[15] | PNAS 97 (15) pp 8409– (2000) |

[16] | Nielsen, Lancet 359 (9314) pp 1301– (2002) |

[17] | AM J POLITICAL SCI 35 pp 228– (1991) |

[18] | PNAS 102 (20) pp 7057– (2005) |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.