An Improved Approximation Algorithm for the Column Subset Selection Problem

Data Structures and Algorithms 2015-03-13 v2

Abstract

We consider the problem of selecting the best subset of exactly $k$ columns from an $m \times n$ matrix $A$. We present and analyze a novel two-stage algorithm that runs in $O(\min\{mn^2, m^2n\})$ time and returns as output an $m \times k$ matrix $C$ consisting of exactly $k$ columns of $A$. In the first (randomized) stage, the algorithm randomly selects $\Theta(k \log k)$ columns according to a judiciously-chosen probability distribution that depends on information in the top-$k$ right singular subspace of $A$. In the second (deterministic) stage, the algorithm applies a deterministic column-selection procedure to select and return exactly $k$ columns from the set of columns selected in the first stage. Let $C$ be the $m \times k$ matrix containing those $k$ columns, let $P_C$ denote the projection matrix onto the span of those columns, and let $A_k$ denote the best rank-$k$ approximation to the matrix $A$. Then, we prove that, with probability at least 0.8,

$$ \|A - P_C A\|_F \leq \Theta(k \log^{1/2} k) \, \|A - A_k\|_F. $$

This Frobenius norm bound is only a factor of $\sqrt{k \log k}$ worse than the best previously existing existential result and is roughly $O(\sqrt{k!})$ better than the best previous algorithmic result for the Frobenius norm version of this Column Subset Selection Problem (CSSP). We also prove that, with probability at least 0.8,

$$ \|A - P_C A\|_2 \leq \Theta(k \log^{1/2} k) \, \|A - A_k\|_2 + \Theta(k^{3/4} \log^{1/4} k) \, \|A - A_k\|_F. $$

This spectral norm bound is not directly comparable to the best previously existing bounds for the spectral norm version of this CSSP. Our bound depends on $\|A - A_k\|_F$, whereas previous results depend on $\sqrt{n-k} \, \|A - A_k\|_2$; if these two quantities are comparable, then our bound is asymptotically worse by a $(k \log k)^{1/4}$ factor.
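The two-stage structure described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the sampling distribution uses normalized leverage scores of the top-$k$ right singular vectors (the abstract says only that the distribution depends on that subspace), the deterministic stage is replaced by a simple greedy column-pivoting rule, and the names `select_columns`, `greedy_pivot`, `projection_error`, `best_rank_k_error`, and the `oversample` constant are hypothetical.

```python
import numpy as np


def greedy_pivot(M, k):
    """Pick k distinct column positions of M, largest residual norm first.

    Assumption: a stand-in for the deterministic column-selection stage.
    """
    R = np.array(M, dtype=float)
    available = np.ones(R.shape[1], dtype=bool)
    picked = []
    for _ in range(k):
        norms = np.einsum("ij,ij->j", R, R)
        norms = np.where(available, norms, -1.0)  # never pick a column twice
        j = int(np.argmax(norms))
        picked.append(j)
        available[j] = False
        if norms[j] > 1e-12:
            q = R[:, j] / np.sqrt(norms[j])
            R -= np.outer(q, q @ R)  # remove the chosen direction
    return picked


def select_columns(A, k, rng=None, oversample=4.0):
    """Return sorted indices of exactly k columns of A (illustrative sketch)."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    if not 1 <= k <= min(m, n):
        raise ValueError("k must satisfy 1 <= k <= min(m, n)")
    rng = np.random.default_rng(rng)

    # Top-k right singular vectors; the SVD dominates the running time.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk_t = Vt[:k, :]  # shape (k, n)

    # Stage 1 (randomized): sample on the order of k log k columns.
    # Assumption: probabilities are normalized leverage scores.
    p = np.sum(Vk_t ** 2, axis=0)
    p = p / p.sum()
    c = max(k, int(np.ceil(oversample * k * np.log(max(k, 2)))))
    candidates = np.unique(rng.choice(n, size=c, replace=True, p=p))
    if candidates.size < k:
        # Top up so that at least k distinct candidates exist.
        candidates = np.union1d(candidates, np.argsort(-p)[:k])

    # Stage 2 (deterministic): keep exactly k of the sampled columns.
    picked = greedy_pivot(Vk_t[:, candidates], k)
    return sorted(int(candidates[j]) for j in picked)


def projection_error(A, idx):
    """Frobenius norm of A - P_C A, where C = A[:, idx]."""
    C = A[:, idx]
    return float(np.linalg.norm(A - C @ (np.linalg.pinv(C) @ A), "fro"))


def best_rank_k_error(A, k):
    """Frobenius norm of A - A_k, from the trailing singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sqrt(np.sum(s[k:] ** 2)))
```

As a usage example, calling `select_columns(A, 5, rng=1)` on a $30 \times 20$ matrix returns five distinct column indices, and comparing `projection_error(A, idx)` with `best_rank_k_error(A, 5)` gives an empirical feel for the approximation factor on a particular input. The sketch makes no claim to reproduce the stated probability or error guarantees.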

Cite

@article{arxiv.0812.4293,
  title  = {An Improved Approximation Algorithm for the Column Subset Selection Problem},
  author = {Christos Boutsidis and Michael W. Mahoney and Petros Drineas},
  journal= {arXiv preprint arXiv:0812.4293},
  year   = {2015}
}

Comments

17 pages; corrected a bug in the spectral norm bound of the previous version
