High-dimensional estimation with missing data: Statistical and computational limits

Statistics Theory 2026-03-18 v1 Data Structures and Algorithms Machine Learning Machine Learning Statistics Theory

Abstract

We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $\epsilon$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $\rho$, for any constant contamination $\epsilon \in (0, 1)$, (roughly) $n \gtrsim d e^{1/\rho^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/\rho^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.
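To get a feel for the size of the gap in mean estimation, the two rough sample-complexity scalings quoted above can be compared numerically. The sketch below is only illustrative: it drops all constants and logarithmic factors hidden by the "roughly" in the abstract, and the function names and the chosen values of the dimension `d` and target error `rho` are assumptions made here, not taken from the paper.

```python
import math


def info_theoretic_samples(d: int, rho: float) -> float:
    """Rough statistical scaling n ~ d * exp(1 / rho^2); constants and logs dropped."""
    return d * math.exp(1.0 / rho**2)


def efficient_samples(d: int, rho: float) -> float:
    """Rough scaling n ~ d^(1 / rho^2) for the efficient algorithm families; constants dropped."""
    return float(d) ** (1.0 / rho**2)


if __name__ == "__main__":
    d = 100  # hypothetical dimension, chosen only for illustration
    for rho in (0.7, 0.5, 0.4):  # hypothetical target errors
        stat = info_theoretic_samples(d, rho)
        comp = efficient_samples(d, rho)
        print(f"rho={rho}: statistical ~ {stat:.3g}, efficient ~ {comp:.3g}, ratio ~ {comp / stat:.3g}")
```

Because constants are ignored, the numbers are meaningful only as orders of magnitude. They show how the exponent $1/\rho^2$ acting on $d$, rather than on $e$, makes the efficient scaling grow much faster as $\rho$ shrinks.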

Cite

@article{arxiv.2603.16712,
  title  = {High-dimensional estimation with missing data: Statistical and computational limits},
  author = {Kabir Aladin Verchand and Ankit Pensia and Saminul Haque and Rohith Kuditipudi},
  journal= {arXiv preprint arXiv:2603.16712},
  year   = {2026}
}