
The Sample Complexity of Multiclass and Sparse Contextual Bandits

Abstract

We study contextual bandits in the stochastic i.i.d.\ setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $A$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by bandit multiclass classification with zero-one rewards, we focus on the \emph{$s$-sparse} setting in which, for every context, the reward vector has $L_1$-norm at most $s \ll |A|$. Our main result is the design of algorithms that, with high probability, output an $\epsilon$-optimal policy compared to policy class $\Pi$ using $\tilde{O}((s/\epsilon^2 + |A|/\epsilon)\log |\Pi|/\delta)$ samples. We extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024, 2025), which incurred an additional $\Theta(|A|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the \emph{decision-estimation coefficient} (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.
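To give a rough feel for how the two terms in the stated bound trade off, the following sketch evaluates only the leading-order expression $(s/\epsilon^2 + |A|/\epsilon)\log(|\Pi|/\delta)$. It is an illustration under assumptions, not the paper's algorithm: the function name and parameters are hypothetical, the logarithm is read as $\log(|\Pi|/\delta)$, and all constants and the polylogarithmic factors hidden by $\tilde{O}$ are dropped, so the returned number is not an actual sample count.

```python
import math


def sparse_bandit_bound(s, num_actions, eps, num_policies, delta):
    """Leading-order term (s/eps^2 + |A|/eps) * log(|Pi|/delta).

    Hypothetical helper for illustration only: constants and the
    polylog factors hidden by the O-tilde notation are omitted, and
    the log term is assumed to mean log(|Pi| / delta).
    """
    if not (0 < eps <= 1):
        raise ValueError("eps must lie in (0, 1]")
    if not (0 < delta < 1):
        raise ValueError("delta must lie in (0, 1)")
    if s <= 0 or num_actions <= 0 or num_policies <= 0:
        raise ValueError("s, num_actions and num_policies must be positive")
    statistical_term = s / eps**2          # scales with the sparsity s
    exploration_term = num_actions / eps   # scales with the number of actions |A|
    return (statistical_term + exploration_term) * math.log(num_policies / delta)


if __name__ == "__main__":
    # Multiclass classification with zero-one rewards corresponds to s = 1.
    for eps in (0.1, 0.01, 0.001):
        value = sparse_bandit_bound(s=1, num_actions=100, eps=eps,
                                    num_policies=1000, delta=0.05)
        print(f"eps={eps}: leading-order term = {value:.3e}")
```

In this expression the two terms are equal at $\epsilon = s/|A|$: for smaller $\epsilon$ the $s/\epsilon^2$ term dominates, and for larger $\epsilon$ the $|A|/\epsilon$ term does.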

Cite

@article{arxiv.2605.29645,
  title  = {The Sample Complexity of Multiclass and Sparse Contextual Bandits},
  author = {Liad Erez and Fan Chen and Alon Cohen and Tomer Koren and Yishay Mansour and Shay Moran and Alexander Rakhlin},
  journal= {arXiv preprint arXiv:2605.29645},
  year   = {2026}
}