Related papers: Optimal anytime regret with two experts
Algorithms with predictions is a recent framework that has been used to overcome pessimistic worst-case bounds in incomplete information settings. In the context of scheduling, very recent work has leveraged machine-learned predictions to…
We propose an anytime online algorithm for the problem of learning a sequence of adversarial convex cost functions while approximately satisfying another sequence of adversarial online convex constraints. A sequential algorithm is called…
We consider online convex optimization with a zero-order oracle feedback. In particular, the decision maker does not know the explicit representation of the time-varying cost functions, or their gradients. At each time step, she observes…
Towards bridging classical optimal control and online learning, regret minimization has recently been proposed as a control design criterion. This competitive paradigm penalizes the loss relative to the optimal control actions chosen by a…
We study online convex optimization under stochastic sub-gradient observation faults, where we introduce adaptive algorithms with minimax optimal regret guarantees. We specifically study scenarios where our sub-gradient observations can be…
In online learning, the dynamic regret metric chooses the reference (optimal) solution that may change over time, while the typical (static) regret metric assumes the reference solution to be constant over the whole time horizon. The…
We consider a stochastic multi-armed bandit setting and study the problem of constrained regret minimization over a given time horizon. Each arm is associated with an unknown, possibly multi-dimensional distribution, and the merit of an arm…
Dealing with uncertainty in optimization parameters is an important and longstanding challenge. Typically, uncertain parameters are predicted accurately, and then a deterministic optimization problem is solved. However, the decisions…
We study best-of-both-worlds algorithms for bandits with switching cost, recently addressed by Rouyer, Seldin and Cesa-Bianchi, 2021. We introduce a surprisingly simple and effective algorithm that simultaneously achieves minimax optimal…
We consider a setting where a system learns to rank a fixed set of $m$ items. The goal is produce good item rankings for users with diverse interests who interact online with the system for $T$ rounds. We consider a novel top-$1$ feedback…
Regret matching (RM) -- and its modern variants -- is a foundational online algorithm that has been at the heart of many AI breakthrough results in solving benchmark zero-sum games, such as poker. Yet, surprisingly little is known so far in…
We consider the problem of using observational bandit feedback data from multiple heterogeneous data sources to learn a personalized decision policy that robustly generalizes across diverse target settings. To achieve this, we propose a…
We show how to take any two parameter-free online learning algorithms with different regret guarantees and obtain a single algorithm whose regret is the minimum of the two base algorithms. Our method is embarrassingly simple: just add the…
We study the problem of minimising regret in two-armed bandit problems with Gaussian rewards. Our objective is to use this simple setting to illustrate that strategies based on an exploration phase (up to a stopping time) followed by…
In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In…
Multi-armed bandit (MAB) algorithms have achieved significant success in sequential decision-making applications, under the premise that humans perfectly implement the recommended policy. However, existing methods often overlook the crucial…
We consider the classical stochastic multi-armed bandit problem with a constraint that limits the total cost incurred by switching between actions to be no larger than a given switching budget. For this problem, we prove matching upper and…
We investigate the problem of continuous-time causal estimation under a minimax criterion. Let $X^T = \{X_t,0\leq t\leq T\}$ be governed by the probability law $P_{\theta}$ from a class of possible laws indexed by $\theta \in \Lambda$, and…
Finding numerical approximations to minimax regret treatment rules is of key interest. To do so when potential outcomes are in {0,1} we discretize the action space of nature and apply a variant of Robinson's (1951) algorithm for iterative…
We consider the nonstochastic multi-agent multi-armed bandit problem with agents collaborating via a communication network with delays. We show a lower bound for individual regret of all agents. We show that with suitable regularizers and…