Related papers: An efficient algorithm for learning with semi-band…
We propose a sample-efficient alternative for importance weighting for situations where one only has sample access to the probability distribution that generates the observations. Our new method, called Geometric Resampling (GR), is…
We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions.…
We consider the problem of asynchronous online combinatorial optimization on a network of communicating agents. At each time step, some of the agents are stochastically activated, requested to make a prediction, and the system pays the…
This paper studies the optimality and complexity of Follow-the-Perturbed-Leader (FTPL) policy in $m$-set semi-bandit problems. FTPL has been studied extensively as a promising candidate of an efficient algorithm with favorable regret for…
We consider combinatorial online learning with subset choices when only relative feedback information from subsets is available, instead of bandit or semi-bandit feedback which is absolute. Specifically, we study two regret minimisation…
We address online linear optimization problems when the possible actions of the decision maker are represented by binary vectors. The regret of the decision maker is the difference between her realized loss and the best loss she would have…
This paper studies the optimality and complexity of Follow-the-Perturbed-Leader (FTPL) policy in size-invariant combinatorial semi-bandit problems. Recently, Honda et al. (2023) and Lee et al. (2024) showed that FTPL achieves…
We consider a common case of the combinatorial semi-bandit problem, the $m$-set semi-bandit, where the learner exactly selects $m$ arms from the total $d$ arms. In the adversarial setting, the best regret bound, known to be…
We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, our analysis allows us to obtain new results in…
In combinatorial semi-bandits, a learner repeatedly selects from a combinatorial decision set of arms, receives the realized sum of rewards, and observes the rewards of the individual selected arms as feedback. In this paper, we extend this…
We study the problem of online learning with non-convex losses, where the learner has access to an offline optimization oracle. We show that the classical Follow the Perturbed Leader (FTPL) algorithm achieves optimal regret rate of…
A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to constraints, and then observes stochastic weights of these items and receives their sum as…
We consider a partial-feedback variant of the well-studied online PCA problem where a learner attempts to predict a sequence of $d$-dimensional vectors in terms of a quadratic loss, while only having limited feedback about the environment's…
We provide the first algorithm for online bandit linear optimization whose regret after T rounds is of order sqrt{Td ln N} on any finite class X of N actions in d dimensions, and of order d*sqrt{T} (up to log factors) when X is infinite.…
In this paper we propose a novel experimental design-based algorithm to minimize regret in online stochastic linear and combinatorial bandits. While existing literature tends to focus on optimism-based algorithms--which have been shown to…
Combinatorial multi-armed bandits provide a fundamental online decision-making environment where a decision-maker interacts with an environment across $T$ time steps, each time selecting an action and learning the cost of that action. The…
Most work on sequential learning assumes a fixed set of actions that are available all the time. However, in practice, actions can consist of picking subsets of readings from sensors that may break from time to time, road segments that can…
We introduce an online learning algorithm for computing adaptive resource allocation policies against strategic ecological adversaries with unknown behavioral models and partial observability. Our setting addresses a fundamental limitation…
Online learning algorithms have been successfully used to design caching policies with sublinear regret in the total number of requests, with no statistical assumption about the request sequence. Most existing algorithms involve…
We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition…