Related papers: Multiplier Bootstrap-based Exploration
The multi-armed bandit(MAB) is a classical sequential decision problem. Most work requires assumptions about the reward distribution (e.g., bounded), while practitioners may have difficulty obtaining information about these distributions to…
In this paper, we propose a novel perturbation-based exploration method in bandit algorithms with bounded or unbounded rewards, called residual bootstrap exploration (\texttt{ReBoot}). The \texttt{ReBoot} enforces exploration by injecting…
In this paper, we consider stochastic multi-armed bandits (MABs) with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $\nu_{p}$ for $1<p\leq2$. First, we propose a novel robust estimator which does not require $\nu_{p}$…
We study incentivized exploration in multi-armed bandit (MAB) settings with infinitely many arms modeled as elements in continuous metric spaces. Unlike classical bandit models, we consider scenarios where the decision-maker (principal)…
Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL). Most works achieve this by carefully estimating the model uncertainty and following the so-called optimistic model.…
The stochastic multi-armed bandit problem is a well-known model for studying the exploration-exploitation trade-off. It has significant possible applications in adaptive clinical trials, which allow for dynamic changes in the treatment…
Upper Confidence Bound (UCB) method is arguably the most celebrated one used in online decision making with partial information feedback. Existing techniques for constructing confidence bounds are typically built upon various concentration…
Inspired by the Reward-Biased Maximum Likelihood Estimate method of adaptive control, we propose RBMLE -- a novel family of learning algorithms for stochastic multi-armed bandits (SMABs). For a broad range of SMABs including both the…
In many platforms, user arrivals exhibit a self-reinforcing behavior: future user arrivals are likely to have preferences similar to users who were satisfied in the past. In other words, arrivals exhibit positive externalities. We study…
We study the non-stationary stochastic multiarmed bandit (MAB) problem and propose two generic algorithms, namely, the limited memory deterministic sequencing of exploration and exploitation (LM-DSEE) and the Sliding-Window Upper Confidence…
Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling explore-exploit trade-offs. This paper studies the stochastic contextual bandit problem with general bounded reward…
We study incentivized exploration for the multi-armed bandit (MAB) problem with non-stationary reward distributions, where players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on the…
The Multiarmed Bandits (MAB) problem has been extensively studied and has seen many practical applications in a variety of fields. The Survival Multiarmed Bandits (S-MAB) open problem is an extension which constrains an agent to a budget…
This paper introduces the framework of multi-armed sampling, which serves as the sampling counterpart to the optimization problem of multi-armed bandits. Our primary motivation is to rigorously examine the exploration-exploitation trade-off…
We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds $O(t)$ i.i.d. pseudo-rewards to its history in round $t$ and then pulls the arm with the highest average reward in its…
We propose a new bootstrap-based online algorithm for stochastic linear bandit problems. The key idea is to adopt residual bootstrap exploration, in which the agent estimates the next step reward by re-sampling the residuals of mean reward…
We study incentivized exploration for the multi-armed bandit (MAB) problem where the players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on reward. We seek to understand the impact of…
This work addresses the problem of regret minimization in non-stochastic multi-armed bandit problems, focusing on performance guarantees that hold with high probability. Such results are rather scarce in the literature since proving them…
In budget-limited multi-armed bandit (MAB) problems, the learner's actions are costly and constrained by a fixed budget. Consequently, an optimal exploitation policy may not be to pull the optimal arm repeatedly, as is the case in other…
In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with unknown reward models. At each time, a player selects one arm to play, aiming to maximize the total expected reward over a horizon of length T. An approach based on…