Related papers: Dynamic Memory for Interpretable Sequential Optimi…

A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits

We study the non-stationary stochastic multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning. The performance of a learning algorithm is evaluated in terms of their…

Machine Learning · Computer Science 2022-03-09 Yasin Abbasi-Yadkori, Andras Gyorgy, Nevena Lazic

Adapting Behaviour for Learning Progress

Determining what experience to generate to best facilitate learning (i.e. exploration) is one of the distinguishing features and open challenges in reinforcement learning. The advent of distributed agents that interact with parallel…

Machine Learning · Computer Science 2019-12-17 Tom Schaul, Diana Borsa, David Ding, David Szepesvari, Georg Ostrovski, Will Dabney, Simon Osindero

The multi-armed bandit problem with covariates

We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows…

Statistics Theory · Mathematics 2013-05-27 Vianney Perchet, Philippe Rigollet

Constrained Feedback Learning for Non-Stationary Multi-Armed Bandits

Non-stationary multi-armed bandits enable agents to adapt to changing environments by incorporating mechanisms to detect and respond to shifts in reward distributions, making them well-suited for dynamic settings. However, existing…

Machine Learning · Computer Science 2025-09-19 Shaoang Li, Jian Li

Learning Contextual Bandits in a Non-stationary Environment

Multi-armed bandit algorithms have become a reference solution for handling the explore/exploit dilemma in recommender systems, and many other important real-world problems, such as display advertisement. However, such algorithms usually…

Machine Learning · Computer Science 2018-05-25 Qingyun Wu, Naveen Iyer, Hongning Wang

Online Learning with Costly Features in Non-stationary Environments

Maximizing long-term rewards is the primary goal in sequential decision-making problems. The majority of existing methods assume that side information is freely available, enabling the learning agent to observe all features' states before…

Machine Learning · Computer Science 2023-07-19 Saeed Ghoorchian, Evgenii Kortukov, Setareh Maghsudi

Learning from an Exploring Demonstrator: Optimal Reward Estimation for Bandits

We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance from observing the learning process of a low-regret demonstrator. Existing approaches to the related problem of inverse reinforcement…

Machine Learning · Statistics 2022-02-23 Wenshuo Guo, Kumar Krishna Agrawal, Aditya Grover, Vidya Muthukumar, Ashwin Pananjady

Hedging the Drift: Learning to Optimize under Non-Stationarity

We introduce data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for non-stationary bandit settings. These settings capture applications such as advertisement allocation, dynamic pricing, and…

Machine Learning · Computer Science 2021-03-19 Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu

Finite-Time Guarantees for Multi-Agent Combinatorial Bandits with Nonstationary Rewards

We study a sequential resource allocation problem where a decision maker selects subsets of agents at each period to maximize overall outcomes without prior knowledge of individual-level effects. Our framework applies to settings such as…

Machine Learning · Computer Science 2025-08-29 Katherine B. Adams, Justin J. Boutilier, Qinyang He, Yonatan Mintz

An Optimization-based Algorithm for Non-stationary Kernel Bandits without Prior Knowledge

We propose an algorithm for non-stationary kernel bandits that does not require prior knowledge of the degree of non-stationarity. The algorithm follows randomized strategies obtained by solving optimization problems that balance…

Machine Learning · Statistics 2023-02-21 Kihyuk Hong, Yuhang Li, Ambuj Tewari

Learning Modular Safe Policies in the Bandit Setting with Application to Adaptive Clinical Trials

The stochastic multi-armed bandit problem is a well-known model for studying the exploration-exploitation trade-off. It has significant possible applications in adaptive clinical trials, which allow for dynamic changes in the treatment…

Machine Learning · Computer Science 2019-06-11 Hossein Aboutalebi, Doina Precup, Tibor Schuster

Contextual Bandits and Imitation Learning via Preference-Based Active Queries

We consider the problem of contextual bandits and imitation learning, where the learner lacks direct knowledge of the executed action's reward. Instead, the learner can actively query an expert at each round to compare two actions and…

Machine Learning · Computer Science 2023-07-25 Ayush Sekhari, Karthik Sridharan, Wen Sun, Runzhe Wu

Efficient Contextual Bandits in Non-stationary Worlds

Most contextual bandit algorithms minimize regret against the best fixed policy, a questionable benchmark for non-stationary environments that are ubiquitous in applications. In this work, we develop several efficient contextual bandit…

Machine Learning · Computer Science 2019-04-05 Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, John Langford

An Adaptive Method for Contextual Stochastic Multi-armed Bandits with Rewards Generated by a Linear Dynamical System

Online decision-making can be formulated as the popular stochastic multi-armed bandit problem where a learner makes decisions (or takes actions) to maximize cumulative rewards collected from an unknown environment. This paper proposes to…

Systems and Control · Electrical Eng. & Systems 2025-11-26 Jonathan Gornet, Mehdi Hosseinzadeh, Bruno Sinopoli

Adaptive Exploration for Latent-State Bandits

The multi-armed bandit problem is a core framework for sequential decision-making under uncertainty, but classical algorithms often fail in environments with hidden, time-varying states that confound reward estimation and optimal action…

Machine Learning · Computer Science 2026-02-19 Jikai Jin, Kenneth Hung, Sanath Kumar Krishnamurthy, Baoyi Shi, Congshan Zhang

Contextual Bandit Learning with Predictable Rewards

Contextual bandit learning is a reinforcement learning problem where the learner repeatedly receives a set of features (context), takes an action and receives a reward based on the action and context. We consider this problem under a…

Machine Learning · Computer Science 2012-03-05 Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, Robert E. Schapire

A Risk-Averse Framework for Non-Stationary Stochastic Multi-Armed Bandits

In a typical stochastic multi-armed bandit problem, the objective is often to maximize the expected sum of rewards over some time horizon $T$. While the choice of a strategy that accomplishes that is optimal with no additional information,…

Machine Learning · Computer Science 2023-11-01 Reda Alami, Mohammed Mahfoud, Mastane Achab

Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary Dueling Bandits

We study the problem of dynamic regret minimization in $K$-armed Dueling Bandits under non-stationary or time varying preferences. This is an online learning setup where the agent chooses a pair of items at each round and observes…

Machine Learning · Computer Science 2022-06-14 Aadirupa Saha, Shubham Gupta

A Bandit Framework for Optimal Selection of Reinforcement Learning Agents

Deep Reinforcement Learning has been shown to be very successful in complex games, e.g. Atari or Go. These games have clearly defined rules, and hence allow simulation. In many practical applications, however, interactions with the…

Machine Learning · Computer Science 2019-02-12 Andreas Merentitis, Kashif Rasul, Roland Vollgraf, Abdul-Saboor Sheikh, Urs Bergmann

Dynamic Regret of Policy Optimization in Non-stationary Environments

We consider reinforcement learning (RL) in episodic MDPs with adversarial full-information reward feedback and unknown fixed transition kernels. We propose two model-free policy optimization algorithms, POWER and POWER++, and establish…

Machine Learning · Computer Science 2020-07-02 Yingjie Fei, Zhuoran Yang, Zhaoran Wang, Qiaomin Xie