Related papers: A Reduction-based Framework for Sequential Decisio…

Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback

Equitably allocating limited resources in high-stakes domains, such as education, employment, and healthcare, requires balancing short-term utility with long-term impact, while accounting for delayed outcomes, hidden heterogeneity, and…

Artificial Intelligence · Computer Science 2025-11-17 Mohammadsina Almasi, Hadis Anahideh

Markov Decision Process modeled with Bandits for Sequential Decision Making in Linear-flow

For marketing, we sometimes need to recommend content for multiple pages in sequence. Unlike general sequential decision-making processes, these use cases have a simpler flow where customers, upon seeing recommended content on each page…

Machine Learning · Computer Science 2022-03-18 Wenjun Zeng, Yi Liu

Learning Adversarial Markov Decision Processes with Delayed Feedback

Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) feedback is observed in delay. This paper studies online learning in…

Machine Learning · Computer Science 2021-12-16 Tal Lancewicki, Aviv Rosenberg, Yishay Mansour

Learning Markov Decision Processes under Fully Bandit Feedback

A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this…

Machine Learning · Computer Science 2026-02-03 Zhengjia Zhuo, Anupam Gupta, Viswanath Nagarajan

A Bandit Learning Method for Continuous Games under Feedback Delays with Residual Pseudo-Gradient Estimate

Learning in multi-player games can model a large variety of practical scenarios, where each player seeks to optimize its own local objective function, which at the same time relies on the actions taken by others. Motivated by the frequent…

Optimization and Control · Mathematics 2023-09-08 Yuanhanqing Huang, Jianghai Hu

Delay-Aware Model-Based Reinforcement Learning for Continuous Control

Action delays degrade the performance of reinforcement learning in many real-world systems. This paper proposes a formal definition of delay-aware Markov Decision Process and proves it can be transformed into standard MDP with augmented…

Machine Learning · Computer Science 2021-05-10 Baiming Chen, Mengdi Xu, Liang Li, Ding Zhao

Multi-Armed Bandit Strategies for Non-Stationary Reward Distributions and Delayed Feedback Processes

A survey is performed of various Multi-Armed Bandit (MAB) strategies in order to examine their performance in circumstances exhibiting non-stationary stochastic reward functions in conjunction with delayed feedback. We run several MAB…

Machine Learning · Computer Science 2019-07-31 Larkin Liu, Richard Downe, Joshua Reid

Biased Dueling Bandits with Stochastic Delayed Feedback

The dueling bandit problem, an essential variation of the traditional multi-armed bandit problem, has become significantly prominent recently due to its broad applications in online advertising, recommendation systems, information…

Machine Learning · Computer Science 2025-04-08 Bongsoo Yi, Yue Kang, Yao Li

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision…

Machine Learning · Computer Science 2023-01-24 Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg

Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions

We study the stochastic Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm. We consider two settings: the reward-dependent delay setting, where realized delays may depend on the stochastic rewards,…

Machine Learning · Computer Science 2021-06-07 Tal Lancewicki, Shahar Segal, Tomer Koren, Yishay Mansour

Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards

Decision-making problems of sequential nature, where decisions made in the past may have an impact on the future, are used to model many practically important applications. In some real-world applications, feedback about a decision is…

Machine Learning · Computer Science 2023-03-02 Ronald C. van den Broek, Rik Litjens, Tobias Sagis, Luc Siecker, Nina Verbeeke, Pratik Gajane

Best arm identification in multi-armed bandits with delayed feedback

We propose a generalization of the best arm identification problem in stochastic multi-armed bandits (MAB) to the setting where every pull of an arm is associated with delayed feedback. The delay in feedback increases the effective sample…

Machine Learning · Computer Science 2018-03-30 Aditya Grover, Todor Markov, Peter Attia, Norman Jin, Nicholas Perkins, Bryan Cheong, Michael Chen, Zi Yang, Stephen Harris, William Chueh, Stefano Ermon

Revisiting State Augmentation methods for Reinforcement Learning with Stochastic Delays

Several real-world scenarios, such as remote control and sensing, are comprised of action and observation delays. The presence of delays degrades the performance of reinforcement learning (RL) algorithms, often to such an extent that…

Machine Learning · Computer Science 2021-08-18 Somjit Nath, Mayank Baranwal, Harshad Khadilkar

Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs

In order to make good decisions under uncertainty, an agent must learn from observations. To do so, two of the most common frameworks are Contextual Bandits and Markov Decision Processes (MDPs). In this paper, we study whether there exist…

Machine Learning · Computer Science 2019-11-05 Andrea Zanette, Emma Brunskill

Continuous-Time Distributed Dynamic Programming for Networked Multi-Agent Markov Decision Processes

The main goal of this paper is to investigate continuous-time distributed dynamic programming (DP) algorithms for networked multi-agent Markov decision problems (MAMDPs). In our study, we adopt a distributed multi-agent framework where…

Systems and Control · Electrical Eng. & Systems 2024-06-14 Donghwan Lee, Han-Dong Lim, Do Wan Kim

Linear Bandits with Stochastic Delayed Feedback

Stochastic linear bandits are a natural and well-studied model for structured exploration/exploitation problems and are widely used in applications such as online marketing and recommendation. One of the main challenges faced by…

Machine Learning · Statistics 2020-03-03 Claire Vernade, Alexandra Carpentier, Tor Lattimore, Giovanni Zappella, Beyza Ermis, Michael Brueckner

Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback

We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual…

Machine Learning · Computer Science 2025-10-28 Shinji Ito, Kevin Jamieson, Haipeng Luo, Arnab Maiti, Taira Tsuchiya

Unified Models of Human Behavioral Agents in Bandits, Contextual Bandits and RL

Artificial behavioral agents are often evaluated based on their consistent behaviors and performance to take sequential actions in an environment to maximize some notion of cumulative reward. However, human decision making in real life…

Artificial Intelligence · Computer Science 2021-12-28 Baihan Lin, Guillermo Cecchi, Djallel Bouneffouf, Jenna Reinen, Irina Rish

Multi-Action Restless Bandits with Weakly Coupled Constraints: Simultaneous Learning and Control

We study a system with finitely many groups of multi-action bandit processes, each of which is a Markov decision process (MDP) with finite state and action spaces and potentially different transition matrices when taking different actions.…

Optimization and Control · Mathematics 2024-12-05 Jing Fu, Bill Moran, José Niño-Mora

Finite-Horizon Markov Decision Processes with Sequentially-Observed Transitions

Markov Decision Processes (MDPs) have been used to formulate many decision-making problems in science and engineering. The objective is to synthesize the best decision (action selection) policies to maximize expected rewards (or minimize…

Optimization and Control · Mathematics 2015-07-07 Mahmoud El Chamie, Behcet Acikmese