Related papers: Learning Adversarial Markov Decision Processes wit…

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision…

Machine Learning · Computer Science 2023-01-24 Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg

Learning Markov Decision Processes under Fully Bandit Feedback

A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this…

Machine Learning · Computer Science 2026-02-03 Zhengjia Zhuo, Anupam Gupta, Viswanath Nagarajan

Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback, without prior knowledge on transitions or access to simulators. We introduce two algorithms that achieve improved regret…

Machine Learning · Computer Science 2023-10-19 Haolin Liu, Chen-Yu Wei, Julian Zimmert

Online Reinforcement Learning in Markov Decision Process Using Linear Programming

We consider online reinforcement learning in episodic Markov decision process (MDP) with unknown transition function and stochastic rewards drawn from some fixed but unknown distribution. The learner aims to learn the optimal policy and…

Machine Learning · Computer Science 2024-03-12 Vincent Leon, S. Rasoul Etesami

Online Markov Decision Processes with Aggregate Bandit Feedback

We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics. In each episode, the learner suffers the loss accumulated along the trajectory realized…

Machine Learning · Computer Science 2021-02-02 Alon Cohen, Haim Kaplan, Tomer Koren, Yishay Mansour

Online Learning under Delayed Feedback

Online learning with delayed feedback has received increasing attention recently due to its several applications in distributed, web-based learning problems. In this paper we provide a systematic study of the topic, and analyze the effect…

Machine Learning · Computer Science 2015-07-02 Pooria Joulani, András György, Csaba Szepesvári

Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning

We study reinforcement learning with delayed state observation, where the agent observes the current state after some random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound…

Machine Learning · Computer Science 2026-03-05 Harin Lee, Kevin Jamieson

Near-optimal Policy Optimization Algorithms for Learning Adversarial Linear Mixture MDPs

Learning Markov decision processes (MDPs) in the presence of the adversary is a challenging problem in reinforcement learning (RL). In this paper, we study RL in episodic MDPs with adversarial reward and full information feedback, where the…

Machine Learning · Computer Science 2022-04-21 Jiafan He, Dongruo Zhou, Quanquan Gu

Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward

We investigate an infinite-horizon average reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. The delay and compositeness of rewards mean that rewards generated as a result of taking an…

Machine Learning · Computer Science 2023-08-29 Washim Uddin Mondal, Vaneet Aggarwal

Optimism and Delays in Episodic Reinforcement Learning

There are many algorithms for regret minimisation in episodic reinforcement learning. This problem is well-understood from a theoretical perspective, providing that the sequences of states, actions and rewards associated with each episode…

Machine Learning · Computer Science 2023-04-07 Benjamin Howson, Ciara Pike-Burke, Sarah Filippi

Optimistic Regret Bounds for Online Learning in Adversarial Markov Decision Processes

The Adversarial Markov Decision Process (AMDP) is a learning framework that deals with unknown and varying tasks in decision-making applications like robotics and recommendation systems. A major limitation of the AMDP formalism, however, is…

Machine Learning · Statistics 2024-05-06 Sang Bin Moon, Abolfazl Hashemi

Online learning in MDPs with side information

We study online learning of finite Markov decision process (MDP) problems when a side information vector is available. The problem is motivated by applications such as clinical trials, recommendation systems, etc. Such applications have an…

Machine Learning · Computer Science 2014-06-27 Yasin Abbasi-Yadkori, Gergely Neu

Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss

We consider online learning for episodic stochastically constrained Markov decision processes (CMDPs), which plays a central role in ensuring the safety of reinforcement learning. Here the loss function can vary arbitrarily across the…

Machine Learning · Computer Science 2021-10-19 Shuang Qiu, Xiaohan Wei, Zhuoran Yang, Jieping Ye, Zhaoran Wang

Square-root regret bounds for continuous-time episodic Markov decision processes

We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. In contrast to discrete-time MDPs, the inter-transition times of a continuous-time MDP are exponentially…

Machine Learning · Computer Science 2023-10-04 Xuefeng Gao, Xun Yu Zhou

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves…

Machine Learning · Computer Science 2020-11-03 Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

We consider the problem of learning in adversarial Markov decision processes [MDPs] with an oblivious adversary in a full-information setting. The agent interacts with an environment during $T$ episodes, each of which consists of $H$…

Machine Learning · Computer Science 2025-03-06 Daniil Tiapkin, Evgenii Chzhen, Gilles Stoltz

Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

We study the problem of learning Markov decision processes with finite state and action spaces when the transition probability distributions and loss functions are chosen adversarially and are allowed to change with time. We introduce an…

Machine Learning · Computer Science 2013-03-14 Yasin Abbasi-Yadkori, Peter L. Bartlett, Csaba Szepesvari

Online learning in MDPs with linear function approximation and bandit feedback

We consider an online learning problem where the learner interacts with a Markov decision process in a sequence of episodes, where the reward function is allowed to change between episodes in an adversarial manner and the learner only gets…

Machine Learning · Computer Science 2021-06-15 Gergely Neu, Julia Olkhovskaya

Learning Adversarial MDPs with Stochastic Hard Constraints

We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints, under bandit feedback. We consider three scenarios. In the first one, we address general CMDPs, where we…

Machine Learning · Computer Science 2025-02-10 Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

Online Learning in MDPs with Partially Adversarial Transitions and Losses

We study reinforcement learning in MDPs whose transition function is stochastic at most steps but may behave adversarially at a fixed subset of $\Lambda$ steps per episode. This model captures environments that are stable except at a few…

Machine Learning · Computer Science 2026-02-11 Ofir Schlisselberg, Tal Lancewicki, Yishay Mansour