Related papers: Optimistic Policy Optimization with Bandit Feedbac…

Delay-Adapted Policy Optimization and Improved Regret for Adversarial MDP with Delayed Bandit Feedback

Policy Optimization (PO) is one of the most popular methods in Reinforcement Learning (RL). Thus, theoretical guarantees for PO algorithms have become especially important to the RL community. In this paper, we study PO in adversarial MDPs…

Machine Learning · Computer Science 2023-05-16 Tal Lancewicki, Aviv Rosenberg, Dmitry Sotnikov

Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation

We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory…

Machine Learning · Computer Science 2023-01-31 Uri Sherman, Tomer Koren, Yishay Mansour

Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback, without prior knowledge on transitions or access to simulators. We introduce two algorithms that achieve improved regret…

Machine Learning · Computer Science 2023-10-19 Haolin Liu, Chen-Yu Wei, Julian Zimmert

A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes

The proximal policy optimization (PPO) algorithm stands as one of the most prosperous methods in the field of reinforcement learning (RL). Despite its success, the theoretical understanding of PPO remains deficient. Specifically, it is…

Machine Learning · Computer Science 2023-06-09 Han Zhong, Tong Zhang

Optimistic Distributionally Robust Policy Optimization

Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), as widely employed policy-based reinforcement learning (RL) methods, are prone to converge to a sub-optimal solution as they limit the policy representation…

Machine Learning · Computer Science 2020-06-16 Jun Song, Chaoyue Zhao

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves…

Machine Learning · Computer Science 2020-11-03 Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be 'close' to one another, is…

Machine Learning · Computer Science 2019-12-13 Lior Shani, Yonathan Efroni, Shie Mannor

Near-optimal Policy Optimization Algorithms for Learning Adversarial Linear Mixture MDPs

Learning Markov decision processes (MDPs) in the presence of the adversary is a challenging problem in reinforcement learning (RL). In this paper, we study RL in episodic MDPs with adversarial reward and full information feedback, where the…

Machine Learning · Computer Science 2022-04-21 Jiafan He, Dongruo Zhou, Quanquan Gu

Complete Policy Regret Bounds for Tallying Bandits

Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary. We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy…

Machine Learning · Statistics 2022-04-26 Dhruv Malik, Yuanzhi Li, Aarti Singh

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Policy optimization is a widely-used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass…

Machine Learning · Computer Science 2021-07-20 Haipeng Luo, Chen-Yu Wei, Chung-Wei Lee

Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback

We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a. full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire…

Machine Learning · Computer Science 2025-02-07 Tal Lancewicki, Yishay Mansour

Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning

We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving rate-optimal regret guarantees in this setting. Our…

Machine Learning · Computer Science 2026-03-16 Antoine Moulin, Gergely Neu, Luca Viano

Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback

We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able…

Machine Learning · Computer Science 2025-02-10 Francesco Emanuele Stradi, Anna Lunghi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

Minimax Regret Bounds for Reinforcement Learning

We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}( \sqrt{HSAT} + H^2S^2A+H\sqrt{T})$…

Machine Learning · Statistics 2017-07-04 Mohammad Gheshlaghi Azar, Ian Osband, Rémi Munos

On learning Whittle index policy for restless bandits with scalable regret

Reinforcement learning is an attractive approach to learn good resource allocation and scheduling policies based on data when the system model is unknown. However, the cumulative regret of most RL algorithms scales as $\tilde O(\mathsf{S}…

Machine Learning · Computer Science 2023-04-28 Nima Akbarzadeh, Aditya Mahajan

Best of Both Worlds Policy Optimization

Policy optimization methods are popular reinforcement learning algorithms in practice. Recent works have built theoretical foundation for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight…

Machine Learning · Computer Science 2023-02-21 Christoph Dann, Chen-Yu Wei, Julian Zimmert

Policy Optimization as Online Learning with Mediator Feedback

Policy Optimization (PO) is a widely used approach to address continuous control tasks. In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space. The additional available…

Machine Learning · Computer Science 2020-12-16 Alberto Maria Metelli, Matteo Papini, Pierluca D'Oro, Marcello Restelli

Towards Tractable Optimism in Model-Based Reinforcement Learning

The principle of optimism in the face of uncertainty is prevalent throughout sequential decision making problems such as multi-armed bandits and reinforcement learning (RL). To be successful, an optimistic RL algorithm must over-estimate…

Machine Learning · Computer Science 2021-12-07 Aldo Pacchiano, Philip J. Ball, Jack Parker-Holder, Krzysztof Choromanski, Stephen Roberts

Near-Optimal Regret for Policy Optimization in Contextual MDPs with General Offline Function Approximation

We introduce \texttt{OPO-CMDP}, the first policy optimization algorithm for stochastic Contextual Markov Decision Process (CMDPs) under general offline function approximation. Our approach achieves a high probability regret bound of…

Machine Learning · Computer Science 2026-02-17 Orin Levy, Aviv Rosenberg, Alon Cohen, Yishay Mansour

Stochastic Bandits with Linear Constraints

We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies, whose expected cumulative reward over the course of $T$ rounds is maximum, and each has an expected cost below a…

Machine Learning · Computer Science 2020-06-20 Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett, Heinrich Jiang