Related papers: Improved Regret for Efficient Online Reinforcement…

Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback, without prior knowledge on transitions or access to simulators. We introduce two algorithms that achieve improved regret…

Machine Learning · Computer Science 2023-10-19 Haolin Liu, Chen-Yu Wei, Julian Zimmert

Refined Regret for Adversarial MDPs with Linear Function Approximation

We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in…

Machine Learning · Computer Science 2023-06-05 Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert

Improved learning rates in multi-unit uniform price auctions

Motivated by the strategic participation of electricity producers in electricity day-ahead market, we study the problem of online learning in repeated multi-unit uniform price auctions focusing on the adversarial opposing bid setting. The…

Computer Science and Game Theory · Computer Science 2025-01-20 Marius Potfer, Dorian Baudry, Hugo Richard, Vianney Perchet, Cheng Wan

Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

We present an algorithm based on the Optimism in the Face of Uncertainty (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By…

Machine Learning · Computer Science 2020-01-01 Zihan Zhang, Xiangyang Ji

Minimax Regret Bounds for Reinforcement Learning

We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}( \sqrt{HSAT} + H^2S^2A+H\sqrt{T})$…

Machine Learning · Statistics 2017-07-04 Mohammad Gheshlaghi Azar, Ian Osband, Rémi Munos

Optimistic Policy Optimization with Bandit Feedback

Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. Yet, so far, such methods have been mostly analyzed from an optimization perspective, without addressing the problem of…

Machine Learning · Computer Science 2020-06-19 Yonathan Efroni, Lior Shani, Aviv Rosenberg, Shie Mannor

Impact of Representation Learning in Linear Bandits

We study how representation learning can improve the efficiency of bandit problems. We study the setting where we play $T$ linear bandits with dimension $d$ concurrently, and these $T$ bandit tasks share a common $k (\ll d)$ dimensional…

Machine Learning · Computer Science 2021-05-06 Jiaqi Yang, Wei Hu, Jason D. Lee, Simon S. Du

A Model Selection Approach for Corruption Robust Reinforcement Learning

We develop a model selection approach to tackle reinforcement learning with adversarial corruption in both transition and reward. For finite-horizon tabular MDPs, without prior knowledge on the total amount of corruption, our algorithm…

Machine Learning · Computer Science 2024-12-31 Chen-Yu Wei, Christoph Dann, Julian Zimmert

Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning

We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving rate-optimal regret guarantees in this setting. Our…

Machine Learning · Computer Science 2026-03-16 Antoine Moulin, Gergely Neu, Luca Viano

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Policy optimization is a widely-used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass…

Machine Learning · Computer Science 2021-07-20 Haipeng Luo, Chen-Yu Wei, Chung-Wei Lee

Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization

Learning Markov decision processes (MDP) in an adversarial environment has been a challenging problem. The problem becomes even more challenging with function approximation, since the underlying structure of the loss function and transition…

Machine Learning · Computer Science 2023-02-15 Fang Kong, Xiangcheng Zhang, Baoxiang Wang, Shuai Li

Efficient and Robust Algorithms for Adversarial Linear Contextual Bandits

We consider an adversarial variant of the classic $K$-armed linear contextual bandit problem where the sequence of loss functions associated with each arm are allowed to change without restriction over time. Under the assumption that the…

Machine Learning · Computer Science 2022-05-25 Gergely Neu, Julia Olkhovskaya

Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses predefined feature mapping to represent states and actions in a low-dimensional space. In this paper, we study…

Machine Learning · Computer Science 2021-02-24 Dongruo Zhou, Jiafan He, Quanquan Gu

Fast Rates for the Regret of Offline Reinforcement Learning

We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP). While existing analyses of common approaches, such as fitted $Q$-iteration…

Machine Learning · Computer Science 2023-07-13 Yichun Hu, Nathan Kallus, Masatoshi Uehara

First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach

Obtaining first-order regret bounds -- regret bounds scaling not as the worst-case but with some measure of the performance of the optimal policy on a given instance -- is a core question in sequential decision-making. While such bounds…

Machine Learning · Computer Science 2022-10-24 Andrew Wagenmaker, Yifang Chen, Max Simchowitz, Simon S. Du, Kevin Jamieson

Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound

Exploration in reinforcement learning (RL) suffers from the curse of dimensionality when the state-action space is large. A common practice is to parameterize the high-dimensional value and policy functions using given features. However…

Machine Learning · Computer Science 2019-06-14 Lin F. Yang, Mengdi Wang

Online learning in MDPs with linear function approximation and bandit feedback

We consider an online learning problem where the learner interacts with a Markov decision process in a sequence of episodes, where the reward function is allowed to change between episodes in an adversarial manner and the learner only gets…

Machine Learning · Computer Science 2021-06-15 Gergely Neu, Julia Olkhovskaya

Adversarial Contextual Bandits Go Kernelized

We study a generalization of the problem of online learning in adversarial linear contextual bandits by incorporating loss functions that belong to a reproducing kernel Hilbert space, which allows for a more flexible modeling of complex…

Machine Learning · Statistics 2023-10-04 Gergely Neu, Julia Olkhovskaya, Sattar Vakili

Efficient Reinforcement Learning in Probabilistic Reward Machines

In this paper, we study reinforcement learning in Markov Decision Processes with Probabilistic Reward Machines (PRMs), a form of non-Markovian reward commonly found in robotics tasks. We design an algorithm for PRMs that achieves a regret…

Machine Learning · Statistics 2024-08-21 Xiaofeng Lin, Xuezhou Zhang

Online Reinforcement Learning in Markov Decision Process Using Linear Programming

We consider online reinforcement learning in episodic Markov decision process (MDP) with unknown transition function and stochastic rewards drawn from some fixed but unknown distribution. The learner aims to learn the optimal policy and…

Machine Learning · Computer Science 2024-03-12 Vincent Leon, S. Rasoul Etesami