Related papers: Adaptive Approximate Policy Iteration
We propose an epoch-based reinforcement learning algorithm for infinite-horizon average-cost Markov decision processes (MDPs) that leverages a partial order over a policy class. In this structure, $\pi' \leq \pi$ if data collected under…
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form…
Model-free reinforcement learning is known to be memory- and computation-efficient and more amenable to large-scale problems. In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov…
Learning Markov decision processes (MDPs) in the presence of an adversary is a challenging problem in reinforcement learning (RL). In this paper, we study RL in episodic MDPs with adversarial reward and full information feedback, where the…
In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy…
We study the problem of infinite-horizon average-reward reinforcement learning with linear Markov decision processes (MDPs). The associated Bellman operator of the problem not being a contraction makes the algorithm design challenging.…
We consider approximate dynamic programming in $\gamma$-discounted Markov decision processes and apply it to approximate planning with linear value-function approximation. Our first contribution is a new variant of Approximate Policy…
The Adversarial Markov Decision Process (AMDP) is a learning framework that deals with unknown and varying tasks in decision-making applications like robotics and recommendation systems. A major limitation of the AMDP formalism, however, is…
We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory…
We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving rate-optimal regret guarantees in this setting. Our…
Tackling large approximate dynamic programming or reinforcement learning problems requires methods that can exploit regularities, or intrinsic structure, of the problem in hand. Most current methods are geared towards exploiting the…
Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI), i.e., where policy improvement and policy evaluation are both performed approximately. In applications where the…
Learning optimal policies from historical data enables personalization in a wide variety of applications including healthcare, digital recommendations, and online education. The growing policy learning literature focuses on settings where…
A crucial problem in reinforcement learning is learning the optimal policy. We study this in tabular infinite-horizon discounted Markov decision processes under the online setting. The existing algorithms either fail to achieve regret…
We study algorithms for average-cost reinforcement learning problems with value function approximation. Our starting point is the recently proposed POLITEX algorithm, a version of policy iteration where the policy produced in each iteration…
We consider the problem of learning in adversarial Markov decision processes (MDPs) with an oblivious adversary in a full-information setting. The agent interacts with an environment during $T$ episodes, each of which consists of $H$…
We study an approach to policy selection for large relational Markov Decision Processes (MDPs). We consider a variant of approximate policy iteration (API) that replaces the usual value-function learning step with a learning step in policy…
We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes. Compared to prior work, our bounds depend on alternative definitions of gaps. These definitions are based on the…
We present a new algorithm based on posterior sampling for learning in Constrained Markov Decision Processes (CMDP) in the infinite-horizon undiscounted setting. The algorithm achieves near-optimal regret bounds while being advantageous…
We describe an approximate dynamic programming (ADP) approach to compute approximations of the optimal strategies and of the minimal losses that can be guaranteed in discounted repeated games with vector-valued losses. Such games…