Related papers: Exploration-Enhanced POLITEX
In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy…
We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory…
Model-free reinforcement learning algorithms combined with value function approximation have recently achieved impressive performance in a variety of application domains. However, the theoretical understanding of such algorithms is limited,…
While learning in an unknown Markov Decision Process (MDP), an agent should trade off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward. Although the agent will…
We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving rate-optimal regret guarantees in this setting. Our…
We study the problem of infinite-horizon average-reward reinforcement learning with linear Markov decision processes (MDPs). The associated Bellman operator of the problem not being a contraction makes the algorithm design challenging.…
The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an…
We study reinforcement learning (RL) for decision processes with non-Markovian reward, in which high-level knowledge of the task in the form of reward machines is available to the learner. We consider probabilistic reward machines with…
We study the challenging exploration incentive problem in both bandit and reinforcement learning, where the rewards are scale-free and potentially unbounded, driven by real-world scenarios and differing from existing work. Past works in…
A default assumption in the design of reinforcement-learning algorithms is that a decision-making agent always explores to learn optimal behavior. In sufficiently complex environments that approach the vastness and scale of the real world,…
Exploration is widely regarded as one of the most challenging aspects of reinforcement learning (RL), with many naive approaches succumbing to exponential sample complexity. To isolate the challenges of exploration, we propose a new…
In model-based solution approaches to the problem of learning in an unknown environment, exploring to learn the model parameters takes a toll on the regret. The optimal performance with respect to regret or PAC bounds is achievable, if the…
Direct policy gradient methods for reinforcement learning are a successful approach for a variety of reasons: they are model free, they directly optimize the performance metric of interest, and they allow for richly parameterized policies.…
How do you incentivize self-interested agents to $\textit{explore}$ when they prefer to $\textit{exploit}$? We consider complex exploration problems, where each agent faces the same (but unknown) MDP. In contrast with traditional…
This work theoretically studies a ubiquitous reinforcement learning policy for controlling the canonical model of continuous-time stochastic linear-quadratic systems. We show that randomized certainty equivalent policy addresses the…
We study the model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over…
Policy optimization is a widely-used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass…
We address reinforcement learning problems with finite state and action spaces where the underlying MDP has some known structure that could be potentially exploited to minimize the exploration rates of suboptimal (state, action) pairs. For…
We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has been previously addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic…
The specification of aMarkov decision process (MDP) can be difficult. Reward function specification is especially problematic; in practice, it is often cognitively complex and time-consuming for users to precisely specify rewards. This work…