Related papers: Thompson Sampling for Learning Parameterized Marko…

Thompson Sampling for Parameterized Markov Decision Processes with Uninformative Actions

We study parameterized MDPs (PMDPs) in which the key parameters of interest are unknown and must be learned using Bayesian inference. One key defining feature of such models is the presence of "uninformative" actions that provide no…

Systems and Control · Electrical Eng. & Systems 2023-05-16 Michael Gimelfarb, Michael Jong Kim

Logarithmic regret bounds for continuous-time average-reward Markov decision processes

We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a…

Machine Learning · Computer Science 2024-07-03 Xuefeng Gao, Xun Yu Zhou

Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model

The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are both finite, to obtain a nearly optimal policy with…

Machine Learning · Computer Science 2022-10-28 Bingyan Wang, Yuling Yan, Jianqing Fan

Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses predefined feature mapping to represent states and actions in a low-dimensional space. In this paper, we study…

Machine Learning · Computer Science 2021-02-24 Dongruo Zhou, Jiafan He, Quanquan Gu

A Provably Efficient Model-Free Posterior Sampling Method for Episodic Reinforcement Learning

Thompson Sampling is one of the most effective methods for contextual bandits and has been generalized to posterior sampling for certain MDP settings. However, existing posterior sampling methods for reinforcement learning are limited by…

Machine Learning · Computer Science 2022-08-24 Christoph Dann, Mehryar Mohri, Tong Zhang, Julian Zimmert

Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes

We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs), which are MDPs with conditionally independent transition components. Assuming the factorization is known, we propose two model-based…

Machine Learning · Computer Science 2020-06-25 Yi Tian, Jian Qian, Suvrit Sra

Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement Learning

Thompson Sampling has been widely used for contextual bandit problems due to the flexibility of its modeling power. However, a general theory for this class of methods in the frequentist setting is still lacking. In this paper, we present a…

Machine Learning · Computer Science 2021-10-05 Tong Zhang

Model-based Reinforcement Learning and the Eluder Dimension

We consider the problem of learning to optimize an unknown Markov decision process (MDP). We show that, if the MDP can be parameterized within some known function class, we can obtain regret bounds that scale with the dimensionality, rather…

Machine Learning · Statistics 2014-11-04 Ian Osband, Benjamin Van Roy

Online Reinforcement Learning in Markov Decision Process Using Linear Programming

We consider online reinforcement learning in episodic Markov decision process (MDP) with unknown transition function and stochastic rewards drawn from some fixed but unknown distribution. The learner aims to learn the optimal policy and…

Machine Learning · Computer Science 2024-03-12 Vincent Leon, S. Rasoul Etesami

On Reward Structures of Markov Decision Processes

A Markov decision process can be parameterized by a transition kernel and a reward function. Both play essential roles in the study of reinforcement learning as evidenced by their presence in the Bellman equations. In our inquiry of various…

Machine Learning · Computer Science 2023-09-04 Falcon Z. Dai

Variational Regret Bounds for Reinforcement Learning

We consider undiscounted reinforcement learning in Markov decision processes (MDPs) where both the reward functions and the state-transition probabilities may vary (gradually or abruptly) over time. For this problem setting, we propose an…

Machine Learning · Computer Science 2019-09-11 Pratik Gajane, Ronald Ortner, Peter Auer

Posterior sampling for reinforcement learning: worst-case regret bounds

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our…

Machine Learning · Computer Science 2020-04-01 Shipra Agrawal, Randy Jia

Analysis of Thompson Sampling for Partially Observable Contextual Multi-Armed Bandits

Contextual multi-armed bandits are classical models in reinforcement learning for sequential decision-making associated with individual information. A widely-used policy for bandits is Thompson Sampling, where samples from a data-driven…

Machine Learning · Statistics 2021-11-30 Hongju Park, Mohamad Kazem Shirani Faradonbeh

Learning Unknown Markov Decision Processes: A Thompson Sampling Approach

We consider the problem of learning an unknown Markov Decision Process (MDP) that is weakly communicating in the infinite horizon setting. We propose a Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE).…

Machine Learning · Computer Science 2017-09-15 Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, Rahul Jain

Square-root regret bounds for continuous-time episodic Markov decision processes

We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. In contrast to discrete-time MDPs, the inter-transition times of a continuous-time MDP are exponentially…

Machine Learning · Computer Science 2023-10-04 Xuefeng Gao, Xun Yu Zhou

Thompson Sampling for Linear-Quadratic Control Problems

We consider the exploration-exploitation tradeoff in linear quadratic (LQ) control problems, where the state dynamics is linear and the cost function is quadratic in states and controls. We analyze the regret of Thompson sampling (TS)…

Machine Learning · Statistics 2017-03-28 Marc Abeille, Alessandro Lazaric

Near-optimal Reinforcement Learning in Factored MDPs

Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action…

Machine Learning · Statistics 2014-11-04 Ian Osband, Benjamin Van Roy

Regret Analysis in Deterministic Reinforcement Learning

We consider Markov Decision Processes (MDPs) with deterministic transitions and study the problem of regret minimization, which is central to the analysis and design of optimal learning algorithms. We present logarithmic problem-specific…

Machine Learning · Computer Science 2021-06-29 Damianos Tranos, Alexandre Proutiere

Thompson Sampling for Infinite-Horizon Discounted Decision Processes

This paper develops a viable notion of learning for sampling-based algorithms that applies in broader settings than previously considered. More specifically, we model discounted infinite-horizon MDPs with Borel state and action spaces,…

Machine Learning · Statistics 2026-04-09 Daniel Adelman, Cagla Keceli, Alba V. Olivares-Nadal

Making RL with Preference-based Feedback Efficient via Randomization

Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be efficient in terms of statistical complexity, computational complexity, and query complexity. In this work, we consider the RLHF setting where the feedback…

Machine Learning · Computer Science 2024-03-14 Runzhe Wu, Wen Sun