Related papers: Thompson Sampling for Learning Parameterized Marko…

200 papers

We study parameterized MDPs (PMDPs) in which the key parameters of interest are unknown and must be learned using Bayesian inference. One key defining feature of such models is the presence of "uninformative" actions that provide no…

Systems and Control · Electrical Eng. & Systems 2023-05-16 Michael Gimelfarb, Michael Jong Kim

We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process moves to a state and stays there for a…

Machine Learning · Computer Science 2024-07-03 Xuefeng Gao, Xun Yu Zhou

The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are both finite, to obtain a nearly optimal policy with…

Machine Learning · Computer Science 2022-10-28 Bingyan Wang, Yuling Yan, Jianqing Fan

Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses predefined feature mapping to represent states and actions in a low-dimensional space. In this paper, we study…

Machine Learning · Computer Science 2021-02-24 Dongruo Zhou, Jiafan He, Quanquan Gu

Thompson Sampling is one of the most effective methods for contextual bandits and has been generalized to posterior sampling for certain MDP settings. However, existing posterior sampling methods for reinforcement learning are limited by…

Machine Learning · Computer Science 2022-08-24 Christoph Dann, Mehryar Mohri, Tong Zhang, Julian Zimmert

We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs), which are MDPs with conditionally independent transition components. Assuming the factorization is known, we propose two model-based…

Machine Learning · Computer Science 2020-06-25 Yi Tian, Jian Qian, Suvrit Sra

Thompson Sampling has been widely used for contextual bandit problems due to the flexibility of its modeling power. However, a general theory for this class of methods in the frequentist setting is still lacking. In this paper, we present a…

Machine Learning · Computer Science 2021-10-05 Tong Zhang

We consider the problem of learning to optimize an unknown Markov decision process (MDP). We show that, if the MDP can be parameterized within some known function class, we can obtain regret bounds that scale with the dimensionality, rather…

Machine Learning · Statistics 2014-11-04 Ian Osband, Benjamin Van Roy

We consider online reinforcement learning in episodic Markov decision process (MDP) with unknown transition function and stochastic rewards drawn from some fixed but unknown distribution. The learner aims to learn the optimal policy and…

Machine Learning · Computer Science 2024-03-12 Vincent Leon, S. Rasoul Etesami

A Markov decision process can be parameterized by a transition kernel and a reward function. Both play essential roles in the study of reinforcement learning as evidenced by their presence in the Bellman equations. In our inquiry of various…

Machine Learning · Computer Science 2023-09-04 Falcon Z. Dai

We consider undiscounted reinforcement learning in Markov decision processes (MDPs) where both the reward functions and the state-transition probabilities may vary (gradually or abruptly) over time. For this problem setting, we propose an…

Machine Learning · Computer Science 2019-09-11 Pratik Gajane, Ronald Ortner, Peter Auer

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our…

Machine Learning · Computer Science 2020-04-01 Shipra Agrawal, Randy Jia

Contextual multi-armed bandits are classical models in reinforcement learning for sequential decision-making associated with individual information. A widely-used policy for bandits is Thompson Sampling, where samples from a data-driven…

Machine Learning · Statistics 2021-11-30 Hongju Park, Mohamad Kazem Shirani Faradonbeh

We consider the problem of learning an unknown Markov Decision Process (MDP) that is weakly communicating in the infinite horizon setting. We propose a Thompson Sampling-based reinforcement learning algorithm with dynamic episodes (TSDE).…

Machine Learning · Computer Science 2017-09-15 Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, Rahul Jain

We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. In contrast to discrete-time MDPs, the inter-transition times of a continuous-time MDP are exponentially…

Machine Learning · Computer Science 2023-10-04 Xuefeng Gao, Xun Yu Zhou

We consider the exploration-exploitation tradeoff in linear quadratic (LQ) control problems, where the state dynamics is linear and the cost function is quadratic in states and controls. We analyze the regret of Thompson sampling (TS)…

Machine Learning · Statistics 2017-03-28 Marc Abeille, Alessandro Lazaric

Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action…

Machine Learning · Statistics 2014-11-04 Ian Osband, Benjamin Van Roy

We consider Markov Decision Processes (MDPs) with deterministic transitions and study the problem of regret minimization, which is central to the analysis and design of optimal learning algorithms. We present logarithmic problem-specific…

Machine Learning · Computer Science 2021-06-29 Damianos Tranos, Alexandre Proutiere

This paper develops a viable notion of learning for sampling-based algorithms that applies in broader settings than previously considered. More specifically, we model discounted infinite-horizon MDPs with Borel state and action spaces,…

Machine Learning · Statistics 2026-04-09 Daniel Adelman, Cagla Keceli, Alba V. Olivares-Nadal

Reinforcement Learning algorithms that learn from human feedback (RLHF) need to be efficient in terms of statistical complexity, computational complexity, and query complexity. In this work, we consider the RLHF setting where the feedback…

Machine Learning · Computer Science 2024-03-14 Runzhe Wu, Wen Sun