Related papers: Intervention Efficient Algorithm for Two-Stage Cau…

Causal Markov Decision Processes: Learning Good Interventions Efficiently

We introduce causal Markov Decision Processes (C-MDPs), a new formalism for sequential decision making which combines the standard MDP formulation with causal structures over state transition and reward functions. Many contemporary and…

Machine Learning · Statistics 2021-02-16 Yangyi Lu, Amirhossein Meisami, Ambuj Tewari

A Causal Bandit Approach to Learning Good Atomic Interventions in Presence of Unobserved Confounders

We study the problem of determining the best intervention in a Causal Bayesian Network (CBN) specified only by its causal graph. We model this as a stochastic multi-armed bandit (MAB) problem with side-information, where the interventions…

Machine Learning · Computer Science 2022-05-20 Aurghya Maiti, Vineet Nair, Gaurav Sinha

Data- and Variance-dependent Regret Bounds for Online Tabular MDPs

This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent…

Machine Learning · Computer Science 2026-02-03 Mingyi Li, Taira Tsuchiya, Kenji Yamanishi

Regret Analysis in Deterministic Reinforcement Learning

We consider Markov Decision Processes (MDPs) with deterministic transitions and study the problem of regret minimization, which is central to the analysis and design of optimal learning algorithms. We present logarithmic problem-specific…

Machine Learning · Computer Science 2021-06-29 Damianos Tranos, Alexandre Proutiere

Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs

Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high "burn-in" costs and…

Machine Learning · Computer Science 2026-03-26 Guy Zamir, Matthew Zurek, Yudong Chen

Budgeted and Non-budgeted Causal Bandits

Learning good interventions in a causal graph can be modelled as a stochastic multi-armed bandit problem with side-information. First, we study this problem when interventions are more expensive than observations and a budget is specified.…

Machine Learning · Computer Science 2020-12-15 Vineet Nair, Vishakha Patil, Gaurav Sinha

Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments

We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic…

Machine Learning · Computer Science 2023-05-23 Runlong Zhou, Zihan Zhang, Simon S. Du

Partial Structure Discovery is Sufficient for No-regret Learning in Causal Bandits

Causal knowledge about the relationships among decision variables and a reward variable in a bandit setting can accelerate the learning of an optimal decision. Current works often assume the causal graph is known, which may not always be…

Machine Learning · Statistics 2024-11-07 Muhammad Qasim Elahi, Mahsa Ghasemi, Murat Kocaoglu

Learning Markov Decision Processes under Fully Bandit Feedback

A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this…

Machine Learning · Computer Science 2026-02-03 Zhengjia Zhuo, Anupam Gupta, Viswanath Nagarajan

Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints

In the optimization of dynamical systems, the variables typically have constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach to the problem, where the…

Machine Learning · Computer Science 2021-02-02 Qinbo Bai, Vaneet Aggarwal, Ather Gattami

A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs

We derive a novel asymptotic problem-dependent lower-bound for regret minimization in finite-horizon tabular Markov Decision Processes (MDPs). While, similar to prior work (e.g., for ergodic MDPs), the lower-bound is the solution to an…

Machine Learning · Computer Science 2021-06-25 Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric

Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap

This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDP), Adaptive Multi-step Bootstrap (AMB), which enjoys a stronger gap-dependent regret bound. The first innovation is to estimate the…

Machine Learning · Computer Science 2021-07-05 Haike Xu, Tengyu Ma, Simon S. Du

Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning

We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving rate-optimal regret guarantees in this setting. Our…

Machine Learning · Computer Science 2026-03-16 Antoine Moulin, Gergely Neu, Luca Viano

Confounded Budgeted Causal Bandits

We study the problem of learning 'good' interventions in a stochastic environment modeled by its underlying causal graph. Good interventions refer to interventions that maximize rewards. Specifically, we consider the setting of a…

Machine Learning · Computer Science 2024-01-17 Fateme Jamshidi, Jalal Etesami, Negar Kiyavash

Learning Constrained Markov Decision Processes With Non-stationary Rewards and Constraints

In constrained Markov decision processes (CMDPs) with adversarial rewards and constraints, a well-known impossibility result prevents any algorithm from attaining both sublinear regret and sublinear constraint violation, when competing…

Machine Learning · Computer Science 2024-09-27 Francesco Emanuele Stradi, Anna Lunghi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

A Bit of Freedom Goes a Long Way: Classical and Quantum Algorithms for Reinforcement Learning under a Generative Model

We propose novel classical and quantum online algorithms for learning finite-horizon and infinite-horizon average-reward Markov Decision Processes (MDPs). Our algorithms are based on a hybrid exploration-generative reinforcement learning…

Machine Learning · Computer Science 2025-08-12 Andris Ambainis, Joao F. Doriguello, Debbie Lim

Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes

In this paper, we consider an infinite horizon average reward Markov Decision Process (MDP). Distinguishing itself from existing works within this context, our approach harnesses the power of the general policy gradient-based algorithm,…

Machine Learning · Computer Science 2024-02-06 Qinbo Bai, Washim Uddin Mondal, Vaneet Aggarwal

Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning

We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes. Compared to prior work, our bounds depend on alternative definitions of gaps. These definitions are based on the…

Machine Learning · Computer Science 2021-10-27 Christoph Dann, Teodor V. Marinov, Mehryar Mohri, Julian Zimmert

Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes

We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs), which are MDPs with conditionally independent transition components. Assuming the factorization is known, we propose two model-based…

Machine Learning · Computer Science 2020-06-25 Yi Tian, Jian Qian, Suvrit Sra

Reinforcement Learning from Adversarial Preferences in Tabular MDPs

We introduce a new framework of episodic tabular Markov decision processes (MDPs) with adversarial preferences, which we refer to as preference-based MDPs (PbMDPs). Unlike standard episodic MDPs with adversarial losses, where the numerical…

Machine Learning · Computer Science 2025-07-17 Taira Tsuchiya, Shinji Ito, Haipeng Luo