Related papers: Experiment Planning with Function Approximation

Reward Learning as Doubly Nonparametric Bandits: Optimal Design and Scaling Laws

Specifying reward functions for complex tasks like object manipulation or driving is challenging to do by hand. Reward learning seeks to address this by learning a reward model using human feedback on selected query policies. This shifts…

Machine Learning · Computer Science 2023-02-27 Kush Bhatia, Wenshuo Guo, Jacob Steinhardt

Contextual Bandits in a Survey Experiment on Charitable Giving: Within-Experiment Outcomes versus Policy Learning

We design and implement an adaptive experiment (a "contextual bandit") to learn a targeted treatment assignment policy, where the goal is to use a participant's survey responses to determine which charity to expose them to in a donation…

Econometrics · Economics 2022-11-23 Susan Athey, Undral Byambadalai, Vitor Hadad, Sanath Kumar Krishnamurthy, Weiwen Leung, Joseph Jay Williams

Meta-Learning Bandit Policies by Gradient Ascent

Most bandit policies are designed to either minimize regret in any problem instance, making very few assumptions about the underlying environment, or in a Bayesian sense, assuming a prior distribution over environment parameters. The former…

Machine Learning · Computer Science 2021-01-07 Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvari, Craig Boutilier

Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems

For a real-world decision-making problem, the reward function often needs to be engineered or learned. A popular approach is to utilize human feedback to learn a reward function for training. The most straightforward way to do so is to ask…

Machine Learning · Computer Science 2023-10-31 Xiang Ji, Huazheng Wang, Minshuo Chen, Tuo Zhao, Mengdi Wang

Active Learning for Stochastic Contextual Linear Bandits

A key goal in stochastic contextual linear bandits is to efficiently learn a near-optimal policy. Prior algorithms for this problem learn a policy by strategically sampling actions but naively (passively) sampling contexts from the…

Machine Learning · Computer Science 2026-05-26 Emma Brunskill, Ishani Karmarkar, Zhaoqi Li

Fractional Moments on Bandit Problems

Reinforcement learning addresses the dilemma between exploration to find profitable actions and exploitation to act according to the best observations already made. Bandit problems are one such class of problems in stateless environments…

Machine Learning · Computer Science 2012-02-20 Ananda Narayanan B, Balaraman Ravindran

Adapting Behaviour via Intrinsic Reward: A Survey and Empirical Study

Learning about many things can provide numerous benefits to a reinforcement learning system. For example, learning many auxiliary value functions, in addition to optimizing the environmental reward, appears to improve both exploration and…

Machine Learning · Computer Science 2020-08-25 Cam Linke, Nadia M. Ady, Martha White, Thomas Degris, Adam White

Contextual Bandit Learning with Predictable Rewards

Contextual bandit learning is a reinforcement learning problem where the learner repeatedly receives a set of features (context), takes an action and receives a reward based on the action and context. We consider this problem under a…

Machine Learning · Computer Science 2012-03-05 Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, Robert E. Schapire

Neural Dueling Bandits: Preference-Based Optimization with Human Feedback

Contextual dueling bandit is used to model the bandit problems, where a learner's goal is to find the best arm for a given context using observed noisy human preference feedback over the selected arms for the past contexts. However,…

Machine Learning · Computer Science 2025-04-17 Arun Verma, Zhongxiang Dai, Xiaoqiang Lin, Patrick Jaillet, Bryan Kian Hsiang Low

Geometry Meets Incentives: Sample-Efficient Incentivized Exploration with Linear Contexts

In the incentivized exploration model, a principal aims to explore and learn over time by interacting with a sequence of self-interested agents. It has been recently understood that the main challenge in designing incentive-compatible…

Computer Science and Game Theory · Computer Science 2025-06-03 Benjamin Schiffer, Mark Sellke

Contextual Bandits and Imitation Learning via Preference-Based Active Queries

We consider the problem of contextual bandits and imitation learning, where the learner lacks direct knowledge of the executed action's reward. Instead, the learner can actively query an expert at each round to compare two actions and…

Machine Learning · Computer Science 2023-07-25 Ayush Sekhari, Karthik Sridharan, Wen Sun, Runzhe Wu

Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation

We study the problem of deployment efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting. This is a well-motivated problem because deploying new policies is costly in…

Machine Learning · Computer Science 2023-02-23 Dan Qiao, Yu-Xiang Wang

Satisficing Exploration for Deep Reinforcement Learning

A default assumption in the design of reinforcement-learning algorithms is that a decision-making agent always explores to learn optimal behavior. In sufficiently complex environments that approach the vastness and scale of the real world,…

Machine Learning · Computer Science 2024-07-23 Dilip Arumugam, Saurabh Kumar, Ramki Gummadi, Benjamin Van Roy

Adaptive Information Belief Space Planning

Reasoning about uncertainty is vital in many real-life autonomous systems. However, current state-of-the-art planning algorithms either cannot reason about uncertainty explicitly or do so with a high computational burden. Here, we focus on…

Artificial Intelligence · Computer Science 2022-01-31 Moran Barenboim, Vadim Indelman

Efficient Algorithms for Learning to Control Bandits with Unobserved Contexts

Contextual bandits are widely used in the study of learning-based control policies for finite action spaces. While the problem is well-studied for bandits with perfectly observed context vectors, little is known about the case of…

Machine Learning · Statistics 2022-02-03 Hongju Park, Mohamad Kazem Shirani Faradonbeh

Rollout Sampling Approximate Policy Iteration

Several researchers have recently investigated the connection between reinforcement learning and classification. We are motivated by proposals of approximate policy iteration schemes without value functions which focus on policy…

Machine Learning · Computer Science 2008-07-06 Christos Dimitrakakis, Michail G. Lagoudakis

Adapting Behaviour for Learning Progress

Determining what experience to generate to best facilitate learning (i.e. exploration) is one of the distinguishing features and open challenges in reinforcement learning. The advent of distributed agents that interact with parallel…

Machine Learning · Computer Science 2019-12-17 Tom Schaul, Diana Borsa, David Ding, David Szepesvari, Georg Ostrovski, Will Dabney, Simon Osindero

Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Structured Bandit Learning

We study learning to learn for the multi-task structured bandit problem where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure and an algorithm should exploit the shared…

Machine Learning · Computer Science 2025-10-24 Subhojyoti Mukherjee, Josiah P. Hanna, Qiaomin Xie, Robert Nowak

Contextual Bandits with Stage-wise Constraints

We study contextual bandits in the presence of a stage-wise constraint when the constraint must be satisfied both with high probability and in expectation. We start with the linear case where both the reward function and the stage-wise…

Machine Learning · Computer Science 2025-08-22 Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett

Satisficing in Time-Sensitive Bandit Learning

Much of the recent literature on bandit learning focuses on algorithms that aim to converge on an optimal action. One shortcoming is that this orientation does not account for time sensitivity, which can play a crucial role when learning an…

Machine Learning · Computer Science 2020-01-09 Daniel Russo, Benjamin Van Roy