Related papers: A Reduction-based Framework for Sequential Decisio…
Equitably allocating limited resources in high-stakes domains-such as education, employment, and healthcare-requires balancing short-term utility with long-term impact, while accounting for delayed outcomes, hidden heterogeneity, and…
For marketing, we sometimes need to recommend content for multiple pages in sequence. Different from general sequential decision making process, the use cases have a simpler flow where customers per seeing recommended content on each page…
Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) feedback is observed in delay. This paper studies online learning in…
A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this…
Learning in multi-player games can model a large variety of practical scenarios, where each player seeks to optimize its own local objective function, which at the same time relies on the actions taken by others. Motivated by the frequent…
Action delays degrade the performance of reinforcement learning in many real-world systems. This paper proposes a formal definition of delay-aware Markov Decision Process and proves it can be transformed into standard MDP with augmented…
A survey is performed of various Multi-Armed Bandit (MAB) strategies in order to examine their performance in circumstances exhibiting non-stationary stochastic reward functions in conjunction with delayed feedback. We run several MAB…
The dueling bandit problem, an essential variation of the traditional multi-armed bandit problem, has become significantly prominent recently due to its broad applications in online advertising, recommendation systems, information…
The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision…
We study the stochastic Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm. We consider two settings: the reward-dependent delay setting, where realized delays may depend on the stochastic rewards,…
Decision-making problems of sequential nature, where decisions made in the past may have an impact on the future, are used to model many practically important applications. In some real-world applications, feedback about a decision is…
We propose a generalization of the best arm identification problem in stochastic multi-armed bandits (MAB) to the setting where every pull of an arm is associated with delayed feedback. The delay in feedback increases the effective sample…
Several real-world scenarios, such as remote control and sensing, are comprised of action and observation delays. The presence of delays degrades the performance of reinforcement learning (RL) algorithms, often to such an extent that…
In order to make good decision under uncertainty an agent must learn from observations. To do so, two of the most common frameworks are Contextual Bandits and Markov Decision Processes (MDPs). In this paper, we study whether there exist…
The main goal of this paper is to investigate continuous-time distributed dynamic programming (DP) algorithms for networked multi-agent Markov decision problems (MAMDPs). In our study, we adopt a distributed multi-agent framework where…
Stochastic linear bandits are a natural and well-studied model for structured exploration/exploitation problems and are widely used in applications such as online marketing and recommendation. One of the main challenges faced by…
We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual…
Artificial behavioral agents are often evaluated based on their consistent behaviors and performance to take sequential actions in an environment to maximize some notion of cumulative reward. However, human decision making in real life…
We study a system with finitely many groups of multi-action bandit processes, each of which is a Markov decision process (MDP) with finite state and action spaces and potentially different transition matrices when taking different actions.…
Markov Decision Processes (MDPs) have been used to formulate many decision-making problems in science and engineering. The objective is to synthesize the best decision (action selection) policies to maximize expected rewards (or minimize…