Related papers: Efficient iterative policy optimization
Solving tasks in Reinforcement Learning is no easy feat. As the goal of the agent is to maximize the accumulated reward, it often learns to exploit loopholes and misspecifications in the reward signal resulting in unwanted behavior. While…
We consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm, Projection-Based Constrained Policy…
A key problem in reinforcement learning for control with general function approximators (such as deep neural networks and other nonlinear functions) is that, for many algorithms employed in practice, updates to the policy or $Q$-function…
We address a practical problem ubiquitous in modern marketing campaigns, in which a central agent tries to learn a policy for allocating strategic financial incentives to customers and observes only bandit feedback. In contrast to…
The dynamic portfolio optimization problem in finance frequently requires learning policies that adhere to various constraints, driven by investor preferences and risk. We motivate this problem of finding an allocation policy within a…
Constrained decision-making is essential for designing safe policies in real-world control systems, yet simulated environments often fail to capture real-world adversities. We consider the problem of learning a policy that will maximize the…
The note studies the problem of selecting a good enough subset out of a finite number of alternatives under a fixed simulation budget. Our work aims to maximize the posterior probability of correctly selecting a good subset. We formulate…
Affine policies (or control) are widely used as a solution approach in dynamic optimization where computing an optimal adjustable solution is usually intractable. While the worst case performance of affine policies can be significantly bad,…
We consider the problem of reinforcement learning when provided with (1) a baseline control policy and (2) a set of constraints that the learner must satisfy. The baseline policy can arise from demonstration data or a teacher agent and may…
Searching the space of policies directly for the optimal policy has been one popular method for solving partially observable reinforcement learning problems. Typically, with each change of the target policy, its value is estimated from the…
Empirical researchers and decision-makers spanning various domains frequently seek profound insights into the long-term impacts of interventions. While the significance of long-term outcomes is undeniable, an overemphasis on them may…
An iterative optimization approach that simultaneously minimizes the energy and optimizes the Lagrange multipliers enforcing desired constraints is presented. The method is tested on previously established benchmark systems and it is proved…
Recent policy optimization approaches (Schulman et al., 2015a; 2017) have achieved substantial empirical successes by constructing new proxy optimization objectives. These proxy objectives allow stable and low variance policy learning, but…
Recommender systems predict what items a user will interact with next, based on their past interactions. The problem is often approached through supervised learning, but recent advancements have shifted towards policy optimization of…
Policy evaluation estimates the performance of a policy by (1) collecting data from the environment and (2) processing raw data into a meaningful estimate. Due to the sequential nature of reinforcement learning, any improper data-collecting…
Markov decision processes are widely used for planning and verification in settings that combine controllable or adversarial choices with probabilistic behaviour. The standard analysis algorithm, value iteration, only provides a lower bound…
Direct policy optimization in reinforcement learning is usually solved with policy-gradient algorithms, which optimize policy parameters via stochastic gradient ascent. This paper provides a new theoretical interpretation and justification…
We investigate constrained optimal control problems for linear stochastic dynamical systems evolving in discrete time. We consider minimization of an expected value cost over a finite horizon. Hard constraints are introduced first, and then…
Off-policy evaluation provides an essential tool for evaluating the effects of different policies or treatments using only observed data. When applied to high-stakes scenarios such as medical diagnosis or financial decision-making, it is…
A data-based policy for iterative control task is presented. The proposed strategy is model-free and can be applied whenever safe input and state trajectories of a system performing an iterative task are available. These trajectories,…