Related papers: Queueing Network Controls via Deep Reinforcement L…
Novel advanced policy gradient (APG) algorithms, such as proximal policy optimization (PPO), trust region policy optimization, and their variations, have become the dominant reinforcement learning (RL) algorithms because of their ease of…
In this paper we design hybrid control policies for hybrid systems whose mathematical models are unknown. Our contributions are threefold. First, we propose a framework for modelling the hybrid control design problem as a single Markov…
This paper tackles the growing issue of excessive data transmission in networks. With increasing traffic, backhaul links and core networks are under significant traffic, leading to the investigation of caching solutions at edge routers.…
We present an efficient reinforcement learning algorithm that learns the optimal admission control policy in a partially observable queueing network. Specifically, only the arrival and departure times from the network are observable, and…
The proximal policy optimization (PPO) algorithm stands as one of the most prosperous methods in the field of reinforcement learning (RL). Despite its success, the theoretical understanding of PPO remains deficient. Specifically, it is…
Instability and slowness are two main problems in deep reinforcement learning. Even if proximal policy optimization (PPO) is the state of the art, it still suffers from these two problems. We introduce an improved algorithm based on…
Proximal policy optimization (PPO) is one of the most successful deep reinforcement-learning methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, its optimization behavior is still far from…
Efficient mobility management and load balancing are critical to sustaining Quality of Service (QoS) in dense, highly dynamic 5G radio access networks. We present a deep reinforcement learning framework based on Proximal Policy Optimization…
While reinforcement learning has been increasingly applied to stochastic control, few studies have systematically examined policy-based methods in queuing environments modeled as a semi-Markov decision process (SMDP). To address this gap,…
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.…
On-policy deep reinforcement learning algorithms have low data utilization and require significant experience for policy improvement. This paper proposes a proximal policy optimization algorithm with prioritized trajectory replay (PTR-PPO)…
We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e.,~policies that do not take the agent to undesirable situations. We formulate…
Proximal Policy Optimization (PPO) is among the most widely used deep reinforcement learning algorithms, yet its theoretical foundations remain incomplete. Most importantly, convergence and understanding of fundamental PPO advantages remain…
A wide variety of queueing systems can be naturally modeled as infinite-state Markov Decision Processes (MDPs). In the reinforcement learning (RL) context, a variety of algorithms have been developed to learn and optimize these MDPs. At the…
Multi-objective optimization models that encode ordered sequential constraints provide a solution to model various challenging problems including encoding preferences, modeling a curriculum, and enforcing measures of safety. A recently…
Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn…
To overcome the curses of dimensionality and modeling of Dynamic Programming (DP) methods to solve Markov Decision Process (MDP) problems, Reinforcement Learning (RL) methods are adopted in practice. Contrary to traditional RL algorithms…
Standard Markov decision process (MDP) and reinforcement learning algorithms optimize the policy with respect to the expected gain. We propose an algorithm which enables to optimize an alternative objective: the probability that the gain is…
Most methods in reinforcement learning use a Policy Gradient (PG) approach to learn a parametric stochastic policy that maps states to actions. The standard approach is to implement such a mapping via a neural network (NN) whose parameters…
This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models. The first contribution addresses constrained Markov…