Related papers: Quantile-Based Policy Optimization for Reinforceme…

Quantile-Based Deep Reinforcement Learning using Two-Timescale Policy Gradient Algorithms

Classical reinforcement learning (RL) aims to optimize the expected cumulative reward. In this work, we consider the RL setting where the goal is to optimize the quantile of the cumulative reward. We parameterize the policy controlling…

Machine Learning · Computer Science 2023-05-15 Jinyang Jiang, Jiaqiao Hu, Yijie Peng

Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions

Aligning large language models with pointwise absolute rewards has so far required online, on-policy algorithms such as PPO and GRPO. In contrast, simpler methods that can leverage offline or off-policy data, such as DPO and REBEL, are…

Machine Learning · Computer Science 2025-12-02 Simon Matrenok, Skander Moalla, Caglar Gulcehre

Quantile Constrained Reinforcement Learning: A Reinforcement Learning Framework Constraining Outage Probability

Constrained reinforcement learning (RL) is an area of RL whose objective is to find an optimal policy that maximizes expected cumulative return while satisfying a given constraint. Most of the previous constrained RL works consider expected…

Machine Learning · Computer Science 2022-11-29 Whiyoung Jung, Myungsik Cho, Jongeui Park, Youngchul Sung

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in…

Machine Learning · Computer Science 2026-05-14 Asim Osman, Sasha Abramowitz, Mark Bergh, Ulrich Armel Mbou Sob, Ruan John de Kock, Omayma Mahjoub, Oussama Hidaoui, Noah De Nicola, Arnol Manuel Fokam, Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Siddarth Singh, Refiloe Shabe, Juan Claude Formanek, Simon Verster Du Toit, Arnu Pretorius

From Classical Data to Quantum Advantage -- Quantum Policy Evaluation on Quantum Hardware

Quantum policy evaluation (QPE) is a reinforcement learning (RL) algorithm which is quadratically more efficient than an analogous classical Monte Carlo estimation. It makes use of a direct quantum mechanical realization of a finite Markov…

Quantum Physics · Physics 2025-09-10 Daniel Hein, Simon Wiedemann, Markus Baumann, Patrik Felbinger, Justin Klein, Maximilian Schieder, Jonas Stein, Daniëlle Schuman, Thomas Cope, Steffen Udluft

On-Policy RL with Optimal Reward Baseline

Reinforcement learning algorithms are fundamental to align large language models with human preferences and to enhance their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability…

Machine Learning · Computer Science 2025-06-05 Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei

Reward Constrained Policy Optimization

Solving tasks in Reinforcement Learning is no easy feat. As the goal of the agent is to maximize the accumulated reward, it often learns to exploit loopholes and misspecifications in the reward signal resulting in unwanted behavior. While…

Machine Learning · Computer Science 2018-12-27 Chen Tessler, Daniel J. Mankowitz, Shie Mannor

Reinforcement Learning Based Quantum Circuit Optimization via ZX-Calculus

We propose a novel Reinforcement Learning (RL) method for optimizing quantum circuits using graph-theoretic simplification rules of ZX-diagrams. The agent, trained using the Proximal Policy Optimization (PPO) algorithm, employs Graph Neural…

Quantum Physics · Physics 2025-06-04 Jordi Riu, Jan Nogué, Gerard Vilaplana, Artur Garcia-Saez, Marta P. Estarellas

PPO-Q: Proximal Policy Optimization with Parametrized Quantum Policies or Values

Quantum machine learning (QML), which combines quantum computing with machine learning, is widely believed to hold the potential to outperform traditional machine learning in the era of noisy intermediate-scale quantum (NISQ). As one of the…

Quantum Physics · Physics 2025-01-14 Yu-Xin Jin, Zi-Wei Wang, Hong-Ze Xu, Wei-Feng Zhuang, Meng-Jun Hu, Dong E. Liu

Q-Policy: Quantum-Enhanced Policy Evaluation for Scalable Reinforcement Learning

We propose Q-Policy, a hybrid quantum-classical reinforcement learning (RL) framework that mathematically accelerates policy evaluation and optimization by exploiting quantum computing primitives. Q-Policy encodes value functions in quantum…

Machine Learning · Computer Science 2025-06-10 Kalyan Cherukuri, Aarav Lala, Yash Yardi

KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF

Proximal Policy Optimisation (PPO) is an established and effective policy gradient algorithm used for Language Model Reinforcement Learning from Human Feedback (LM-RLHF). PPO performs well empirically but has a heuristic motivation and…

Computation and Language · Computer Science 2025-08-26 Jason R Brown, Lennie Wells, Edward James Young, Sergio Bacallado

Preference Optimization for Combinatorial Optimization Problems

Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn heuristics that solve complex problems without requiring expert knowledge. Despite significant progress, existing RL…

Machine Learning · Computer Science 2025-05-14 Mingjun Pan, Guanquan Lin, You-Wei Luo, Bin Zhu, Zhien Dai, Lijun Sun, Chun Yuan

Quantum Algorithms for Reinforcement Learning with a Generative Model

Reinforcement learning studies how an agent should interact with an environment to maximize its cumulative reward. A standard way to study this question abstractly is to ask how many samples an agent needs from the environment to learn an…

Quantum Physics · Physics 2021-12-21 Daochen Wang, Aarthi Sundaram, Robin Kothari, Ashish Kapoor, Martin Roetteler

Reinforcement-Learning-Based Variational Quantum Circuits Optimization for Combinatorial Problems

Quantum computing exploits basic quantum phenomena such as state superposition and entanglement to perform computations. The Quantum Approximate Optimization Algorithm (QAOA) is arguably one of the leading quantum algorithms that can…

Machine Learning · Computer Science 2022-06-16 Sami Khairy, Ruslan Shaydulin, Lukasz Cincio, Yuri Alexeev, Prasanna Balaprakash

Behavior Proximal Policy Optimization

Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly due to the overestimation of out-of-distribution state-action pairs. Thus, various additional augmentations are…

Machine Learning · Computer Science 2023-02-23 Zifeng Zhuang, Kun Lei, Jinxin Liu, Donglin Wang, Yilang Guo

Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn…

Machine Learning · Computer Science 2026-01-27 Junbo Li, Peng Zhou, Rui Meng, Meet P. Vadera, Lihong Li, Yang Li

Towards an Understanding of Default Policies in Multitask Policy Optimization

Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms with strong performance across multiple domains. In this family of methods, agents are trained to maximize…

Machine Learning · Computer Science 2022-03-24 Ted Moskovitz, Michael Arbel, Jack Parker-Holder, Aldo Pacchiano

Bounded Ratio Reinforcement Learning

Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying…

Machine Learning · Computer Science 2026-05-01 Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd, Aline Czarnobai, Philipp Fürnstahl, Bernhard Schölkopf, Andreas Krause

Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization

Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. It has been verified that utilizing diffusion policies can significantly improve the performance of RL…

Machine Learning · Computer Science 2024-12-17 Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, Ye Shi

Constrained Policy Optimization

For many applications of reinforcement learning it can be more convenient to specify both a reward function and constraints, rather than trying to design behavior through the reward function. For example, systems that physically interact…

Machine Learning · Computer Science 2017-05-31 Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel