Related papers: Reparameterization Flow Policy Optimization

Reparameterization Proximal Policy Optimization

By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: the under-utilization of computationally expensive…

Machine Learning · Computer Science 2026-02-09 Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang

Reparameterized Policy Learning for Multimodal Trajectory Optimization

We investigate the challenge of parametrizing policies for reinforcement learning (RL) in high-dimensional continuous action spaces. Our objective is to develop a multimodal policy that overcomes limitations inherent in the commonly-used…

Machine Learning · Computer Science 2023-07-21 Zhiao Huang, Litian Liang, Zhan Ling, Xuanlin Li, Chuang Gan, Hao Su

Flow Matching Policy Gradients

Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm…

Machine Learning · Computer Science 2025-08-04 David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, Angjoo Kanazawa

Reinforcement Learning for Flow-Matching Policies

Flow-matching policies have emerged as a powerful paradigm for generalist robotics. These models are trained to imitate an action chunk, conditioned on sensor observations and textual instructions. Often, training demonstrations are…

Machine Learning · Computer Science 2025-07-22 Samuel Pfrommer, Yixiao Huang, Somayeh Sojoudi

Flow-GRPO: Training Flow Matching Models via Online RL

We propose Flow-GRPO, the first method to integrate online policy gradient reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang

Robust Policy Optimization in Deep Reinforcement Learning

The policy gradient method enjoys the simplicity of an objective in which the agent optimizes the cumulative reward directly. Moreover, in the continuous action domain, a parameterized action distribution allows easy control of…

Machine Learning · Computer Science 2022-12-16 Md Masudur Rahman, Yexiang Xue

Provably Robust Blackbox Optimization for Reinforcement Learning

Interest in derivative-free optimization (DFO) and "evolutionary strategies" (ES) has recently surged in the Reinforcement Learning (RL) community, with growing evidence that they can match state of the art methods for policy optimization…

Machine Learning · Computer Science 2019-07-09 Krzysztof Choromanski, Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Deepali Jain, Yuxiang Yang, Atil Iscen, Jasmine Hsu, Vikas Sindhwani

PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning

Among on-policy reinforcement learning algorithms, Proximal Policy Optimization (PPO) is widely favored for its simplicity, numerical stability, and strong empirical performance. Standard PPO relies on surrogate objectives…

Machine Learning · Computer Science 2026-02-03 Shunpeng Yang, Ben Liu, Hua Chen

Relative Entropy Pathwise Policy Optimization

Score-function based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Using pathwise policy gradients, i.e.…

Machine Learning · Computer Science 2026-04-14 Claas Voelcker, Axel Brunnbauer, Marcel Hussing, Michal Nauman, Pieter Abbeel, Eric Eaton, Radu Grosu, Amir-massoud Farahmand, Igor Gilitschenski

Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms

ReParameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics. However, recent studies have revealed that, when applied to long-term reinforcement learning…

Machine Learning · Computer Science 2023-11-01 Shenao Zhang, Boyi Liu, Zhaoran Wang, Tuo Zhao

Flow-based Policy With Distributional Reinforcement Learning in Trajectory Optimization

Reinforcement Learning (RL) has proven highly effective in addressing complex control and decision-making tasks. However, in most traditional RL algorithms, the policy is typically parameterized as a diagonal Gaussian distribution, which…

Machine Learning · Computer Science 2026-04-02 Ruijie Hao, Longfei Zhang, Yang Dai, Yang Ma, Xingxing Liang, Guangquan Cheng

Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models

Vision-Language-Action (VLA) models such as OpenVLA, Octo, and $\pi_0$ have shown strong generalization by leveraging large-scale demonstrations, yet their performance is still fundamentally constrained by the quality and coverage of…

Machine Learning · Computer Science 2025-10-14 Mingyang Lyu, Yinqian Sun, Erliang Lin, Huangrui Li, Ruolin Chen, Feifei Zhao, Yi Zeng

Reduced Policy Optimization for Continuous Control with Hard Constraints

Recent advances in constrained reinforcement learning (RL) have endowed reinforcement learning with certain safety guarantees. However, deploying existing constrained RL algorithms in continuous control tasks with general hard constraints…

Machine Learning · Computer Science 2023-12-22 Shutong Ding, Jingya Wang, Yali Du, Ye Shi

GRPOformer: Advancing Hyperparameter Optimization via Group Relative Policy Optimization

Hyperparameter optimization (HPO) plays a critical role in improving model performance. Transformer-based HPO methods have shown great potential; however, existing approaches rely heavily on large-scale historical optimization trajectories…

Machine Learning · Computer Science 2025-09-23 Haoxin Guo, Jiawen Pan, Weixin Zhai

RPO: Retrieval Preference Optimization for Robust Retrieval-Augmented Generation

While Retrieval-Augmented Generation (RAG) has exhibited promise in utilizing external knowledge, its generation process heavily depends on the quality and accuracy of the retrieved context. Large language models (LLMs) struggle to evaluate…

Computation and Language · Computer Science 2025-10-13 Shi-Qi Yan, Quan Liu, Zhen-Hua Ling

Reinforcement Learning via Value Gradient Flow

We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by…

Machine Learning · Computer Science 2026-04-17 Haoran Xu, Kaiwen Hu, Somayeh Sojoudi, Amy Zhang

Value-Free Policy Optimization via Reward Partitioning

Single-trajectory reinforcement learning (RL) methods aim to optimize policies from datasets consisting of (prompt, response, reward) triplets, where scalar rewards are directly available. This supervision format is highly practical, as it…

Machine Learning · Computer Science 2025-12-23 Bilal Faye, Hanane Azzag, Mustapha Lebbah

Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from…

Machine Learning · Computer Science 2026-03-19 Ziyan Wang, Zheng Wang, Xingwei Qu, Qi Cheng, Jie Fu, Shengpu Tang, Minjia Zhang, Xiaoming Huo

Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward

Recent advances in large reasoning models have leveraged reinforcement learning with verifiable rewards (RLVR) to improve reasoning capabilities. However, scaling these methods typically requires extensive rollout computation and large…

Machine Learning · Computer Science 2025-09-03 Xinyu Tang, Zhenduo Zhang, Yurou Liu, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou

RePO: Replay-Enhanced Policy Optimization

Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low…

Computation and Language · Computer Science 2025-06-12 Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu