English
Related papers

Related papers: Reparameterization Flow Policy Optimization

200 papers

By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: the under-utilization of computationally expensive…

Machine Learning · Computer Science 2026-02-09 Hai Zhong , Xun Wang , Zhuoran Li , Longbo Huang

We investigate the challenge of parametrizing policies for reinforcement learning (RL) in high-dimensional continuous action spaces. Our objective is to develop a multimodal policy that overcomes limitations inherent in the commonly-used…

Machine Learning · Computer Science 2023-07-21 Zhiao Huang , Litian Liang , Zhan Ling , Xuanlin Li , Chuang Gan , Hao Su

Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm…

Machine Learning · Computer Science 2025-08-04 David McAllister , Songwei Ge , Brent Yi , Chung Min Kim , Ethan Weber , Hongsuk Choi , Haiwen Feng , Angjoo Kanazawa

Flow-matching policies have emerged as a powerful paradigm for generalist robotics. These models are trained to imitate an action chunk, conditioned on sensor observations and textual instructions. Often, training demonstrations are…

Machine Learning · Computer Science 2025-07-22 Samuel Pfrommer , Yixiao Huang , Somayeh Sojoudi

We propose Flow-GRPO, the first method to integrate online policy gradient reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Jie Liu , Gongye Liu , Jiajun Liang , Yangguang Li , Jiaheng Liu , Xintao Wang , Pengfei Wan , Di Zhang , Wanli Ouyang

The policy gradient method enjoys the simplicity of the objective where the agent optimizes the cumulative reward directly. Moreover, in the continuous action domain, parameterized distribution of action distribution allows easy control of…

Machine Learning · Computer Science 2022-12-16 Md Masudur Rahman , Yexiang Xue

Interest in derivative-free optimization (DFO) and "evolutionary strategies" (ES) has recently surged in the Reinforcement Learning (RL) community, with growing evidence that they can match state of the art methods for policy optimization…

Among on-policy reinforcement learning algorithms, Proximal Policy Optimization (PPO) demonstrates is widely favored for its simplicity, numerical stability, and strong empirical performance. Standard PPO relies on surrogate objectives…

Machine Learning · Computer Science 2026-02-03 Shunpeng Yang , Ben Liu , Hua Chen

Score-function based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Using pathwise policy gradients, i.e.…

ReParameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics. However, recent studies have revealed that, when applied to long-term reinforcement learning…

Machine Learning · Computer Science 2023-11-01 Shenao Zhang , Boyi Liu , Zhaoran Wang , Tuo Zhao

Reinforcement Learning (RL) has proven highly effective in addressing complex control and decision-making tasks. However, in most traditional RL algorithms, the policy is typically parameterized as a diagonal Gaussian distribution, which…

Machine Learning · Computer Science 2026-04-02 Ruijie Hao , Longfei Zhang , Yang Dai , Yang Ma , Xingxing Liang , Guangquan Cheng

Vision-Language-Action (VLA) models such as OpenVLA, Octo, and $\pi_0$ have shown strong generalization by leveraging large-scale demonstrations, yet their performance is still fundamentally constrained by the quality and coverage of…

Machine Learning · Computer Science 2025-10-14 Mingyang Lyu , Yinqian Sun , Erliang Lin , Huangrui Li , Ruolin Chen , Feifei Zhao , Yi Zeng

Recent advances in constrained reinforcement learning (RL) have endowed reinforcement learning with certain safety guarantees. However, deploying existing constrained RL algorithms in continuous control tasks with general hard constraints…

Machine Learning · Computer Science 2023-12-22 Shutong Ding , Jingya Wang , Yali Du , Ye Shi

Hyperparameter optimization (HPO) plays a critical role in improving model performance. Transformer-based HPO methods have shown great potential; however, existing approaches rely heavily on large-scale historical optimization trajectories…

Machine Learning · Computer Science 2025-09-23 Haoxin Guo , Jiawen Pan , Weixin Zhai

While Retrieval-Augmented Generation (RAG) has exhibited promise in utilizing external knowledge, its generation process heavily depends on the quality and accuracy of the retrieved context. Large language models (LLMs) struggle to evaluate…

Computation and Language · Computer Science 2025-10-13 Shi-Qi Yan , Quan Liu , Zhen-Hua Ling

We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by…

Machine Learning · Computer Science 2026-04-17 Haoran Xu , Kaiwen Hu , Somayeh Sojoudi , Amy Zhang

Single-trajectory reinforcement learning (RL) methods aim to optimize policies from datasets consisting of (prompt, response, reward) triplets, where scalar rewards are directly available. This supervision format is highly practical, as it…

Machine Learning · Computer Science 2025-12-23 Bilal Faye , Hanane Azzag , Mustapha Lebbah

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from…

Machine Learning · Computer Science 2026-03-19 Ziyan Wang , Zheng Wang , Xingwei Qu , Qi Cheng , Jie Fu , Shengpu Tang , Minjia Zhang , Xiaoming Huo

Recent advances in large reasoning models have leveraged reinforcement learning with verifiable rewards (RLVR) to improve reasoning capabilities. However, scaling these methods typically requires extensive rollout computation and large…

Machine Learning · Computer Science 2025-09-03 Xinyu Tang , Zhenduo Zhang , Yurou Liu , Wayne Xin Zhao , Zujie Wen , Zhiqiang Zhang , Jun Zhou

Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low…

Computation and Language · Computer Science 2025-06-12 Siheng Li , Zhanhui Zhou , Wai Lam , Chao Yang , Chaochao Lu
‹ Prev 1 2 3 10 Next ›