Related papers: $\mathcal{B}$-Coder: Value-Based Deep Reinforcemen…

Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or…

Computation and Language · Computer Science 2026-03-13 Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen

Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification

Reinforcement learning (RL) algorithms assume that users specify tasks by manually writing down a reward function. However, this process can be laborious and demands considerable technical expertise. Can we devise RL algorithms that instead…

Machine Learning · Computer Science 2022-01-03 Benjamin Eysenbach, Sergey Levine, Ruslan Salakhutdinov

ACECODER: Acing Coder RL via Automated Test-Case Synthesis

Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain.…

Software Engineering · Computer Science 2025-05-27 Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen

Learning to Synthesize Programs as Interpretable and Generalizable Policies

Recently, deep reinforcement learning (DRL) methods have achieved impressive performance on tasks in a variety of domains. However, neural network policies produced with DRL methods are not human-interpretable and often have difficulty…

Machine Learning · Computer Science 2022-02-02 Dweep Trivedi, Jesse Zhang, Shao-Hua Sun, Joseph J. Lim

Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis

Program synthesis is the task of automatically generating a program consistent with a specification. Recent years have seen the proposal of a number of neural approaches for program synthesis, many of which adopt a sequence generation paradigm…

Machine Learning · Computer Science 2018-05-23 Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, Pushmeet Kohli

CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning

Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical…

Machine Learning · Computer Science 2022-11-04 Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C. H. Hoi

ReCode: Reinforcing Code Generation with Reasoning-Process Rewards

In practice, rigorous reasoning is often a key driver of correct code, while Reinforcement Learning (RL) for code generation often neglects optimizing reasoning quality. Bringing process-level supervision into RL is appealing, but it faces…

Software Engineering · Computer Science 2026-05-06 Lishui Fan, Yu Zhang, Mouxiang Chen, Zhongxin Liu

Off-Policy Value-Based Reinforcement Learning for Large Language Models

Improving data utilization efficiency is critical for scaling reinforcement learning (RL) for long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update each…

Machine Learning · Computer Science 2026-03-25 Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu, ChenYang Wang, Xiong-Hui Chen, Yi-Chen Li, Tianyun Yang, Congliang Chen, Yang Yu

B-Pref: Benchmarking Preference-Based Reinforcement Learning

Reinforcement learning (RL) requires access to a reward function that incentivizes the right behavior, but these are notoriously hard to specify for complex tasks. Preference-based RL provides an alternative: learning policies using a…

Machine Learning · Computer Science 2021-11-05 Kimin Lee, Laura Smith, Anca Dragan, Pieter Abbeel

From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning

Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical…

Machine Learning · Computer Science 2025-10-08 Yuzhen Huang, Weihao Zeng, Xingshan Zeng, Qi Zhu, Junxian He

Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization

Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy that maximizes the cumulative rewards in sequential decision making. Most methods in the existing…

Machine Learning · Statistics 2023-01-06 Chengchun Shi, Zhengling Qi, Jianing Wang, Fan Zhou

Process Supervision-Guided Policy Optimization for Code Generation

Reinforcement learning (RL) with unit test feedback has enhanced large language models' (LLMs) code generation, but relies on sparse rewards provided only after complete code evaluation, limiting learning efficiency and incremental…

Artificial Intelligence · Computer Science 2025-02-05 Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, Lin Yan

Process-Supervised Reinforcement Learning for Code Generation

Existing reinforcement learning strategies based on outcome supervision have proven effective in enhancing the performance of large language models (LLMs) for code generation. While reinforcement learning based on process supervision has…

Software Engineering · Computer Science 2025-02-05 Yufan Ye, Ting Zhang, Wenbin Jiang, Hua Huang

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Recently DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, Tiancheng Zhao

Mapping Language to Programs using Multiple Reward Components with Inverse Reinforcement Learning

Mapping natural language instructions to programs that computers can process is a fundamental challenge. Existing approaches focus on likelihood-based training or using reinforcement learning to fine-tune models based on a single reward. In…

Computation and Language · Computer Science 2021-10-05 Sayan Ghosh, Shashank Srivastava

Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs

Code-generating Large Language Models (LLMs) have become essential tools in modern software development, enhancing productivity and accelerating development. This paper aims to investigate the fine-tuning of code-generating LLMs using…

Software Engineering · Computer Science 2025-05-06 Marina Sakharova, Abhinav Anand, Mira Mezini

Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards

Large language models show strong potential for automated code generation, but lack guarantees for correctness, quality, safety, and domain-specific constraints. For instance, in robotics, where code generation is increasingly being used for…

Machine Learning · Computer Science 2026-05-21 Erfan Aghadavoodi Jolfaei, Daniel Maninger, Abhinav Anand, Mert Tiftikci, Mira Mezini

Safe Policy Improvement in Constrained Markov Decision Processes

The automatic synthesis of a policy through reinforcement learning (RL) from a given set of formal requirements depends on the construction of a reward signal and consists of the iterative application of many policy-improvement steps. The…

Machine Learning · Computer Science 2022-10-21 Luigi Berducci, Radu Grosu

Code as Reward: Empowering Reinforcement Learning with VLMs

Pre-trained Vision-Language Models (VLMs) are able to understand visual concepts, describe and decompose complex tasks into sub-tasks, and provide feedback on task completion. In this paper, we aim to leverage these capabilities to support…

Machine Learning · Computer Science 2024-02-08 David Venuto, Sami Nur Islam, Martin Klissarov, Doina Precup, Sherry Yang, Ankit Anand

Reinforcement Learning for Sequence Design Leveraging Protein Language Models

Protein sequence design, determined by amino acid sequences, is essential to protein engineering problems in drug discovery. Prior approaches have resorted to evolutionary strategies or Monte-Carlo methods for protein design, but often…

Machine Learning · Computer Science 2024-11-19 Jithendaraa Subramanian, Shivakanth Sujit, Niloy Irtisam, Umong Sain, Riashat Islam, Derek Nowrouzezahrai, Samira Ebrahimi Kahou