Related papers: Execution-Verified Reinforcement Learning for Opti…

RoVer: Robot Reward Model as Test-Time Verifier for Vision-Language-Action Model

Vision-Language-Action (VLA) models have become a prominent paradigm for embodied intelligence, yet further performance improvements typically rely on scaling up training data and model size -- an approach that is prohibitively expensive…

Robotics · Computer Science 2025-10-15 Mingtong Dai, Lingbo Liu, Yongjie Bai, Yang Liu, Zhouxia Wang, Rui SU, Chunjie Chen, Liang Lin, Xinyu Wu

Beyond Outcome Verification: Verifiable Process Reward Models for Structured Reasoning

Recent work on reinforcement learning with verifiable rewards (RLVR) has shown that large language models (LLMs) can be substantially improved using outcome-level verification signals, such as unit tests for code or exact-match checks for…

Computation and Language · Computer Science 2026-01-27 Massimiliano Pronesti, Anya Belz, Yufang Hou

Offline Reinforcement Learning with Value-based Episodic Memory

Offline reinforcement learning (RL) shows promise of applying RL to real-world problems by effectively utilizing previously collected data. Most existing offline RL algorithms use regularization or constraints to suppress extrapolation…

Machine Learning · Computer Science 2021-10-20 Xiaoteng Ma, Yiqin Yang, Hao Hu, Qihan Liu, Jun Yang, Chongjie Zhang, Qianchuan Zhao, Bin Liang

EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning

Reinforcement learning with verifiable reward (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when the rollout accuracy is low on hard problems, the…

Machine Learning · Computer Science 2026-04-21 Huanyu Liu, Jia Li, Yihong Dong, Chang Yu, Taozhi Chen, Lecheng Wang, Yongding Tao, Bin Gu, Ge Li

Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values

We propose Reinforcement Learning with Explicit Human Values (RLEV), a method that aligns Large Language Model (LLM) optimization directly with quantifiable human value signals. While Reinforcement Learning with Verifiable Rewards (RLVR)…

Machine Learning · Computer Science 2025-10-24 Dian Yu, Yulai Zhao, Kishan Panaganti, Linfeng Song, Haitao Mi, Dong Yu

Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning

Reinforcement learning with verifiable rewards (RLVR) has demonstrated superior performance in enhancing the reasoning capability of large language models (LLMs). However, this accuracy-oriented learning paradigm often suffers from entropy…

Artificial Intelligence · Computer Science 2026-01-19 Hongye Cao, Zhixin Bai, Ziyue Peng, Boyan Wang, Tianpei Yang, Jing Huo, Yuyao Zhang, Yang Gao

ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models

Large Vision-Language Models (LVLMs) have recently advanced robotic manipulation by leveraging vision for scene perception and language for instruction following. However, existing methods rely heavily on costly human-annotated training…

Robotics · Computer Science 2025-05-27 Zirui Song, Guangxian Ouyang, Mingzhe Li, Yuheng Ji, Chenxi Wang, Zixiang Xu, Zeyu Zhang, Xiaoqing Zhang, Qian Jiang, Zhenhao Chen, Zhongzhi Li, Rui Yan, Xiuying Chen

Self-Execution Simulation Improves Coding Models

A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program execution, particularly for code they generate. In this work, we demonstrate that Code…

Computation and Language · Computer Science 2026-04-07 Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, Yossi Adi

ExecVerify: White-Box RL with Verifiable Stepwise Rewards for Code Execution Reasoning

Code LLMs still struggle with code execution reasoning, especially in smaller models. Existing methods rely on supervised fine-tuning (SFT) with teacher-generated explanations, primarily in two forms: (1) input-output (I/O) prediction…

Software Engineering · Computer Science 2026-03-13 Lingxiao Tang, He Ye, Zhaoyang Chu, Muyang Ye, Zhongxin Liu, Xiaoxue Ren, Lingfeng Bao

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their…

Computer Vision and Pattern Recognition · Computer Science 2026-03-16 Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated great potential in enhancing the reasoning capabilities of large language models (LLMs). However, its success has thus far been largely confined to the mathematical and…

Artificial Intelligence · Computer Science 2026-02-05 Mengyu Zhang, Siyu Ding, Weichong Yin, Yu Sun, Hua Wu

Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models

Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a…

Computer Vision and Pattern Recognition · Computer Science 2026-05-12 Junxin Wang, Dai Guan, Weijie Qiu, Zhihang Li, Yongbo Gai, Zhengyi Yang, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics

Language models encode substantial evaluative knowledge from pretraining, yet current post-training methods rely on external supervision (human annotations, proprietary models, or scalar reward models) to produce reward signals. Each…

Artificial Intelligence · Computer Science 2026-05-06 Shuyue Stella Li, Rui Xin, Teng Xiao, Yike Wang, Rulin Shao, Zoey Hao, Melanie Sclar, Sewoong Oh, Faeze Brahman, Pang Wei Koh, Yulia Tsvetkov

Proof-RM: A Scalable and Generalizable Reward Model for Math Proof

While Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with Verifiable Rewards (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine…

Computation and Language · Computer Science 2026-02-20 Haotong Yang, Zitong Wang, Shijia Kang, Siqi Yang, Wenkai Yu, Xu Niu, Yike Sun, Yi Hu, Zhouchen Lin, Muhan Zhang

Efficient Reinforcement Learning for Large Language Models with Intrinsic Exploration

Reinforcement learning with verifiable rewards (RLVR) has improved the reasoning ability of large language models, yet training remains costly because many rollouts contribute little to optimization, considering the amount of computation…

Machine Learning · Computer Science 2026-02-20 Yan Sun, Jia Guo, Stanley Kok, Zihao Wang, Zujie Wen, Zhiqiang Zhang

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

Self-evolution of multimodal large language models (MLLMs) remains a critical challenge: pseudo-label-based methods suffer from progressive quality degradation as model predictions drift, while template-based methods are confined to a…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Yongrui Heng, Chaoya Jiang, Han Yang, Shikun Zhang, Wei Ye

RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other…

Computation and Language · Computer Science 2026-03-05 Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li

Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling

Optimization modeling is fundamental to decision-making across diverse domains. Despite progress in automating optimization formulation from natural language descriptions, Large Language Models (LLMs) often struggle to generate formally…

Artificial Intelligence · Computer Science 2025-12-23 Yitian Chen, Jingfan Xia, Siyu Shao, Dongdong Ge, Yinyu Ye

Towards Execution-Grounded Automated AI Research

Automated AI research holds great potential to accelerate scientific discovery. However, current LLMs often generate plausible-looking but ineffective ideas. Execution grounding may help, but it is unclear whether automated execution is…

Computation and Language · Computer Science 2026-01-22 Chenglei Si, Zitong Yang, Yejin Choi, Emmanuel Candès, Diyi Yang, Tatsunori Hashimoto

Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward

Recent advances in large reasoning models have leveraged reinforcement learning with verifiable rewards (RLVR) to improve reasoning capabilities. However, scaling these methods typically requires extensive rollout computation and large…

Machine Learning · Computer Science 2025-09-03 Xinyu Tang, Zhenduo Zhang, Yurou Liu, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou