Related papers: Exploration-Driven Optimization for Test-Time Larg…

200 papers

Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for…

Artificial Intelligence · Computer Science 2026-05-06 Ifdita Hasan Orney, Jubayer Ibn Hamid, Shreya S Ramanujam, Shirley Wu, Hengyuan Hu, Noah Goodman, Dorsa Sadigh, Chelsea Finn

Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with…

Computation and Language · Computer Science 2025-07-29 Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao

Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods primarily rely on outcome-based supervision to strengthen internal LLM reasoning,…

Artificial Intelligence · Computer Science 2026-05-29 Siyao Song, Cong Ma, Zhihao Cheng, Shiye Lei, Minghao Li, Ying Zeng, Huaixiao Tou, Kai Jia

Inference scaling further accelerates Large Language Models (LLMs) toward Artificial General Intelligence (AGI), with large-scale Reinforcement Learning (RL) to unleash long Chain-of-Thought reasoning. Most contemporary reasoning approaches…

Machine Learning · Computer Science 2025-05-20 Xuerui Su, Liya Guo, Yue Wang, Yi Zhu, Zhiming Ma, Zun Wang, Yuting Liu

Large Language Models (LLMs) have become integral components in various autonomous agent systems. In this study, we present an exploration-based trajectory optimization approach, referred to as ETO. This learning method is designed to…

Computation and Language · Computer Science 2024-07-11 Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin

Self-improvement via RL often fails on complex reasoning tasks because GRPO-style post-training methods rely on the model's initial ability to generate positive samples. Without guided exploration, these approaches merely reinforce what the…

Machine Learning · Computer Science 2026-01-28 Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi

Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), a lightweight variant of Proximal Policy Optimization (PPO),…

Machine Learning · Computer Science 2025-10-13 Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yuzhi Zhang, Yue Wang

Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging…

Machine Learning · Computer Science 2026-05-26 Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, Fan Lai

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong…

Computation and Language · Computer Science 2025-11-10 Chenxi Liu, Junjie Liang, Yuqi Jia, Bochuan Cao, Yang Bai, Heng Huang, Xun Chen

Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to…

Machine Learning · Computer Science 2026-02-24 Yuchen Zhu, Wei Guo, Jaemoo Choi, Petr Molodyk, Bo Yuan, Molei Tao, Yongxin Chen

Large Language Models (LLMs) have shown impressive reasoning capabilities in well-defined problems with clear solutions, such as mathematics and coding. However, they still struggle with complex real-world scenarios like business…

Computation and Language · Computer Science 2025-05-29 Xiaoqian Liu, Ke Wang, Yongbin Li, Yuchuan Wu, Wentao Ma, Aobo Kong, Fei Huang, Jianbin Jiao, Junge Zhang

Reinforcement Learning with Verifiable Rewards (RLVR), primarily driven by the Group Relative Policy Optimization (GRPO) algorithm, is a leading approach for enhancing the reasoning abilities of Large Language Models (LLMs). Despite its…

Machine Learning · Computer Science 2025-10-21 Kangqi Ni, Zhen Tan, Zijie Liu, Pingzhi Li, Tianlong Chen

The effective training of Large Language Models (LLMs) for function calling faces a critical challenge: balancing exploration of complex reasoning paths with stable policy optimization. Standard methods like Supervised Fine-Tuning (SFT)…

While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by sparsity of outcome rewards. This limitation fails to penalize ungrounded "lucky guesses," leaving the critical process of…

Artificial Intelligence · Computer Science 2026-04-21 Xin Guan, Zijian Li, Shen Huang, Pengjun Xie, Jingren Zhou, Jiuxin Cao

Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy…

Machine Learning · Computer Science 2026-01-28 Kishan Panaganti, Zhenwen Liang, Wenhao Yu, Haitao Mi, Dong Yu

Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy…

Computation and Language · Computer Science 2026-04-14 Liang Chen, Xueting Han, Qizhou Wang, Bo Han, Jing Bai, Hinrich Schutze, Kam-Fai Wong

Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning…

Machine Learning · Computer Science 2026-02-12 Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng

Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this…

Machine Learning · Computer Science 2025-10-14 Jens Tuyls, Dylan J. Foster, Akshay Krishnamurthy, Jordan T. Ash

Recent advancements in reasoning-focused language models such as OpenAI's O1 and DeepSeek-R1 have shown that scaling test-time computation (through chain-of-thought reasoning and iterative exploration) can yield substantial improvements on…

Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood…

Computation and Language · Computer Science 2025-12-04 Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie, Stefano Ermon, Yi Wu, Chongxuan Li