Related papers: Value-Based Deep RL Scales Predictably

Off-Policy Value-Based Reinforcement Learning for Large Language Models

Improving data utilization efficiency is critical for scaling reinforcement learning (RL) for long-horizon tasks where generating trajectories is expensive. However, the dominant RL methods for LLMs are largely on-policy: they update each…

Machine Learning · Computer Science 2026-03-25 Peng-Yuan Wang, Ziniu Li, Tian Xu, Bohan Yang, Tian-Shuo Liu, ChenYang Wang, Xiong-Hui Chen, Yi-Chen Li, Tianyun Yang, Congliang Chen, Yang Yu

Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

Large language models (LLMs) excel in tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, require additional long-horizon reasoning and planning. Reinforcement learning…

Computation and Language · Computer Science 2025-12-04 Joey Hong, Anca Dragan, Sergey Levine

A Nonparametric Off-Policy Policy Gradient

Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes. The need for intensive interactions with the environment is especially observed in many widely popular policy gradient…

Machine Learning · Computer Science 2020-08-04 Samuele Tosatto, Joao Carvalho, Hany Abdulsamad, Jan Peters

Compute-Optimal Scaling for Value-Based Deep RL

As models grow larger and training them becomes expensive, it becomes increasingly important to scale training recipes not just to larger models and more data, but to do so in a compute-optimal manner that extracts maximal performance per…

Machine Learning · Computer Science 2025-08-26 Preston Fu, Oleh Rybkin, Zhiyuan Zhou, Michal Nauman, Pieter Abbeel, Sergey Levine, Aviral Kumar

LLMs Can Learn to Reason Via Off-Policy RL

Reinforcement learning (RL) approaches for Large Language Models (LLMs) frequently use on-policy algorithms, such as PPO or GRPO. However, policy lag from distributed training architectures and differences between the training and inference…

Machine Learning · Computer Science 2026-03-03 Daniel Ritter, Owen Oertell, Bradley Guo, Jonathan Chang, Kianté Brantley, Wen Sun

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Predicting changes from scaling advanced AI systems is a desirable property for engineers, economists, governments and industry alike, and, while a well-established literature exists on how pretraining performance scales, predictable…

Machine Learning · Computer Science 2025-02-07 Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo

Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

The escalating scale and cost of Large Language Models (LLMs) training necessitate accurate pre-training prediction of downstream task performance for comprehensive understanding of scaling properties. This is challenged by: 1) the…

Computation and Language · Computer Science 2026-03-10 Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, Chenggang Li

Language models scale reliably with over-training and on downstream tasks

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are…

Computation and Language · Computer Science 2024-06-18 Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

Learning Off-policy with Model-based Intrinsic Motivation For Active Online Exploration

Recent advancements in deep reinforcement learning (RL) have demonstrated notable progress in sample efficiency, spanning both model-based and model-free paradigms. Despite the identification and mitigation of specific bottlenecks in prior…

Machine Learning · Computer Science 2024-04-02 Yibo Wang, Jiang Zhao

RAPID: An Efficient Reinforcement Learning Algorithm for Small Language Models

Reinforcement learning (RL) has emerged as a promising strategy for finetuning small language models (SLMs) to solve targeted tasks such as math and coding. However, RL algorithms tend to be resource-intensive, taking a significant amount…

Machine Learning · Computer Science 2025-10-07 Lianghuan Huang, Sagnik Anupam, Insup Lee, Shuo Li, Osbert Bastani

The Art of Scaling Reinforcement Learning Compute for LLMs

Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is…

Machine Learning · Computer Science 2025-10-16 Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, Rishabh Agarwal

Prescriptive Scaling Reveals the Evolution of Language Model Capabilities

For deploying foundation models, practitioners increasingly need prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping…

Machine Learning · Computer Science 2026-02-18 Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three types of approaches have been widely adopted: The first relies on a deep neural network…

Machine Learning · Computer Science 2026-05-19 Shijin Gong, Kai Ye, Jin Zhu, Xinyu Zhang, Hongyi Zhou, Chengchun Shi

Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient

Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods either suffer from high…

Machine Learning · Computer Science 2021-06-09 Samuele Tosatto, João Carvalho, Jan Peters

A Dataset Perspective on Offline Reinforcement Learning

The application of Reinforcement Learning (RL) in real world environments can be expensive or risky due to sub-optimal policies during training. In Offline RL, this problem is avoided since interactions with an environment are prohibited.…

Machine Learning · Computer Science 2022-07-13 Kajetan Schweighofer, Andreas Radler, Marius-Constantin Dinu, Markus Hofmarcher, Vihang Patil, Angela Bitto-Nemling, Hamid Eghbal-zadeh, Sepp Hochreiter

Scaling Test-Time Compute Without Verification or RL is Suboptimal

Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: first, distilling…

Machine Learning · Computer Science 2025-02-19 Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar

Policy Learning from Large Vision-Language Model Feedback without Reward Modeling

Offline reinforcement learning (RL) provides a powerful framework for training robotic agents using pre-collected, suboptimal datasets, eliminating the need for costly, time-consuming, and potentially hazardous online interactions. This is…

Machine Learning · Computer Science 2025-08-01 Tung M. Luu, Donghoon Lee, Younghwan Lee, Chang D. Yoo

Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct…

Machine Learning · Computer Science 2025-12-10 Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram

Statistically Efficient Off-Policy Policy Gradients

Policy gradient methods in reinforcement learning update policy parameters by taking steps in the direction of an estimated gradient of policy value. In this paper, we consider the statistically efficient estimation of policy gradients from…

Machine Learning · Statistics 2020-02-21 Nathan Kallus, Masatoshi Uehara

Rethinking Scale: The Efficacy of Fine-Tuned Open-Source LLMs in Large-Scale Reproducible Social Science Research

Large Language Models (LLMs) are distinguished by their architecture, which dictates their parameter size and performance capabilities. Social scientists have increasingly adopted LLMs for text classification tasks, which are difficult to…

Computation and Language · Computer Science 2024-11-05 Marcello Carammia, Stefano Maria Iacus, Giuseppe Porro