Related papers: Optimistic Multi-Agent Policy Gradient

Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning

In decentralized multi-agent reinforcement learning, agents learning in isolation can lead to relative over-generalization (RO), where optimal joint actions are undervalued in favor of suboptimal ones. This hinders effective coordination in…

Machine Learning · Computer Science 2024-11-19 Ting Zhu, Yue Jin, Jeremie Houssineau, Giovanni Montana

CURO: Curriculum Learning for Relative Overgeneralization

Relative overgeneralization (RO) is a pathology that can arise in cooperative multi-agent tasks when the optimal joint action's utility falls below that of a sub-optimal joint action. RO can cause the agents to get stuck in local optima…

Machine Learning · Computer Science 2024-09-24 Lin Shi, Qiyuan Liu, Bei Peng

Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning

Policy optimization methods with function approximation are widely used in multi-agent reinforcement learning. However, it remains elusive how to design such algorithms with statistical guarantees. Leveraging a multi-agent performance…

Machine Learning · Computer Science 2023-05-09 Yulai Zhao, Zhuoran Yang, Zhaoran Wang, Jason D. Lee

Asynchronous, Option-Based Multi-Agent Policy Gradient: A Conditional Reasoning Approach

Cooperative multi-agent problems often require coordination between agents, which can be achieved through a centralized policy that considers the global state. Multi-agent policy gradient (MAPG) methods are commonly used to learn such…

Robotics · Computer Science 2023-08-03 Xubo Lyu, Amin Banitalebi-Dehkordi, Mo Chen, Yong Zhang

Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization

Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct…

Computation and Language · Computer Science 2026-03-04 Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu, Yuchen He, Zhiyuan Ning, Chen Yijun, Wenge Que, Li Shi

Settling the Variance of Multi-Agent Policy Gradients

Policy gradient (PG) methods are popular reinforcement learning (RL) methods where a baseline is often applied to reduce the variance of gradient estimates. In multi-agent RL (MARL), although the PG theorem can be naturally extended, the…

Machine Learning · Computer Science 2022-04-05 Jakub Grudzien Kuba, Muning Wen, Yaodong Yang, Linghui Meng, Shangding Gu, Haifeng Zhang, David Henry Mguni, Jun Wang

Robust and Diverse Multi-Agent Learning via Rational Policy Gradient

Adversarial optimization algorithms that explicitly search for flaws in agents' policies have been successfully applied to finding robust and diverse policies in multi-agent settings. However, the success of adversarial optimization has…

Artificial Intelligence · Computer Science 2025-11-13 Niklas Lauffer, Ameesh Shah, Micah Carroll, Sanjit A. Seshia, Stuart Russell, Michael Dennis

MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems

Group Relative Policy Optimization (GRPO) has been shown to be an effective algorithm when an accurate reward model is available. However, such a highly reliable reward model is not available in many real-world tasks. In this paper, we…

Machine Learning · Computer Science 2026-01-12 Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Mitsuki Sakamoto, Ryota Mitsuhashi, Eiji Uchibe

MAC-PO: Multi-Agent Experience Replay via Collective Priority Optimization

Experience replay is crucial for off-policy reinforcement learning (RL) methods. By remembering and reusing the experiences from past different policies, experience replay significantly improves the training efficiency and stability of RL…

Machine Learning · Computer Science 2023-03-01 Yongsheng Mei, Hanhan Zhou, Tian Lan, Guru Venkataramani, Peng Wei

Multi-Agent Constrained Policy Optimisation

Developing reinforcement learning algorithms that satisfy safety constraints is becoming increasingly important in real-world applications. In multi-agent reinforcement learning (MARL) settings, policy optimisation with safety awareness is…

Artificial Intelligence · Computer Science 2022-02-11 Shangding Gu, Jakub Grudzien Kuba, Munning Wen, Ruiqing Chen, Ziyan Wang, Zheng Tian, Jun Wang, Alois Knoll, Yaodong Yang

Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning

Ranking is a fundamental and widely studied problem in scenarios such as search, advertising, and recommendation. However, joint optimization for multi-scenario ranking, which aims to improve the overall performance of several ranking…

Artificial Intelligence · Computer Science 2018-09-18 Jun Feng, Heng Li, Minlie Huang, Shichen Liu, Wenwu Ou, Zhirong Wang, Xiaoyan Zhu

Reinforced Collaboration in Multi-Agent Flow Networks

Multi-agent systems provide a powerful way to extend large language models (LLMs) by decomposing a complex task into specialized subtasks handled by different agents. However, their performance is often hindered by error propagation,…

Machine Learning · Computer Science 2026-05-14 Zheng Wang, Yuang Liu, Yangkai Ding

Multiagent Soft Q-Learning

Policy gradient methods are often applied to reinforcement learning in continuous multiagent games. These methods perform local search in the joint-action space, and as we show, they are susceptible to a game-theoretic pathology known as…

Artificial Intelligence · Computer Science 2018-04-27 Ermo Wei, Drew Wicke, David Freelan, Sean Luke

Off-Policy Multi-Agent Decomposed Policy Gradients

Multi-agent policy gradient (MAPG) methods have recently witnessed vigorous progress. However, there is a significant performance discrepancy between MAPG methods and state-of-the-art multi-agent value-based approaches. In this paper, we…

Machine Learning · Computer Science 2020-10-06 Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, Chongjie Zhang

Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning

Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks. Unfortunately, when it comes to multi-agent reinforcement learning…

Artificial Intelligence · Computer Science 2022-04-05 Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, Yaodong Yang

MACRPO: Multi-Agent Cooperative Recurrent Policy Optimization

This work considers the problem of learning cooperative policies in multi-agent settings with partially observable and non-stationary environments without a communication channel. We focus on improving information sharing between agents and…

Machine Learning · Computer Science 2021-09-03 Eshagh Kargar, Ville Kyrki

Multi-Agent Online Optimization with Delays: Asynchronicity, Adaptivity, and Optimism

In this paper, we provide a general framework for studying multi-agent online learning problems in the presence of delays and asynchronicities. Specifically, we propose and analyze a class of adaptive dual averaging schemes in which agents…

Machine Learning · Computer Science 2022-04-19 Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos

Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn…

Machine Learning · Computer Science 2026-01-27 Junbo Li, Peng Zhou, Rui Meng, Meet P. Vadera, Lihong Li, Yang Li

Mars-PO: Multi-Agent Reasoning System Preference Optimization

Mathematical reasoning is a fundamental capability for large language models (LLMs), yet achieving high performance in this domain remains a significant challenge. The auto-regressive generation process often makes LLMs susceptible to…

Artificial Intelligence · Computer Science 2024-12-02 Xiaoxuan Lou, Chaojie Wang, Bo An

Coordinated Proximal Policy Optimization

We present Coordinated Proximal Policy Optimization (CoPPO), an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting. The key idea lies in the coordinated adaptation of step size during the…

Artificial Intelligence · Computer Science 2021-11-09 Zifan Wu, Chao Yu, Deheng Ye, Junge Zhang, Haiyin Piao, Hankz Hankui Zhuo