UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

Yiqun Chen; Wei Yang; Erhan Zhang; Shijie Wang; Qi Liu; Zechun Niu; Bin Zhang; Haitao Li; Rui Li; Lingyong Yan; Jinyuan Feng; Biqing Qi; Xiaochi Wei; Yan Gao; Yi Wu; Yao Hu; Jiaxin Mao

Artificial Intelligence | 2026-05-27 | v1 | Computation and Language | Multiagent Systems

Abstract

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent-model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL optimization improves on manually specified workflows, with especially large gains for smaller models and on strict all-passed code metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.
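
To make the decoupling of logical agents from physical model parameters concrete, the following is a minimal sketch, not the UnityMAS-O API. The role names (`planner`, `searcher`, `answerer`), the model identifiers (`m0`, `m1`, `m2`), the `Step` record, and the helper functions are all hypothetical, chosen only for illustration. It shows how one agent-to-model mapping could route the turns of a recorded trajectory to the model that would be updated on them, under full sharing, full separation, and partial sharing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One turn of a workflow trajectory (hypothetical record layout)."""
    role: str      # logical agent role that produced this turn
    turn: int      # position of the turn in the workflow
    reward: float  # turn-level reward assigned by a user-defined function


def group_by_model(steps, agent_to_model):
    """Route each step to the physical model backing its logical role."""
    batches = defaultdict(list)
    for step in steps:
        batches[agent_to_model[step.role]].append(step)
    return dict(batches)


# Three illustrative mappings from logical roles to physical models.
FULL_SHARING = {"planner": "m0", "searcher": "m0", "answerer": "m0"}
FULL_SEPARATION = {"planner": "m0", "searcher": "m1", "answerer": "m2"}
PARTIAL_SHARING = {"planner": "m0", "searcher": "m0", "answerer": "m1"}

# A toy trajectory with made-up rewards, for illustration only.
trajectory = [
    Step("planner", 0, 0.0),
    Step("searcher", 1, 0.5),
    Step("answerer", 2, 1.0),
]


def sharing_summary(mapping):
    """Return, per model, the roles whose turns it would be trained on."""
    batches = group_by_model(trajectory, mapping)
    return {model: [s.role for s in steps] for model, steps in batches.items()}
```

In this sketch the workflow and its trajectory stay the same across all three configurations; only the mapping changes which model receives which turns, which is the property the abstract attributes to agent-model mappings.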

Keywords

multi-agent systems, multi-agent reasoning, multi-agent reinforcement learning

Cite

@article{arxiv.2605.26646,
  title  = {UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems},
  author = {Yiqun Chen and Wei Yang and Erhan Zhang and Shijie Wang and Qi Liu and Zechun Niu and Bin Zhang and Haitao Li and Rui Li and Lingyong Yan and Jinyuan Feng and Biqing Qi and Xiaochi Wei and Yan Gao and Yi Wu and Yao Hu and Jiaxin Mao},
  journal= {arXiv preprint arXiv:2605.26646},
  year   = {2026}
}
