Off-Belief Learning

Artificial Intelligence 2021-08-19 v5 Machine Learning

Abstract

The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and implicitly rely on multi-step reasoning based on fragile assumptions about other agents' actions, and thus fail when paired with humans or independently trained agents at test time. To address this, we present off-belief learning (OBL). At each timestep, OBL agents follow a policy $\pi_1$ that is optimized assuming past actions were taken by a given, fixed policy ($\pi_0$), but assuming that future actions will be taken by $\pi_1$. When $\pi_0$ is uniform random, OBL converges to an optimal policy that does not rely on inferences based on other agents' behavior (an optimal grounded policy). OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next, thereby introducing multi-level cognitive reasoning in a controlled manner. Unlike existing approaches, which may converge to any equilibrium policy, OBL converges to a unique policy, making it suitable for zero-shot coordination (ZSC). OBL can be scaled to high-dimensional settings with a fictitious transition mechanism and shows strong performance in both a toy setting and Hanabi, the benchmark problem for human-AI coordination and ZSC.
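
To make the core idea concrete, the following is a minimal sketch of a single OBL level with a uniform-random $\pi_0$. The game is hypothetical and is not taken from the paper: Alice sees a pet (cat or dog), may toggle a light or pay to reveal the pet, and Bob then guesses or bails. The game, all reward values, and every function name here are assumptions chosen for illustration. Bob's belief about the past is computed as if Alice had played $\pi_0$, while Alice plans as if Bob will play the resulting $\pi_1$.

```python
PETS = (0, 1)  # 0 = cat, 1 = dog, drawn uniformly (assumed prior)
ALICE_ACTIONS = ("light_on", "light_off", "reveal")
BOB_ACTIONS = ("guess_cat", "guess_dog", "bail")
# Hypothetical reward values, chosen only for this illustration.
REVEAL_COST, BAIL_REWARD, HIT, MISS = 5.0, 1.0, 10.0, -10.0


def bob_reward(pet, bob_action):
    if bob_action == "bail":
        return BAIL_REWARD
    guess = 0 if bob_action == "guess_cat" else 1
    return HIT if guess == pet else MISS


def belief_under_pi0(alice_action, seen_pet, pi0):
    """Bob's posterior over the pet, assuming Alice's action came from pi0."""
    if seen_pet is not None:  # revealed: the observation itself fixes the belief
        return {p: float(p == seen_pet) for p in PETS}
    weights = {p: 0.5 * pi0[p][alice_action] for p in PETS}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


def expected_return(alice, bob):
    """Expected team return when Alice and Bob actually play the given policies."""
    total = 0.0
    for pet in PETS:
        a = alice[pet]
        seen = pet if a == "reveal" else None
        cost = REVEAL_COST if a == "reveal" else 0.0
        total += 0.5 * (bob_reward(pet, bob[(a, seen)]) - cost)
    return total


def obl_level(pi0):
    """One OBL level: past actions interpreted via pi0, future actions via pi1."""
    bob = {}
    for a in ALICE_ACTIONS:
        for pet in PETS:
            seen = pet if a == "reveal" else None
            belief = belief_under_pi0(a, seen, pi0)
            bob[(a, seen)] = max(
                BOB_ACTIONS,
                key=lambda b: sum(belief[p] * bob_reward(p, b) for p in PETS),
            )
    alice = {}
    for pet in PETS:
        def ret(a, pet=pet):
            # Alice evaluates each action assuming Bob responds with pi1.
            seen = pet if a == "reveal" else None
            cost = REVEAL_COST if a == "reveal" else 0.0
            return bob_reward(pet, bob[(a, seen)]) - cost
        alice[pet] = max(ALICE_ACTIONS, key=ret)
    return alice, bob


# pi0: uniform random over Alice's actions, independent of the pet.
uniform_pi0 = {p: {a: 1.0 / len(ALICE_ACTIONS) for a in ALICE_ACTIONS} for p in PETS}
alice1, bob1 = obl_level(uniform_pi0)
obl_value = expected_return(alice1, bob1)

# A hand-written self-play convention for comparison: light on means "dog".
conv_alice = {0: "light_off", 1: "light_on"}
conv_bob = {("light_off", None): "guess_cat", ("light_on", None): "guess_dog",
            ("reveal", 0): "guess_cat", ("reveal", 1): "guess_dog"}
convention_value = expected_return(conv_alice, conv_bob)
```

In this sketch the light carries no information under a uniform $\pi_0$, so Bob bails unless the pet is revealed, and Alice pays to reveal it. The light convention scores higher in self-play but depends on both players sharing an arbitrary signal meaning, which is the failure mode the abstract describes. Iterating the hierarchy would feed `alice1` back in as the next $\pi_0$; that step is omitted here because it requires a rule for observations that have zero probability under $\pi_0$.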

Cite

@article{arxiv.2103.04000,
  title  = {Off-Belief Learning},
  author = {Hengyuan Hu and Adam Lerer and Brandon Cui and David Wu and Luis Pineda and Noam Brown and Jakob Foerster},
  journal= {arXiv preprint arXiv:2103.04000},
  year   = {2021}
}