Policy Learning Using Weak Supervision

Machine Learning, Artificial Intelligence | 2021-11-03 | v3

Abstract

Most existing policy learning solutions require the learning agent to receive high-quality supervision signals, such as well-designed rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). Such supervision is usually infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages cheap, available weak supervision to perform policy learning efficiently. To this end, we treat the weak supervision as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy rather than on simple agreement. Our approach explicitly punishes a policy for overfitting to the weak supervision. In addition to theoretical guarantees, extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training show that our method leads to substantial performance improvements, especially when the complexity or the noise of the learning environments is high.
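
The contrast between simple agreement and correlated agreement can be illustrated with a toy calculation. The sketch below is not the paper's objective. It assumes discrete actions, a hypothetical weight `xi`, and hypothetical function names, and it scores a policy by its agreement with the weak supervisor on matched states minus the agreement expected if the two sets of actions were paired independently.

```python
from collections import Counter


def simple_agreement(policy_actions, weak_actions):
    """Fraction of states where the policy picks the weak supervisor's action."""
    matches = sum(p == w for p, w in zip(policy_actions, weak_actions))
    return matches / len(policy_actions)


def chance_agreement(policy_actions, weak_actions):
    """Expected agreement if policy and supervisor actions were paired at random.

    Computed exactly from the empirical action frequencies instead of by
    sampling random pairs (an assumption made to keep the sketch deterministic).
    """
    n = len(policy_actions)
    policy_freq = Counter(policy_actions)
    weak_freq = Counter(weak_actions)
    return sum((policy_freq[a] / n) * (weak_freq[a] / n) for a in policy_freq)


def correlated_agreement(policy_actions, weak_actions, xi=1.0):
    """Matched agreement minus xi times chance agreement (xi is hypothetical)."""
    return simple_agreement(policy_actions, weak_actions) - xi * chance_agreement(
        policy_actions, weak_actions
    )


# A weak supervisor that is biased toward action 0.
weak = [0] * 8 + [1] * 2

# A degenerate policy that always outputs the supervisor's majority action.
constant_policy = [0] * 10
# A policy that actually tracks the supervisor state by state.
tracking_policy = list(weak)

constant_simple = simple_agreement(constant_policy, weak)
constant_corr = correlated_agreement(constant_policy, weak)
tracking_corr = correlated_agreement(tracking_policy, weak)
print(constant_simple, constant_corr, tracking_corr)
```

In this toy setting the constant policy looks good under simple agreement because it exploits the supervisor's bias, yet it earns no credit under the correlated score, while the state-dependent policy still does. That is one reading of how such a score can penalize overfitting to the weak supervision.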

Cite

@article{arxiv.2010.01748,
  title  = {Policy Learning Using Weak Supervision},
  author = {Jingkang Wang and Hongyi Guo and Zhaowei Zhu and Yang Liu},
  journal= {arXiv preprint arXiv:2010.01748},
  year   = {2021}
}

Comments

NeurIPS 2021