RL agents Implicitly Learning Human Preferences

Nevan Wichers

RL agents Implicitly Learning Human Preferences

Artificial Intelligence 2020-02-17 v1 Machine Learning

Authors: Nevan Wichers

Abstract

In the real world, RL agents should be rewarded for fulfilling human preferences. We show that RL agents implicitly learn the preferences of humans in their environment. Training a classifier to predict if a simulated human's preferences are fulfilled based on the activations of a RL agent's neural network gets .93 AUC. Training a classifier on the raw environment state gets only .8 AUC. Training the classifier off of the RL agent's activations also does much better than training off of activations from an autoencoder. The human preference classifier can be used as the reward function of an RL agent to make RL agent more beneficial for humans.

RL agents Implicitly Learning Human Preferences

Abstract

Keywords

Cite

Related papers