Aligning AI With Shared Human Values

Dan Hendrycks; Collin Burns; Steven Basart; Andrew Critch; Jerry Li; Dawn Song; Jacob Steinhardt

Computers and Society | 2023-02-20 | v6 | Artificial Intelligence, Computation and Language, Machine Learning

Abstract

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgments. Our work shows that progress can be made on machine ethics today, and it provides a stepping stone toward AI that is aligned with human values.

Keywords

ethics, fairness in machine learning, artificial intelligence

Cite

@article{arxiv.2008.02275,
  title  = {Aligning AI With Shared Human Values},
  author = {Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal= {arXiv preprint arXiv:2008.02275},
  year   = {2023}
}

Comments

ICLR 2021; the ETHICS dataset is available at https://github.com/hendrycks/ethics/
