English

Localizing Model Behavior with Path Patching

Machine Learning 2023-05-17 v2

Abstract

Localizing behaviors of neural networks to a subset of the network's components or a subset of interactions between components is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad-hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses expressing that behaviors are localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open source a framework for efficiently running similar experiments.

Keywords

Cite

@article{arxiv.2304.05969,
  title  = {Localizing Model Behavior with Path Patching},
  author = {Nicholas Goldowsky-Dill and Chris MacLeod and Lucas Sato and Aryaman Arora},
  journal= {arXiv preprint arXiv:2304.05969},
  year   = {2023}
}

Comments

20 pages, 16 figures