Patch-Effect Graph Kernels for LLM Interpretability

Ruben Fernandez-Boullon; David N. Olivieri

Artificial Intelligence | 2026-05-08 | v1 | Computation and Language

Abstract

Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically. We propose a framework that recasts mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components. We introduce three graph-construction methods: direct influence via causal mediation, partial correlation (PC), and co-influence (CI). We then apply graph kernels to analyze the resulting structures. Evaluating this approach on GPT-2 Small using Indirect Object Identification (IOI) and related tasks, we find that patch-effect graphs preserve discriminative structural signals. Specifically, localized edge-slot features yield higher classification accuracy than global graph-shape descriptors. A screened paired-patching validation suggests that CI- and PC-selected candidate edges correspond to stronger activation-influence effects than random or low-rank candidates. By evaluating these representations against prompt-only and raw patch-effect controls, we make the evidential scope of the benchmark explicit: graph features compress structured patching signal, while raw tensors and surface cues define strong baselines that any circuit-level claim should address. Our framework thus provides a compression and evaluation pipeline for comparing patching-derived structures under controlled baselines, separating robust slice-discriminative evidence from stronger task-general causal-circuit claims.
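
To make the partial-correlation construction and the edge-slot features more concrete, here is a minimal Python sketch. The abstract does not specify how either is computed, so everything here is an assumption: partial correlations are taken from the inverse covariance (precision) matrix of a prompts-by-components patch-effect matrix, a small ridge term keeps that matrix invertible, and an "edge slot" is read as one fixed feature position per component pair. The function names, the threshold, and the synthetic data are hypothetical and not taken from the paper.

```python
import numpy as np


def partial_correlation(effects: np.ndarray, ridge: float = 1e-3) -> np.ndarray:
    """Partial correlations between columns of a (prompts x components) matrix.

    Assumption: the standard precision-matrix formula, not the paper's method.
    """
    cov = np.cov(effects, rowvar=False)
    # Ridge term (assumed) keeps the covariance invertible when data are scarce.
    precision = np.linalg.inv(cov + ridge * np.eye(cov.shape[0]))
    scale = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr


def edge_slot_features(weights: np.ndarray, threshold: float) -> np.ndarray:
    """Flatten the upper triangle into one fixed slot per component pair.

    Slots whose absolute weight falls below the threshold are zeroed,
    which corresponds to dropping that edge from the graph.
    """
    rows, cols = np.triu_indices(weights.shape[0], k=1)
    slots = np.abs(weights[rows, cols])
    return np.where(slots >= threshold, slots, 0.0)


rng = np.random.default_rng(0)
# Hypothetical patch effects: 200 prompts x 4 components.
# Component 1 is made to track component 0 so that one edge is clearly present.
effects = rng.normal(size=(200, 4))
effects[:, 1] = effects[:, 0] + 0.1 * rng.normal(size=200)

pc = partial_correlation(effects)
features = edge_slot_features(pc, threshold=0.3)
print(features.shape)  # one slot per unordered component pair
```

Because every graph is built over the same set of components, the slot vectors are aligned across prompts or task slices and can be passed directly to a classifier, or compared with a dot product as a simple linear kernel. Global shape descriptors such as density or degree statistics would instead discard which specific pair each edge connects.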

Keywords

graph generation; graph representation learning; interpretable machine learning

Cite

@article{arxiv.2605.06480,
  title  = {Patch-Effect Graph Kernels for LLM Interpretability},
  author = {Ruben Fernandez-Boullon and David N. Olivieri},
  journal= {arXiv preprint arXiv:2605.06480},
  year   = {2026}
}
