Machine-Learning-Powered Specification Testing in Linear Instrumental Variable Models

Methodology 2026-04-21 v3

Abstract

The linear instrumental variable (IV) model is widely used in observational studies, yet its validity hinges on strong assumptions. Classical specification tests such as the Sargan-Hansen J test are limited to overidentified settings and are therefore not applicable in the common just-identified case, where the number of instruments is equal to the number of endogenous variables. We propose a novel test for the well-specification of the linear IV model under the assumption that the structural error is mean independent of the instruments. This assumption enables specification testing even in the just-identified setting. Our approach uses the idea of residual prediction: if the two-stage least squares residuals can be predicted from the instruments better than chance, this indicates misspecification. The resulting test employs sample splitting and a user-chosen machine learning method, and we show asymptotic type I error control and consistency against a broad class of alternatives. We further show how the proposed testing principle can be adapted to settings with weak or many instruments via an Anderson-Rubin-type inversion, thereby substantially extending the applicability. The tests accommodate heteroskedasticity- and cluster-robust inference and are implemented in the R package RPIV and the ivmodels software package for Python.
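
The residual prediction idea described above can be illustrated with a minimal sketch. This is not the authors' test statistic, nor the API of RPIV or ivmodels: every function name here (`tsls_residuals`, `fit_binned_mean`, `residual_prediction_stat`, `simulate`) is hypothetical, the "machine learning method" is replaced by a simple binned-mean regressor, the model has a single instrument and a single endogenous regressor, and the variance estimate is naive in that it ignores the effect of estimating the IV coefficients. It therefore makes no claim to the type I error control established in the paper; it only shows the mechanics of sample splitting and of predicting residuals from the instrument.

```python
import bisect
import math
import random
import statistics


def tsls_residuals(z, x, y):
    """Residuals of just-identified 2SLS with one instrument and an intercept."""
    z_bar, x_bar, y_bar = statistics.fmean(z), statistics.fmean(x), statistics.fmean(y)
    s_zy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    s_zx = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x))
    beta = s_zy / s_zx  # IV estimate; coincides with 2SLS when just-identified
    alpha = y_bar - beta * x_bar
    return [yi - alpha - beta * xi for xi, yi in zip(x, y)]


def fit_binned_mean(z_train, r_train, n_bins=10):
    """Stand-in learner (assumption): piecewise-constant fit on quantile bins of z."""
    ordered = sorted(z_train)
    edges = [ordered[(k * len(ordered)) // n_bins] for k in range(1, n_bins)]
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for zi, ri in zip(z_train, r_train):
        b = bisect.bisect_right(edges, zi)
        sums[b] += ri
        counts[b] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda zi: means[bisect.bisect_right(edges, zi)]


def residual_prediction_stat(z, x, y, seed=0):
    """Learn to predict residuals on one half, score the prediction on the other."""
    idx = list(range(len(z)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]

    def take(rows):
        return [z[i] for i in rows], [x[i] for i in rows], [y[i] for i in rows]

    z_tr, x_tr, y_tr = take(train)
    z_te, x_te, y_te = take(test)
    predictor = fit_binned_mean(z_tr, tsls_residuals(z_tr, x_tr, y_tr))
    r_te = tsls_residuals(z_te, x_te, y_te)
    # Under a well-specified model the held-out residuals should be
    # unpredictable from z, so these products should average to about zero.
    products = [predictor(zi) * ri for zi, ri in zip(z_te, r_te)]
    std_err = statistics.stdev(products) / math.sqrt(len(products))  # naive
    stat = statistics.fmean(products) / std_err
    p_value = 0.5 * math.erfc(stat / math.sqrt(2))  # one-sided normal tail
    return stat, p_value


def simulate(n, misspecified, seed):
    """Toy data: x is endogenous; optionally z enters y directly through z**2."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi, v, e = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        eps = 0.5 * v + e  # structural error correlated with x through v
        xi = zi + v
        z.append(zi)
        x.append(xi)
        y.append(xi + eps + (zi ** 2 if misspecified else 0.0))
    return z, x, y


if __name__ == "__main__":
    for flag in (False, True):
        stat, p = residual_prediction_stat(*simulate(4000, flag, seed=1))
        print(f"misspecified={flag}: stat={stat:.2f}, p={p:.4f}")
```

In the misspecified toy design the instrument affects the outcome through `z**2`, which the linear model cannot absorb, so the residuals remain predictable from the instrument even though the model is just-identified and an overidentification test would not be available.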

Cite

@article{arxiv.2506.12771,
  title  = {Machine-Learning-Powered Specification Testing in Linear Instrumental Variable Models},
  author = {Cyrill Scheidegger and Malte Londschien and Peter Bühlmann},
  journal = {arXiv preprint arXiv:2506.12771},
  year   = {2026}
}