Semiparametric Efficient Data Integration Using the Dual-Frame Sampling Framework

Methodology 2026-01-14 v1 Statistics Theory

Abstract

Integrating probability and non-probability samples is increasingly important, yet unknown sampling mechanisms in non-probability sources complicate identification and efficient estimation. We develop semiparametric theory for dual-frame data integration and propose two complementary estimators. The first models the non-probability inclusion probability parametrically and attains the semiparametric efficiency bound. We introduce an identifiability condition based on strong monotonicity that identifies sampling-model parameters without instrumental variables, even under informative (non-ignorable) selection, using auxiliary information from the probability sample; it remains valid without record linkage between samples. The second estimator, motivated by a two-stage sampling approximation, avoids explicit modeling of the non-probability mechanism; though not fully efficient, it is efficient within a restricted augmentation class and is robust to misspecification. Simulations and an application to the Culture and Community in a Time of Crisis public simulation dataset show efficiency gains under correct specification and stable performance under misspecification and weak identification. Methods are implemented in the R package dfSEDI.

Cite

@article{arxiv.2601.08707,
  title  = {Semiparametric Efficient Data Integration Using the Dual-Frame Sampling Framework},
  author = {Kosuke Morikawa and Jae Kwang Kim},
  journal = {arXiv preprint arXiv:2601.08707},
  year   = {2026}
}