ScriptoriumWS: A Code Generation Assistant for Weak Supervision

Machine Learning 2025-02-19 v1

Abstract

Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources are used to provide guesses of the label and are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts, and so are expensive to obtain. Instead, we argue for using code-generation models to act as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We explore how to best combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage.
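To make the weak supervision setup described in the abstract concrete, here is a minimal sketch of three small label-guessing programs and a simple aggregation step. Everything in it is an illustrative assumption rather than the paper's method: the spam/ham task, the function names (`lf_contains_link`, `lf_mentions_prize`, `lf_short_greeting`), and the use of plain majority voting in place of a learned label model. It also shows one common way to measure coverage, taken here as the fraction of examples on which at least one source votes.

```python
from collections import Counter

# Hypothetical label values for an illustrative spam-detection task.
ABSTAIN = -1
HAM = 0
SPAM = 1


# Each source is a small program that guesses a label or abstains.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_greeting(text):
    return HAM if len(text.split()) <= 3 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_greeting]


def apply_lfs(texts, lfs=LABELING_FUNCTIONS):
    """Return one row of votes per example, one column per source."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes):
    """Aggregate votes by majority; abstain on ties or when no source votes.

    Assumption: a simple stand-in for a learned label model.
    """
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    ranked = counted.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def coverage(vote_matrix):
    """Fraction of examples that receive at least one non-abstain vote."""
    if not vote_matrix:
        return 0.0
    covered = sum(any(v != ABSTAIN for v in row) for row in vote_matrix)
    return covered / len(vote_matrix)


texts = [
    "Claim your prize at http://example.com",
    "hello there",
    "The meeting is moved to Thursday afternoon",
]
votes = apply_lfs(texts)
pseudolabels = [majority_vote(row) for row in votes]
print(pseudolabels, coverage(votes))
```

In this toy run the third message triggers no source, so it receives no pseudolabel. Adding further sources, whether hand-written or generated, is one way such gaps in coverage could be closed.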

Cite

@article{arxiv.2502.12366,
  title  = {ScriptoriumWS: A Code Generation Assistant for Weak Supervision},
  author = {Tzu-Heng Huang and Catherine Cao and Spencer Schoenberg and Harit Vishwakarma and Nicholas Roberts and Frederic Sala},
  journal= {arXiv preprint arXiv:2502.12366},
  year   = {2025}
}

Comments

Appeared in ICLR'23 Deep Learning for Code (DL4C) Workshop & 2023 Midwest Machine Learning Symposium
