Weak supervision is a popular framework for overcoming the labeled data bottleneck: the need to obtain labels for training data. In weak supervision, multiple noisy-but-cheap sources each provide a guess of the label, and these guesses are aggregated to produce high-quality pseudolabels. These sources are often expressed as small programs written by domain experts, and so are expensive to obtain. Instead, we argue for using code-generation models as coding assistants for crafting weak supervision sources. We study prompting strategies to maximize the quality of the generated sources, settling on a multi-tier strategy that incorporates multiple types of information. We also explore how best to combine hand-written and generated sources. Using these insights, we introduce ScriptoriumWS, a weak supervision system that, when compared to hand-crafted sources, maintains accuracy and greatly improves coverage.
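As a rough illustration of the setup the abstract describes, the sketch below shows a few small programs acting as weak supervision sources (often called labeling functions) and a simple aggregation step. Everything here is an assumption made for illustration: the spam/ham task, the function names, and the use of plain majority vote as the aggregator. It is not the ScriptoriumWS implementation, and the paper's own aggregation method may differ.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical label encoding for a toy spam task (not from the paper).
ABSTAIN = -1
HAM = 0
SPAM = 1

LabelingFunction = Callable[[str], int]
GREETINGS = {"hi", "hello", "thanks"}


def lf_contains_link(text: str) -> int:
    """Vote SPAM when the text contains a link, otherwise abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    """Vote SPAM on prize-style wording, otherwise abstain."""
    lowered = text.lower()
    return SPAM if "prize" in lowered or "winner" in lowered else ABSTAIN


def lf_short_greeting(text: str) -> int:
    """Vote HAM for short messages that open with a greeting."""
    words = text.lower().split()
    if words and len(words) <= 5 and words[0] in GREETINGS:
        return HAM
    return ABSTAIN


def majority_vote(votes: List[int]) -> int:
    """Aggregate votes by majority; abstain on ties or when no source votes."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    ranked = counted.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def pseudolabel(texts: List[str], lfs: List[LabelingFunction]) -> List[int]:
    """Apply every labeling function to every text and aggregate the votes."""
    return [majority_vote([lf(t) for lf in lfs]) for t in texts]


def coverage(labels: List[int]) -> float:
    """Fraction of examples that received a non-abstain pseudolabel."""
    if not labels:
        return 0.0
    return sum(label != ABSTAIN for label in labels) / len(labels)


if __name__ == "__main__":
    texts = [
        "Winner! Claim your prize at http://example.com",
        "hi there",
        "See you at the meeting tomorrow afternoon please",
    ]
    lfs = [lf_contains_link, lf_mentions_prize, lf_short_greeting]
    labels = pseudolabel(texts, lfs)
    print(labels, coverage(labels))
```

In this toy setting, coverage is the share of examples on which at least one source votes and the vote is not tied. Adding more sources, whether hand-written or generated, is one way coverage can rise.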
@article{arxiv.2502.12366,
title = {ScriptoriumWS: A Code Generation Assistant for Weak Supervision},
author = {Tzu-Heng Huang and Catherine Cao and Spencer Schoenberg and Harit Vishwakarma and Nicholas Roberts and Frederic Sala},
journal = {arXiv preprint arXiv:2502.12366},
year = {2025}
}
Comments: Appeared in the ICLR'23 Deep Learning for Code (DL4C) Workshop and the 2023 Midwest Machine Learning Symposium