Enabling Collaborative Data Science Development with the Ballet Framework

Machine Learning, Human-Computer Interaction, Software Engineering. 2021-10-26 (v5)

Abstract

While the open-source software development model has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small teams. We describe challenges to scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, a lightweight framework for collaborative, open-source data science that focuses on feature engineering, together with an accompanying cloud-based development environment. Using our framework, collaborators incrementally propose feature definitions to a repository; each proposed definition is subjected to an ML performance evaluation and, if it passes, can be automatically merged into an executable feature engineering pipeline. We use Ballet to conduct a case study of an income prediction problem with 27 collaborators, and discuss implications for future designers of collaborative projects.
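The propose, evaluate, and merge loop described in the abstract can be illustrated with a minimal sketch. It does not use Ballet's API and does not reproduce the paper's evaluation procedure: `FeatureDefinition`, `pearson`, `propose`, and `run_pipeline` are hypothetical names, and the acceptance rule (a relevance threshold against the target plus a redundancy check against already accepted features, both based on Pearson correlation) is a simple stand-in for an ML performance evaluation. The toy rows and thresholds are likewise made up for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical feature definition: a name plus a per-row transform."""
    name: str
    transform: Callable[[Row], float]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def propose(candidate: FeatureDefinition,
            accepted: List[FeatureDefinition],
            rows: Sequence[Row],
            target: Sequence[float],
            min_relevance: float = 0.3,
            max_redundancy: float = 0.99) -> bool:
    """Evaluate a proposed feature and merge it into `accepted` if it passes.

    Stand-in evaluation (an assumption, not the paper's method):
    the candidate must correlate with the target and must not
    near-duplicate a feature that was already accepted.
    """
    values = [candidate.transform(r) for r in rows]
    if abs(pearson(values, target)) < min_relevance:
        return False  # not informative enough about the target
    for feature in accepted:
        existing = [feature.transform(r) for r in rows]
        if abs(pearson(values, existing)) > max_redundancy:
            return False  # redundant with an accepted feature
    accepted.append(candidate)  # the "automatic merge" step
    return True


def run_pipeline(accepted: Sequence[FeatureDefinition],
                 rows: Sequence[Row]) -> List[List[float]]:
    """Apply every accepted feature to every row, yielding a feature matrix."""
    return [[f.transform(r) for f in accepted] for r in rows]


# Toy data, invented for illustration only.
rows = [
    {"age": 25.0, "hours": 20.0},
    {"age": 35.0, "hours": 40.0},
    {"age": 45.0, "hours": 40.0},
    {"age": 55.0, "hours": 60.0},
]
target = [20.0, 40.0, 50.0, 70.0]

accepted: List[FeatureDefinition] = []
results = {
    f.name: propose(f, accepted, rows, target)
    for f in [
        FeatureDefinition("age", lambda r: r["age"]),
        FeatureDefinition("age_doubled", lambda r: 2 * r["age"]),
        FeatureDefinition("constant", lambda r: 1.0),
        FeatureDefinition("hours", lambda r: r["hours"]),
    ]
}
matrix = run_pipeline(accepted, rows)
```

In this sketch each proposal is judged against the current state of the shared pipeline, so the order in which collaborators submit features matters: `age_doubled` is rejected only because `age` was merged first.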

Cite

@article{arxiv.2012.07816,
  title  = {Enabling Collaborative Data Science Development with the Ballet Framework},
  author = {Micah J. Smith and Jürgen Cito and Kelvin Lu and Kalyan Veeramachaneni},
  journal= {arXiv preprint arXiv:2012.07816},
  year   = {2021}
}