Data Virtualization for Machine Learning

Software Engineering 2025-09-19 v1 Machine Learning

Abstract

Machine learning (ML) teams often run multiple concurrent ML workflows for different applications. Each workflow typically involves many experiments, iterations, and collaborative activities, and commonly takes months, sometimes years, to progress from initial data wrangling to model deployment. As a result, an organization must store, process, and maintain a large amount of intermediate data. Data virtualization therefore becomes a critical technology in any infrastructure that serves ML workflows. In this paper, we present the design and implementation of a data virtualization service, focusing on its service architecture and service operations. The infrastructure currently supports six ML applications, each with more than one ML workflow. The data virtualization service allows the number of applications and workflows to grow in the coming years.

Cite

@article{arxiv.2507.17293,
  title  = {Data Virtualization for Machine Learning},
  author = {Saiful Khan and Joyraj Chakraborty and Philip Beaucamp and Niraj Bhujel and Min Chen},
  journal= {arXiv preprint arXiv:2507.17293},
  year   = {2025}
}