Related papers: Towards Next Generation Data Engineering Pipelines

Towards Evolution Capabilities in Data Pipelines

Evolutionary change over time in the context of data pipelines is certain, especially with regard to the structure and semantics of data as well as to the pipeline operators. Dealing with these changes, i.e. providing long-term maintenance,…

Databases · Computer Science 2025-07-29 Kevin M. Kramer

A Primer on the Data Cleaning Pipeline

The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, among others, has grown rapidly over the past decade. With this…

Databases · Computer Science 2023-07-26 Rebecca C. Steorts

A Survey of Pipeline Tools for Data Engineering

Currently, a variety of pipeline tools are available for use in data engineering. Data scientists can use these tools to resolve data wrangling issues associated with data and accomplish some data engineering tasks from data ingestion…

Machine Learning · Computer Science 2024-06-21 Anthony Mbata, Yaji Sripada, Mingjun Zhong

Autonomous Data Processing using Meta-Agents

Traditional data processing pipelines are typically static and handcrafted for specific tasks, limiting their adaptability to evolving requirements. While general-purpose agents and coding assistants can generate code for well-understood…

Artificial Intelligence · Computer Science 2026-02-20 Udayan Khurana

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

Much of the alignment tuning literature is organized around optimization objectives, while the construction of alignment data is often treated implicitly. In this survey, we adopt a data centric perspective and reframe alignment tuning as a…

Computation and Language · Computer Science 2026-05-27 Hwanjun Song

Progressive Data Science: Potential and Challenges

Data science requires time-consuming iterative manual activities. In particular, activities such as data selection, preprocessing, transformation, and mining depend heavily on iterative trial-and-error processes that could be sped up…

Human-Computer Interaction · Computer Science 2019-09-13 Cagatay Turkay, Nicola Pezzotti, Carsten Binnig, Hendrik Strobelt, Barbara Hammer, Daniel A. Keim, Jean-Daniel Fekete, Themis Palpanas, Yunhai Wang, Florin Rusu

The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large

An increasingly large number of software systems today include data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages from acquisition, to cleaning/curation, to modeling,…

Software Engineering · Computer Science 2022-02-15 Sumon Biswas, Mohammad Wardat, Hridesh Rajan

Energy Profiling of Data-Sharing Pipelines: Modeling, Estimation, and Reuse Strategies

Data-sharing pipelines involve a series of stages that apply policy-based data transformations to enable secure and effective data exchange among organizations. Although numerous tools and platforms exist to manage governance and…

Databases · Computer Science 2025-12-05 Sepideh Masoudi, Sebastian Werner, Pierluigi Plebani, Stefan Tai

Adaptive Neural Networks for Intelligent Data-Driven Development

Advances in machine learning methods for computer vision tasks have led to their consideration for safety-critical applications like autonomous driving. However, effectively integrating these methods into the automotive development…

Computer Vision and Pattern Recognition · Computer Science 2025-04-08 Youssef Shoeb, Azarm Nowzad, Hanno Gottschalk

Optimization Opportunities for Cloud-Based Data Pipeline Infrastructures

Cloud infrastructure supports the efficient operation of data pipelines regarding requirements like cost, speed, and resource utilization. We present an integrated view of optimization opportunities for cloud-based data pipelines by…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-03 Johannes Jablonski, Georg-Daniel Schwarz, Philip Heltweg, Dirk Riehle

Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"

Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and…

Databases · Computer Science 2024-05-01 Stefan Grafberger, Paul Groth, Sebastian Schelter

PRE-Share Data: Assistance Tool for Resource-aware Designing of Data-sharing Pipelines

Data is a valuable asset, and sharing it as a product across organizations is key to building comprehensive and useful insights in fields such as science and industry. Before sharing, data often requires transformation to comply with…

Social and Information Networks · Computer Science 2025-03-18 Sepideh Masoudi

Optimizing Compiler for Engineering Problems

New information technologies offer many prospects for performance improvement. One of them is "Dynamic Source Code Generation and Compilation". This article shows how this approach provides high performance for engineering problems.

Performance · Computer Science 2008-08-25 Petr R. Ivankov

Two-stage Optimization for Machine Learning Workflow

Machine learning techniques play a preponderant role in dealing with massive amounts of data and are employed in almost every possible domain. Building a high-quality machine learning model to be deployed in production is a challenging…

Machine Learning · Computer Science 2019-07-02 Alexandre Quemy

Automated Planning for Optimal Data Pipeline Instantiation

Data pipeline frameworks provide abstractions for implementing sequences of data-intensive transformation operators, automating the deployment and execution of such transformations in a cluster. Deploying a data pipeline, however, requires…

Artificial Intelligence · Computer Science 2026-01-13 Leonardo Rosa Amado, Adriano Vogel, Dalvan Griebler, Gabriel Paludo Licks, Eric Simon, Felipe Meneguzzi

Data Pipeline Quality: Influencing Factors, Root Causes of Data-related Issues, and Processing Problem Areas for Developers

Data pipelines are an integral part of various modern data-driven systems. However, despite their importance, they are often unreliable and deliver poor-quality data. A critical step toward improving this situation is a solid understanding…

Software Engineering · Computer Science 2023-09-14 Harald Foidl, Valentina Golendukhina, Rudolf Ramler, Michael Felderer

Study on emerging applications on data plane and optimization possibilities

By programming both the data plane and the control plane, network operators can adapt their networks to their needs. Thanks to research over the past decade, this concept has become more formalized and more technologically feasible. However, since…

Networking and Internet Architecture · Computer Science 2022-04-22 Gereltsetseg Altangerel, Tejfel Mate

Synthetic 3D Data Generation Pipeline for Geometric Deep Learning in Architecture

With the growing interest in deep learning algorithms and computational design in the architectural field, the need for large, accessible and diverse architectural datasets increases. We decided to tackle this problem by constructing a…

Computer Vision and Pattern Recognition · Computer Science 2021-07-08 Stanislava Fedorova, Alberto Tono, Meher Shashwat Nigam, Jiayao Zhang, Amirhossein Ahmadnia, Cecilia Bolognesi, Dominik L. Michels

A Data-Centric Optimization Framework for Machine Learning

Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute. However, as frameworks specialize performance optimization to patterns in popular networks, they…

Machine Learning · Computer Science 2022-08-31 Oliver Rausch, Tal Ben-Nun, Nikoli Dryden, Andrei Ivanov, Shigang Li, Torsten Hoefler

Data Readiness Levels

Application of models to data is fraught. Data-generating collaborators often only have a very basic understanding of the complications of collating, processing and curating data. Challenges include: poor data collection practices, missing…

Databases · Computer Science 2017-05-08 Neil D. Lawrence