Related papers: Optimizing ETL Dataflow Using Shared Caching and P…

Mathematical Foundations of Modeling ETL Process Chains

Extract-Transform-Load (ETL) processes are core components of modern data processing infrastructures. The throughput of processed data records can be adjusted by changing the amount of allocated resources, i.e. the number of parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-01 Levin Maier, Lucas Schulze, Robert Lilow, Lukas Hahn, Nikola Krasowski, Arnulf Barth, Sebastian Gaebel, Ferdi Güran, Oliver Hanau, Giovanni Wagner, Falk Borgmann, Oleg Arenz, Jan Peters

FlowETL: An Autonomous Example-Driven Pipeline for Data Engineering

The Extract, Transform, Load (ETL) workflow is fundamental for populating and maintaining data warehouses and other data stores accessed by analysts for downstream tasks. A major shortcoming of modern ETL solutions is the extensive need for…

Software Engineering · Computer Science 2025-08-01 Mattia Di Profio, Mingjun Zhong, Yaji Sripada, Marcel Jaspars

A TTL-based Approach for Content Placement in Edge Networks

Edge networks are promising to provide better services to users by provisioning computing and storage resources at the edge of networks. However, due to the uncertainty and diversity of user interests, content popularity, distributed…

Networking and Internet Architecture · Computer Science 2020-03-16 Nitish K. Panigrahy, Jian Li, Faheem Zafari, Don Towsley, Paul Yu

Push Down Optimization for Distributed Multi Cloud Data Integration

Enterprises increasingly adopt multi cloud architectures to take advantage of diverse database engines, regional availability, and cost models. In these environments, ETL pipelines must process large, distributed datasets while minimizing…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-27 Ravi Kiran Kodali, Vinoth Punniyamoorthy, Akash Kumar Agarwal, Bikesh Kumar, Balakrishna Pothineni, Aswathnarayan Muthukrishnan Kirubakaran, Sumit Saha, Nachiappan Chockalingam

Two-level Data Staging ETL for Transaction Data

In data warehousing, Extract-Transform-Load (ETL) extracts the data from data sources into a central data warehouse regularly for the support of business decision-makings. The data from transaction processing systems are featured with the…

Databases · Computer Science 2014-09-16 Xiufeng Liu

Spinning Fast Iterative Data Flows

Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk…

Databases · Computer Science 2012-08-02 Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, Volker Markl

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

KV cache restoration has emerged as a dominant bottleneck in serving long-context LLM workloads, including multi-turn conversations, retrieval-augmented generation, and agentic pipelines. Existing approaches treat restoration as a…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-29 Sean Nian, Jiahao Fang, Qilong Feng, Zhiyu Wu, Fan Lai

Deep Q-Learning-Based Intelligent Scheduling for ETL Optimization in Heterogeneous Data Environments

This paper addresses the challenges of low scheduling efficiency, unbalanced resource allocation, and poor adaptability in ETL (Extract-Transform-Load) processes under heterogeneous data environments by proposing an intelligent scheduling…

Machine Learning · Computer Science 2025-12-16 Kangning Gao, Yi Hu, Cong Nie, Wei Li

Data Partitioning for Parallel Entity Matching

Entity matching is an important and difficult step for integrating web data. To reduce the typically high execution time for matching we investigate how we can perform entity matching in parallel on a distributed infrastructure. We propose…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-06-29 Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Groß, Hanna Köpcke, Erhard Rahm

Automatic Task Parallelization of Dataflow Graphs in ML/DL models

Several methods exist today to accelerate Machine Learning (ML) or Deep-Learning (DL) model performance for training and inference. However, modern techniques that rely on various graph and operator parallelism methodologies rely on search…

Machine Learning · Computer Science 2023-08-23 Srinjoy Das, Lawrence Rauchwerger

Data Extraction, Transformation, and Loading Process Automation for Algorithmic Trading Machine Learning Modelling and Performance Optimization

A data warehouse efficiently prepares data for effective and fast data analysis and modelling using machine learning algorithms. This paper discusses existing solutions for the Data Extraction, Transformation, and Loading (ETL) process and…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-12-21 Nassi Ebadifard, Ajitesh Parihar, Youry Khmelevsky, Gaetan Hains, Albert Wong, Frank Zhang

Parallelizing Query Optimization on Shared-Nothing Architectures

Data processing systems offer an ever increasing degree of parallelism on the levels of cores, CPUs, and processing nodes. Query optimization must exploit high degrees of parallelism in order not to gradually become the bottleneck of query…

Databases · Computer Science 2015-11-06 Immanuel Trummer, Christoph Koch

An Optimized Multi-Layer Resource Management in Mobile Edge Computing Networks: A Joint Computation Offloading and Caching Solution

Nowadays, data caching is being used as a high-speed data storage layer in mobile edge computing networks employing flow control methodologies at an exponential rate. This study shows how to discover the best architecture for backhaul…

Networking and Internet Architecture · Computer Science 2022-11-29 Amir Ziaeddini, Amin Mohajer, Davoud Yousefi, A. Mirzaei, Shu Gonglee

Cache-based Multi-query Optimization for Data-intensive Scalable Computing Frameworks

In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work, for example scanning and processing the same subset of data. Instead of optimizing jobs independently, which may result in…

Databases · Computer Science 2018-05-23 Pietro Michiardi, Damiano Carra, Sara Migliorini

DOD-ETL: Distributed On-Demand ETL for Near Real-Time Business Intelligence

The competitive dynamics of the globalized market demand information on the internal and external reality of corporations. Information is a precious asset and is responsible for establishing key advantages to enable companies to maintain…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-17 Gustavo V. Machado, Ítalo Cunha, Adriano C. M. Pereira, Leonardo B. Oliveira

Co-Optimizing Cache Partitioning and Multi-Core Task Scheduling: Exploit Cache Sensitivity or Not?

Cache partitioning techniques have been successfully adopted to mitigate interference among concurrently executing real-time tasks on multi-core processors. Considering that the execution time of a cache-sensitive task strongly depends on…

Hardware Architecture · Computer Science 2023-10-05 Binqi Sun, Debayan Roy, Tomasz Kloda, Andrea Bastoni, Rodolfo Pellizzoni, Marco Caccamo

Improvement Cache Efficiency of Explicit Finite Element Procedure and its Application to Parallel Casting Solidification Simulation

A simple method for improving cache efficiency of serial and parallel explicit finite procedure with application to casting solidification simulation over three-dimensional complex geometries is presented. The method is based on division of…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-05-19 Ruhollah Tavakoli

TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving

Prefix caching is crucial to accelerate multi-turn interactions and requests with shared prefixes. At the cluster level, existing prefix caching systems are tightly coupled with request scheduling to optimize cache efficiency and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-26 Bingyang Wu, Zili Zhang, Yinmin Zhong, Guanzhe Huang, Yibo Zhu, Xuanzhe Liu, Xin Jin

Efficient Task Grouping Through Samplewise Optimisation Landscape Analysis

Shared training approaches, such as multi-task learning (MTL) and gradient-based meta-learning, are widely used in various machine learning applications, but they often suffer from negative transfer, leading to performance degradation in…

Machine Learning · Computer Science 2024-12-10 Anshul Thakur, Yichen Huang, Soheila Molaei, Yujiang Wang, David A. Clifton

Formalizing ETLT and ELTL Design Patterns and Proposing Enhanced Variants: A Systematic Framework for Modern Data Engineering

Traditional ETL and ELT design patterns struggle to meet modern requirements of scalability, governance, and real-time data processing. Hybrid approaches such as ETLT (Extract-Transform-Load-Transform) and ELTL (Extract-Load-Transform-Load)…

Databases · Computer Science 2025-11-06 Chiara Rucco, Motaz Saad, Antonella Longo