Related papers: High Performance Data Engineering Everywhere

A Fast, Scalable, Universal Approach For Distributed Data Aggregations

In the current era of Big Data, data engineering has transformed into an essential field of study across many branches of science. Advancements in Artificial Intelligence (AI) have broadened the scope of data engineering and opened up new…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-16 Niranda Perera, Vibhatha Abeykoon, Chathura Widanage, Supun Kamburugamuve, Thejaka Amila Kanewala, Pulasthi Wickramasinghe, Ahmet Uyar, Hasara Maithree, Damitha Lenadora, Geoffrey Fox

Combining Serverless and High-Performance Computing Paradigms to support ML Data-Intensive Applications

Data is found everywhere, from health and human infrastructure to the surge of sensors and the proliferation of internet-connected devices. To meet this challenge, the data engineering field has expanded significantly in recent years in…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-06 Mills Staylor, Arup Kumar Sarker, Gregor von Laszewski, Geoffrey Fox, Yue Cheng, Judy Fox

High Performance Dataframes from Parallel Processing Patterns

The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, massive operator coverage, and popularization of R and Python languages have heavily…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-06 Niranda Perera, Supun Kamburugamuve, Chathura Widanage, Vibhatha Abeykoon, Ahmet Uyar, Kaiying Shan, Hasara Maithree, Damitha Lenadora, Thejaka Amila Kanewala, Geoffrey Fox

Hybrid Cloud and HPC Approach to High-Performance Dataframes

Data pre-processing is a fundamental component in any data-driven application. With the increasing complexity of data processing operations and volume of data, Cylon, a distributed dataframe system, is developed to facilitate data…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-02 Kaiying Shan, Niranda Perera, Damitha Lenadora, Tianle Zhong, Arup Sarker, Supun Kamburugamuve, Thejaka Amila Kanewela, Chathura Widanage, Geoffrey Fox

Supercharging Distributed Computing Environments For High Performance Data Engineering

The data engineering and data science community has embraced the idea of using Python & R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-20 Niranda Perera, Kaiying Shan, Supun Kamburugamuwe, Thejaka Amila Kanewela, Chathura Widanage, Arup Sarker, Mills Staylor, Tianle Zhong, Vibhatha Abeykoon, Geoffrey Fox

Data Engineering for HPC with Python

Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-14 Vibhatha Abeykoon, Niranda Perera, Chathura Widanage, Supun Kamburugamuve, Thejaka Amila Kanewala, Hasara Maithree, Pulasthi Wickramasinghe, Ahmet Uyar, Geoffrey Fox

Design and Implementation of an Analysis Pipeline for Heterogeneous Data

Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science, can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields like genomics, climate…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-09 Arup Kumar Sarker, Aymen Alsaadi, Niranda Perera, Mills Staylor, Gregor von Laszewski, Matteo Turilli, Ozgur Ozan Kilic, Mikhail Titov, Andre Merzky, Shantenu Jha, Geoffrey Fox

In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes

The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-06 Niranda Perera, Arup Kumar Sarker, Mills Staylor, Gregor von Laszewski, Kaiying Shan, Supun Kamburugamuve, Chathura Widanage, Vibhatha Abeykoon, Thejaka Amila Kanewela, Geoffrey Fox

Deep RC: A Scalable Data Engineering and Deep Learning Pipeline

Significant obstacles exist in scientific domains including genetics, climate modeling, and astronomy due to the management, preprocessing, and training on complicated data for deep learning. Even while several large-scale solutions offer…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-23 Arup Kumar Sarker, Aymen Alsaadi, Alexander James Halpern, Prabhath Tangella, Mikhail Titov, Niranda Perera, Mills Staylor, Gregor von Laszewski, Shantenu Jha, Geoffrey Fox

Toward Heterogeneous, Distributed, and Energy-Efficient Computing with SYCL

Programming modern high-performance computing systems is challenging due to the need to efficiently program GPUs and accelerators and to handle data movement between nodes. The C++ language has been continuously enhanced in recent years…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-12 Biagio Cosenza, Lorenzo Carpentieri, Kaijie Fan, Marco D'Antonio, Peter Thoman, Philip Salzmann

Productivity, Portability, Performance: Data-Centric Python

Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python…

Programming Languages · Computer Science 2021-08-24 Alexandros Nikolaos Ziogas, Timo Schneider, Tal Ben-Nun, Alexandru Calotoiu, Tiziano De Matteis, Johannes de Fine Licht, Luca Lavarini, Torsten Hoefler

Landscape of High-performance Python to Develop Data Science and Machine Learning Applications

Python has become the prime language for application development in the Data Science and Machine Learning domains. However, data scientists are not necessarily experienced programmers. While Python lets them quickly implement their…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-24 Oscar Castro, Pierrick Bruneau, Jean-Sébastien Sottet, Dario Torregrossa

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like CuBLAS and NCCL provide optimized primitives, they lack the flexibility required for…

Programming Languages · Computer Science 2026-05-06 Size Zheng, Xuegui Zheng, Hanshi Sun, Qi Hou, Wenlei Bao, Shiyu Li, Haojie Duanmu, Jin Fang, Chenli Xue, Chenhui Huang, Yuanqiang Liu, Renze Chen, Ningxin Zheng, Dongyang Wang, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, Xin Liu

A Unifying Framework to Enable Artificial Intelligence in High Performance Computing Workflows

Current trends point to a future where large-scale scientific applications are tightly-coupled HPC/AI hybrids. Hence, we urgently need to invest in creating a seamless, scalable framework where HPC and AI/ML can efficiently work together…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Jens Domke, Mohamed Wahib, Anshu Dubey, Tal Ben-Nun, Erik W. Draeger

Sparklen: A Statistical Learning Toolkit for High-Dimensional Hawkes Processes in Python

This paper introduces Sparklen, a statistical learning toolkit for Hawkes processes in Python, designed to bring together efficiency and ease of use. The purpose of this package is to provide the Python community with a complete suite of…

Methodology · Statistics 2025-03-31 Romain Edmond Lacoste

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

Enabling Machine Learning-Ready HPC Ensembles with Merlin

With the growing complexity of computational and experimental facilities, many scientific researchers are turning to machine learning (ML) techniques to analyze large scale ensemble data. With complexities such as multi-component workflows,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-05 J. Luc Peterson, Ben Bay, Joe Koning, Peter Robinson, Jessica Semler, Jeremy White, Rushil Anirudh, Kevin Athey, Peer-Timo Bremer, Francesco Di Natale, David Fox, Jim A. Gaffney, Sam A. Jacobs, Bhavya Kailkhura, Bogdan Kustowski, Steven Langer, Brian Spears, Jayaraman Thiagarajan, Brian Van Essen, Jae-Seung Yeom

HPTMT: Operator-Based Architecture for Scalable High-Performance Data-Intensive Frameworks

Data-intensive applications impact many domains, and their steadily increasing size and complexity demands high-performance, highly usable environments. We integrate a set of ideas developed in various data science and data engineering…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-08-02 Supun Kamburugamuve, Chathura Widanage, Niranda Perera, Vibhatha Abeykoon, Ahmet Uyar, Thejaka Amila Kanewala, Gregor von Laszewski, Geoffrey Fox

A Comparison of Big Data Frameworks on a Layered Dataflow Model

In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-17 Claudia Misale, Maurizio Drocco, Marco Aldinucci, Guy Tremblay

HPTMT Parallel Operators for High Performance Data Science & Data Engineering

Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-08-16 Vibhatha Abeykoon, Supun Kamburugamuve, Chathura Widanage, Niranda Perera, Ahmet Uyar, Thejaka Amila Kanewala, Gregor von Laszewski, Geoffrey Fox