Related papers: High Performance Data Engineering Everywhere
In the current era of Big Data, data engineering has transformed into an essential field of study across many branches of science. Advancements in Artificial Intelligence (AI) have broadened the scope of data engineering and opened up new…
Data is found everywhere, from health and human infrastructure to the surge of sensors and the proliferation of internet-connected devices. To meet this challenge, the data engineering field has expanded significantly in recent years in…
The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, massive operator coverage, and popularization of R and Python languages have heavily…
Data pre-processing is a fundamental component in any data-driven application. With the increasing complexity of data processing operations and volume of data, Cylon, a distributed dataframe system, is developed to facilitate data…
The data engineering and data science community has embraced the idea of using Python & R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to…
Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation,…
Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science, can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields like genomics, climate…
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more…
Significant obstacles exist in scientific domains including genetics, climate modeling, and astronomy due to the difficulty of managing, preprocessing, and training on complicated data for deep learning. Although several large-scale solutions offer…
Programming modern high-performance computing systems is challenging due to the need to efficiently program GPUs and accelerators and to handle data movement between nodes. The C++ language has been continuously enhanced in recent years…
Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python…
Python has become the prime language for application development in the Data Science and Machine Learning domains. However, data scientists are not necessarily experienced programmers. While Python lets them quickly implement their…
The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like cuBLAS and NCCL provide optimized primitives, they lack the flexibility required for…
Current trends point to a future where large-scale scientific applications are tightly-coupled HPC/AI hybrids. Hence, we urgently need to invest in creating a seamless, scalable framework where HPC and AI/ML can efficiently work together…
This paper introduces Sparklen, a statistical learning toolkit for Hawkes processes in Python, designed to bring together efficiency and ease of use. The purpose of this package is to provide the Python community with a complete suite of…
With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…
With the growing complexity of computational and experimental facilities, many scientific researchers are turning to machine learning (ML) techniques to analyze large scale ensemble data. With complexities such as multi-component workflows,…
Data-intensive applications impact many domains, and their steadily increasing size and complexity demand high-performance, highly usable environments. We integrate a set of ideas developed in various data science and data engineering…
In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models, for which only…
Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient…