Related papers: Data Engineering for HPC with Python
The amazing advances being made in the fields of machine and deep learning are a highlight of the Big Data era for both enterprise and research communities. Modern applications require resources beyond a single node's ability to provide.…
Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient…
In the current era of Big Data, data engineering has transformed into an essential field of study across many branches of science. Advancements in Artificial Intelligence (AI) have broadened the scope of data engineering and opened up new…
Data-driven modeling is an approach in energy systems modeling that has been gaining popularity. In data-driven modeling, machine learning methods such as linear regression, neural networks or decision-tree based methods are being applied.…
While deep learning excels in natural image and language processing, its application to high-dimensional data faces computational challenges due to the dimensionality curse. Current large-scale data tools focus on business-oriented…
The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, massive operator coverage, and popularization of R and Python languages have heavily…
Machine learning has proved to be a useful tool for extracting knowledge from scientific data in numerous research fields, including astrophysics, genomics, and molecular dynamics. Often, data sets from these research areas need to be…
PaPy, which stands for parallel pipelines in Python, is a highly flexible framework that enables the construction of robust, scalable workflows for either generating or processing voluminous datasets. A workflow is created from user-written…
Data-intensive applications impact many domains, and their steadily increasing size and complexity demands high-performance, highly usable environments. We integrate a set of ideas developed in various data science and data engineering…
This article describes a geometric partitioning software that can be used for quick computation of data partitions on many-core HPC machines. It is most suited for dynamic applications with load distributions that vary with time.…
Data pre-processing is a fundamental component in any data-driven application. With the increasing complexity of data processing operations and volume of data, Cylon, a distributed dataframe system, is developed to facilitate data…
Python has become the prime language for application development in the Data Science and Machine Learning domains. However, data scientists are not necessarily experienced programmers. While Python lets them quickly implement their…
The ability to express a program as a hierarchical composition of parts is an essential tool in managing the complexity of software and a key abstraction this provides is to separate the representation of data from the computation. Many…
This paper presents a new technique for data slicing of distributed programs running on a hierarchy of machines. Data slicing can be realized as a program transformation that partitions heaps of machines in a hierarchy into independent…
Python has become a standard scientific computing language with fast-growing support of machine learning and data analysis modules, as well as an increasing usage of big data. The Dynamic Distributed Dimensional Data Model (D4M) offers a…
We propose a framework for training neural networks that are coupled with partial differential equations (PDEs) in a parallel computing environment. Unlike most distributed computing frameworks for deep neural networks, our focus is to…
We introduce D2O, a Python module for cluster-distributed multi-dimensional numerical arrays. It acts as a layer of abstraction between the algorithm code and the data-distribution logic. The main goal is to achieve usability without losing…
Smarter applications are making better use of the insights gleaned from data, having an impact on every industry and research discipline. At the core of this revolution lies the tools and the methods that are driving it, from processing the…
Programmable data planes allow users to define their own data plane algorithms for network devices including appropriate data plane application programming interfaces (APIs) which may be leveraged by user-defined software-defined networking…
This chapter introduces the state-of-the-art in the emerging area of combining High Performance Computing (HPC) with Big Data Analysis. To understand the new area, the chapter first surveys the existing approaches to integrating HPC with…