Related papers: Large-scale text processing pipeline with Apache S…

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 George K. Thiruvathukal , Cameron Christensen , Xiaoyong Jin , François Tessier , Venkatram Vishwanath

Declarative Data Pipeline for Large Scale ML Services

Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-07 Yunzhao Yang , Runhui Wang , Xuanqing Liu , Adit Krishnan , Yefan Tao , Yuqian Deng , Kuangyou Yao , Peiyuan Sun , Henrik Johnson , Aditi sinha , Davor Golac , Gerald Friedland , Usman Shakeel , Daryl Cooke , Joe Sullivan , Madhusudhanan Chandrasekaran , Chris Kong

High-Dimensional Data Processing: Benchmarking Machine Learning and Deep Learning Architectures in Local and Distributed Environments

This document reports the sequence of practices and methodologies implemented during the Big Data course. It details the workflow beginning with the processing of the Epsilon dataset through group and individual strategies, followed by text…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-12 Julian Rodriguez , Piotr Lopez , Emiliano Lerma , Rafael Medrano , Jacobo Hernandez

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta , Hardeo Thakur , Ritvik Shrivastava , Pulkit Kumar , Sreyashi Nag

Real-time Text Analytics Pipeline Using Open-source Big Data Tools

Real-time text processing systems are required in many domains to quickly identify patterns, trends, sentiments, and insights. Nowadays, social networks, e-commerce stores, blogs, scientific experiments, and server logs are main sources…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-12-13 Hassan Nazeer , Waheed Iqbal , Fawaz Bokhari , Faisal Bukhari , Shuja Ur Rehman Baig

A Survey of State Management in Big Data Processing Systems

State management and its use in diverse applications varies widely across big data processing systems. This is evident in both the research literature and existing systems, such as Apache Flink, Apache Samza, Apache Spark, and Apache Storm.…

Databases · Computer Science 2018-08-02 Quoc-Cuong To , Juan Soto , Volker Markl

Reproducible Experiments for Comparing Apache Flink and Apache Spark on Public Clouds

Big data processing is a hot topic in today's computer science world. There is a significant demand for analysing big data to satisfy many requirements of many industries. Emergence of the Kappa architecture created a strong requirement for…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-17 Shelan Perera , Ashansa Perera , Kamal Hakimzadeh

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

Querying very large RDF data sets in an efficient manner requires a sophisticated distribution strategy. Several innovative solutions have recently been proposed for optimizing data distribution with predefined query workloads. This paper…

Databases · Computer Science 2015-07-10 Olivier Curé , Hubert Naacke , Mohamed-Amine Baazizi , Bernd Amann

Distributed Streaming Analytics on Large-scale Oceanographic Data using Apache Spark

Real-world data from diverse domains require real-time scalable analysis. Large-scale data processing frameworks or engines such as Hadoop fall short when results are needed on-the-fly. Apache Spark's streaming library is increasingly…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-02 Janak Dahal , Elias Ioup , Shaikh Arifuzzaman , Mahdi Abdelguerfi

A Spark ML driven preprocessing approach for deep learning based scholarly data applications

Big data has found applications in multiple domains. One of the largest sources of textual big data is scientific documents and papers. Big scholarly data have been used in numerous ways to create innovative applications such as…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-19 Samiya Khan , Xiufeng Liu , Mansaf Alam

Mining Area Skyline Objects from Map-based Big Data using Apache Spark Framework

The computation of the skyline provides a mechanism for utilizing multiple location-based criteria to identify optimal data points. However, the efficiency of these computations diminishes and becomes more challenging as the input data…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-05 Chen Li , Ye Zhu , Yang Cao , Jinli Zhang , Annisa Annisa , Debo Cheng , Yasuhiko Morimoto

Imbalanced Big Data Oversampling: Taxonomy, Algorithms, Software, Guidelines and Future Directions

Learning from imbalanced data is among the most challenging areas in contemporary machine learning. This becomes even more difficult when considered the context of big data that calls for dedicated architectures capable of high-performance…

Machine Learning · Computer Science 2022-11-16 William C. Sleeman , Bartosz Krawczyk

Scalable Formal Concept Analysis algorithm for large datasets using Spark

In the process of knowledge discovery and representation in large datasets using formal concept analysis, complexity plays a major role in identifying all the formal concepts and constructing the concept lattice(digraph of the concepts).…

Artificial Intelligence · Computer Science 2018-07-09 Raghavendra K Chunduri , Aswani Kumar Cherukuri

InferSpark: Statistical Inference at Scale

The Apache Spark stack has enabled fast large-scale data processing. Despite a rich library of statistical models and inference algorithms, it does not give domain users the ability to develop their own models. The emergence of…

Databases · Computer Science 2017-10-10 Zhuoyue Zhao , Jialing Pei , Eric Lo , Kenny Q. Zhu , Chris Liu

Nuova frontiera della classificazione testuale: Big data e calcolo distribuito

This document was created in order to study the algorithms for the categorization of phrases and rank them using the facilities provided by the framework Apache Spark. Starting from the study illustrated in the publication "Classifying…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-22 Marco Covelli , Massimiliano Morrelli

Benchmarking Distributed Stream Data Processing Systems

The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to…

Databases · Computer Science 2019-06-27 Jeyhun Karimov , Tilmann Rabl , Asterios Katsifodimos , Roman Samarev , Henri Heiskanen , Volker Markl

Understanding and Optimizing the Performance of Distributed Machine Learning Applications on Apache Spark

In this paper we explore the performance limits of Apache Spark for machine learning applications. We begin by analyzing the characteristics of a state-of-the-art distributed machine learning algorithm implemented in Spark and compare it to…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-21 Celestine Dünner , Thomas Parnell , Kubilay Atasu , Manolis Sifalakis , Haralampos Pozidis

Large-Scale Network Embedding in Apache Spark

Network embedding has been widely used in social recommendation and network analysis, such as recommendation systems and anomaly detection with graphs. However, most of previous approaches cannot handle large graphs efficiently, due to that…

Social and Information Networks · Computer Science 2025-10-30 Wenqing Lin

A Survey on Spark Ecosystem for Big Data Processing

With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analysis Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-17 Shanjiang Tang , Bingsheng He , Ce Yu , Yusen Li , Kun Li

Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics

The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly on the actual implementation of dedicated end-to-end data pipelines. A solution to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-17 Matteo Migliorini , Riccardo Castellotti , Luca Canali , Marco Zanetti