Related papers: Exploiting Apache Spark platform for CMS computing…

Using Big Data Technologies for HEP Analysis

The HEP community is approaching an era where the excellent performance of the particle accelerators in delivering collisions at high rates will force the experiments to record a large amount of information. The growing size of the datasets…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-10-02 · Matteo Cremonesi, Claudio Bellini, Bianny Bian, Luca Canali, Vasileios Dimakopoulos, Peter Elmer, Ian Fisk, Maria Girone, Oliver Gutsche, Siew-Yan Hoh, Bo Jayatilaka, Viktor Khristenko, Andrea Luiselli, Andrew Melo, Evangelos Evangelos, Dominick Olivito, Jacopo Pazzini, Jim Pivarski, Alexey Svyatkovskiy, Marco Zanetti

CMS Analysis and Data Reduction with Apache Spark

Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-11-02 · Oliver Gutsche, Luca Canali, Illia Cremer, Matteo Cremonesi, Peter Elmer, Ian Fisk, Maria Girone, Bo Jayatilaka, Jim Kowalkowski, Viktor Khristenko, Evangelos Motesnitsalis, Jim Pivarski, Saba Sehrish, Kacper Surdy, Alexey Svyatkovskiy

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-10-01 · George K. Thiruvathukal, Cameron Christensen, Xiaoyong Jin, François Tessier, Venkatram Vishwanath

The archive solution for distributed workflow management agents of the CMS experiment at LHC

The CMS experiment at the CERN LHC developed the Workflow Management Archive system to persistently store unstructured framework job report documents produced by distributed workflow management agents. In this paper we present its…

High Energy Physics - Experiment · Physics · 2018-01-12 · Valentin Kuznetsov, Nils Leif Fischer, Yuyi Guo

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale

Today's high-performance computing (HPC) systems are heavily instrumented, generating logs containing information about abnormal events, such as critical conditions, faults, errors and failures, system resource utilization, and about the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-08-24 · Byung H. Park, Saurabh Hukerikar, Ryan Adamson, Christian Engelmann

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science · 2017-11-28 · Anand Gupta, Hardeo Thakur, Ritvik Shrivastava, Pulkit Kumar, Sreyashi Nag

Performance Evaluation of Apache Spark MLlib Algorithms on an Intrusion Detection Dataset

The increasing use of the Internet and web services and the advent of the fifth generation of cellular network technology (5G), along with ever-growing Internet of Things (IoT) data traffic, will increase global internet usage. To ensure the…

Networking and Internet Architecture · Computer Science · 2022-12-13 · Ramin Atefinia, Mahmood Ahmadi

Advancing ATLAS DCS Data Analysis with a Modern Data Platform

This paper presents a modern and scalable framework for analyzing Detector Control System (DCS) data from the ATLAS experiment at CERN. The DCS data, stored in an Oracle database via the WinCC OA system, is optimized for transactional…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-01-24 · Luca Canali, Andrea Formica, Michelle Ann Solis

Identifying the potential of Near Data Computing for Apache Spark

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to stay at the forefront of big data analytics by being a unified framework for both batch and stream…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-07-31 · Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguade

Technical Report: On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2018-03-30 · Bilal Akil, Ying Zhou, Uwe Röhm

Deep Learning with Apache SystemML

Enterprises operate large data lakes using Hadoop and Spark frameworks that (1) run a plethora of tools to automate powerful data preparation/transformation pipelines, (2) run on shared, large clusters to (3) perform many different…

Machine Learning · Computer Science · 2018-02-14 · Niketan Pansare, Michael Dusenberry, Nakul Jindal, Matthias Boehm, Berthold Reinwald, Prithviraj Sen

Sparkle: Optimizing Spark for Large Memory Machines and Analytics

Spark is an in-memory analytics platform that targets commodity server environments today. It relies on the Hadoop Distributed File System (HDFS) to persist intermediate checkpoint states and final processing results. In Spark, immutable…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-08-22 · Mijung Kim, Jun Li, Haris Volos, Manish Marwah, Alexander Ulanov, Kimberly Keeton, Joseph Tucek, Lucy Cherkasova, Le Xu, Pradeep Fernando

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-09-23 · Taha Tekdogan, Ali Cakmak

FITS Data Source for Apache Spark

We investigate the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys. Apache Spark's attempts to address big data problems have hitherto proved successful in industry, but…

Instrumentation and Methods for Astrophysics · Physics · 2018-10-17 · Julien Peloton, Christian Arnault, Stéphane Plaszczynski

The CMS monitoring infrastructure and applications

The globally distributed computing infrastructure required to cope with the multi-petabyte datasets produced by the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) at CERN comprises several subsystems, such as…

Software Engineering · Computer Science · 2020-07-08 · Christian Ariza-Porras, Valentin Kuznetsov, Federica Legger

Optimizing CMS build infrastructure via Apache Mesos

The Offline Software of the CMS Experiment at the Large Hadron Collider (LHC) at CERN consists of 6M lines of in-house code, developed over a decade by nearly 1000 physicists, as well as a comparable amount of general use open-source code.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2016-01-20 · David Abdurachmanov, Alessandro Degano, Peter Elmer, Giulio Eulisse, David Mendez, Shahzad Muzaffar

Analyzing billion-objects catalog interactively: Apache Spark for physicists

Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in industry, its use remains rather limited in the academic community or is often restricted to software engineers. The goal of this paper is…

Instrumentation and Methods for Astrophysics · Physics · 2019-07-17 · S. Plaszczynski, J. Peloton, C. Arnault, J. E. Campagne

Collaborative Cloud Computing Framework for Health Data with Open Source Technologies

The proliferation of sensor technologies and advancements in data collection methods have enabled the accumulation of very large amounts of data. Increasingly, these datasets are considered for scientific research. However, the design of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-07-28 · Fatemeh Rouzbeh, Ananth Grama, Paul Griffin, Mohammad Adibuzzaman

Improving Spark Application Throughput Via Memory Aware Task Co-location: A Mixture of Experts Approach

Data analytic applications built upon big data processing frameworks such as Apache Spark are an important class of applications. Many of these applications are not latency-sensitive and thus can run as batch jobs in data centers. By…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-10-03 · Vicent Sanz Marco, Ben Taylor, Barry Porter, Zheng Wang

Large-Scale Intelligent Microservices

Deploying Machine Learning (ML) algorithms within databases is a challenge due to the varied computational footprints of modern ML algorithms and the myriad of database technologies each with its own restrictive syntax. We introduce an…

Artificial Intelligence · Computer Science · 2022-03-17 · Mark Hamilton, Nick Gonsalves, Christina Lee, Anand Raman, Brendan Walsh, Siddhartha Prasad, Dalitso Banda, Lucy Zhang, Mei Gao, Lei Zhang, William T. Freeman