Related papers: Using Big Data Technologies for HEP Analysis

CMS Analysis and Data Reduction with Apache Spark

Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-02 Oliver Gutsche , Luca Canali , Illia Cremer , Matteo Cremonesi , Peter Elmer , Ian Fisk , Maria Girone , Bo Jayatilaka , Jim Kowalkowski , Viktor Khristenko , Evangelos Motesnitsalis , Jim Pivarski , Saba Sehrish , Kacper Surdy , Alexey Svyatkovskiy

Big Data in HEP: A comprehensive use case study

Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was the first to develop suitable software and computing tools for this task. In recent times, new toolkits and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-23 Oliver Gutsche , Matteo Cremonesi , Peter Elmer , Bo Jayatilaka , Jim Kowalkowski , Jim Pivarski , Saba Sehrish , Cristina Mantilla Suarez , Alexey Svyatkovskiy , Nhan Tran

Exploiting Apache Spark platform for CMS computing analytics

The CERN IT provides a set of Hadoop clusters featuring more than 5 PBytes of raw storage with different open-source, user-level tools available for analytical purposes. The CMS experiment started collecting a large set of computing…

Data Analysis, Statistics and Probability · Physics 2017-11-03 Marco Meoni , Valentin Kuznetsov , Luca Menichetti , Justinas Rumševičius , Tommaso Boccali , Daniele Bonacorsi

Toward real-time data query systems in HEP

Exploratory data analysis tools must respond quickly to a user's questions, so that the answer to one question (e.g. a visualized histogram or fit) can influence the next. In some SQL-based query systems used in industry, even very large…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-09 Jim Pivarski , David Lange , Thanat Jatuphattharachat

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 George K. Thiruvathukal , Cameron Christensen , Xiaoyong Jin , François Tessier , Venkatram Vishwanath

Gaining insight from large data volumes with ease

Efficient handling of large data-volumes becomes a necessity in today's world. It is driven by the desire to get more insight from the data and to gain a better understanding of user trends which can be transformed into economic incentives…

Data Analysis, Statistics and Probability · Physics 2019-10-02 Valentin Kuznetsov

Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics

The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly on the actual implementation of dedicated end-to-end data pipelines. A solution to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-17 Matteo Migliorini , Riccardo Castellotti , Luca Canali , Marco Zanetti

HEP Software Foundation Community White Paper Working Group - Data Analysis and Interpretation

At the heart of experimental high energy physics (HEP) is the development of facilities and instrumentation that provide sensitivity to new phenomena. Our understanding of nature at its most fundamental level is advanced through the…

Computational Physics · Physics 2018-04-12 Lothar Bauerdick , Riccardo Maria Bianchi , Brian Bockelman , Nuno Castro , Kyle Cranmer , Peter Elmer , Robert Gardner , Maria Girone , Oliver Gutsche , Benedikt Hegner , José M. Hernández , Bodhitha Jayatilaka , David Lange , Mark S. Neubauer , Daniel S. Katz , Lukasz Kreczko , James Letts , Shawn McKee , Christoph Paus , Kevin Pedro , Jim Pivarski , Martin Ritter , Eduardo Rodrigues , Tai Sakuma , Elizabeth Sexton-Kennedy , Michael D. Sokoloff , Carl Vuosalo , Frank Würthwein , Gordon Watts

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale

Today's high-performance computing (HPC) systems are heavily instrumented, generating logs containing information about abnormal events, such as critical conditions, faults, errors and failures, system resource utilization, and about the…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-24 Byung H. Park , Saurabh Hukerikar , Ryan Adamson , Christian Engelmann

Analyzing billion-objects catalog interactively: Apache Spark for physicists

Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in the industry, it remains rather limited in the academic community or often restricted to software engineers. The goal of this paper is…

Instrumentation and Methods for Astrophysics · Physics 2019-07-17 S. Plaszczynski , J. Peloton , C. Arnault , J. E. Campagne

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-23 Taha Tekdogan , Ali Cakmak

Collaborative Cloud Computing Framework for Health Data with Open Source Technologies

The proliferation of sensor technologies and advancements in data collection methods have enabled the accumulation of very large amounts of data. Increasingly, these datasets are considered for scientific research. However, the design of…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-28 Fatemeh Rouzbeh , Ananth Grama , Paul Griffin , Mohammad Adibuzzaman

A Big Data Analysis Framework Using Apache Spark and Deep Learning

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta , Hardeo Thakur , Ritvik Shrivastava , Pulkit Kumar , Sreyashi Nag

Performance Benefits of DataMPI: A Case Study with BigDataBench

Apache Hadoop and Spark are gaining prominence in Big Data processing and analytics. Both of them are widely deployed at Internet companies. On the other hand, high-performance data analysis requirements are causing academic and…

Performance · Computer Science 2014-03-17 Fan Liang , Chen Feng , Xiaoyi Lu , Zhiwei Xu

HEP-Frame: an Efficient Tool for Big Data Applications at the LHC

HEP-Frame is a new C++ package designed to efficiently perform analyses of data sets from a very large number of events, like those available at the Large Hadron Collider (LHC) at CERN, Geneva. It mainly targets high performance servers and…

High Energy Physics - Experiment · Physics 2023-03-10 A. Pereira , A. Onofre , A. Proenca

The Critical Importance of Software for HEP

Particle physics has an ambitious and broad global experimental programme for the coming decades. Large investments in building new facilities are already underway or under consideration. Scaling the present processing power and data…

High Energy Physics - Experiment · Physics 2025-06-13 HEP Software Foundation , Christina Agapopoulou , Claire Antel , Saptaparna Bhattacharya , Steven Gardiner , Krzysztof L. Genser , James Andrew Gooding , Alexander Held , Michel Hernandez Villanueva , Michel Jouvin , Tommaso Lari , Valeriia Lukashenko , Sudhir Malik , Alexander Moreno Briceño , Stephen Mrenna , Inês Ochoa , Joseph D. Osborn , Jim Pivarski , Alan Price , Eduardo Rodrigues , Richa Sharma , Nicholas Smith , Graeme Andrew Stewart , Anna Zaborowska , Dirk Zerwas , Maarten van Veghel

Analyzing Big Datasets of Genomic Sequences: Fast and Scalable Collection of k-mer Statistics

Distributed approaches based on the map-reduce programming paradigm have started to be proposed in the bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-07-05 Umberto Ferraro Petrillo , Mara Sorella , Giuseppe Cattaneo , Raffaele Giancarlo , Simona Rombo

ASCR/HEP Exascale Requirements Review Report

This draft report summarizes and details the findings, results, and recommendations derived from the ASCR/HEP Exascale Requirements Review meeting held in June, 2015. The main conclusions are as follows. 1) Larger, more capable computing…

Computational Physics · Physics 2016-04-19 Salman Habib , Robert Roser , Richard Gerber , Katie Antypas , Katherine Riley , Tim Williams , Jack Wells , Tjerk Straatsma , A. Almgren , J. Amundson , S. Bailey , D. Bard , K. Bloom , B. Bockelman , A. Borgland , J. Borrill , R. Boughezal , R. Brower , B. Cowan , H. Finkel , N. Frontiere , S. Fuess , L. Ge , N. Gnedin , S. Gottlieb , O. Gutsche , T. Han , K. Heitmann , S. Hoeche , K. Ko , O. Kononenko , T. LeCompte , Z. Li , Z. Lukic , W. Mori , P. Nugent , C. -K. Ng , G. Oleynik , B. O'Shea , N. Padmanabhan , D. Petravick , F. J. Petriello , J. Power , J. Qiang , L. Reina , T. J. Rizzo , R. Ryne , M. Schram , P. Spentzouris , D. Toussaint , J. -L. Vay , B. Viren , F. Wurthwein , L. Xiao

Towards Interactive, Adaptive and Result-aware Big Data Analytics

As data volumes grow across applications, analytics of large amounts of data is becoming increasingly important. Big data processing frameworks such as Apache Hadoop, Apache AsterixDB, and Apache Spark have been built to meet this demand. A…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-15 Avinash Kumar

Data Management for Physics Analysis in Phenix (BNL, RHIC)

Every year the PHENIX collaboration deals with an increasing volume of data (now about 1/4 PB/year). The more data there is, the more questions arise about how to process it all in the most efficient way. In the recent past, many developments in HEP…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Barbara Jacak , Roy Lacey , Dave Morrison , Irina Sourikova , Andrey Shevel , Qiu Zhiping