English
Related papers

Related papers: CMS Analysis and Data Reduction with Apache Spark

200 papers

The HEP community is approaching an era were the excellent performances of the particle accelerators in delivering collision at high rate will force the experiments to record a large amount of information. The growing size of the datasets…

Experimental Particle Physics has been at the forefront of analyzing the worlds largest datasets for decades. The HEP community was the first to develop suitable software and computing tools for this task. In recent times, new toolkits and…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-23 Oliver Gutsche , Matteo Cremonesi , Peter Elmer , Bo Jayatilaka , Jim Kowalkowski , Jim Pivarski , Saba Sehrish , Cristina Mantilla Surez , Alexey Svyatkovskiy , Nhan Tran

The CERN IT provides a set of Hadoop clusters featuring more than 5 PBytes of raw storage with different open-source, user-level tools available for analytical purposes. The CMS experiment started collecting a large set of computing…

Data Analysis, Statistics and Probability · Physics 2017-11-03 Marco Meoni , Valentin Kuznetsov , Luca Menichetti , Justinas Rumševičius , Tommaso Boccali , Daniele Bonacorsi

Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in the industry, it remains rather limited in the academic community or often restricted to software engineers. The goal of this paper is…

Instrumentation and Methods for Astrophysics · Physics 2019-07-17 S. Plaszczynski , J. Peloton , C. Arnault , J. E. Campagne

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level,…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-30 Bilal Akil , Ying Zhou , Uwe Röhm

The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly on the actual implementation of dedicated end-to-end data pipelines. A solution to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-17 Matteo Migliorini , Riccardo Castellotti , Luca Canali , Marco Zanetti

Efficient handling of large data-volumes becomes a necessity in today's world. It is driven by the desire to get more insight from the data and to gain a better understanding of user trends which can be transformed into economic incentives…

Data Analysis, Statistics and Probability · Physics 2019-10-02 Valentin Kuznetsov

At the heart of experimental high energy physics (HEP) is the development of facilities and instrumentation that provide sensitivity to new phenomena. Our understanding of nature at its most fundamental level is advanced through the…

With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular,…

Databases · Computer Science 2017-11-28 Anand Gupta , Hardeo Thakur , Ritvik Shrivastava , Pulkit Kumar , Sreyashi Nag

Data from high-energy physics experiments are collected with significant financial and human effort and are mostly unique. However, until recently no coherent strategy existed for data preservation and re-use, and many important and complex…

High Energy Physics - Experiment · Physics 2015-06-03 Roman Kogler , David M. South , Michael Steder

As dataset sizes increase, data analysis tasks in high performance computing (HPC) are increasingly dependent on sophisticated dataflows and out-of-core methods for efficient system utilization. In addition, as HPC systems grow, memory…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-10-01 George K. Thiruvathukal , Cameron Christensen , Xiaoyong Jin , François Tessier , Venkatram Vishwanath

Exploratory data analysis tools must respond quickly to a user's questions, so that the answer to one question (e.g. a visualized histogram or fit) can influence the next. In some SQL-based query systems used in industry, even very large…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-09 Jim Pivarski , David Lange , Thanat Jatuphattharachat

Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-23 Taha Tekdogan , Ali Cakmak

There are numerous approaches to building analysis applications across the high-energy physics community. Among them are Python-based, or at least Python-driven, analysis workflows. We aim to ease the adoption of a Python-based analysis…

Computational Physics · Physics 2018-04-25 David Lange

Particle physics has an ambitious and broad global experimental programme for the coming decades. Large investments in building new facilities are already underway or under consideration. Scaling the present processing power and data…

Today's high-performance computing (HPC) systems are heavily instrumented, generating logs containing information about abnormal events, such as critical conditions, faults, errors and failures, system resource utilization, and about the…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-24 Byung H. Park , Saurabh Hukerikar , Ryan Adamson , Christian Engelmann

As data volumes grow across applications, analytics of large amounts of data is becoming increasingly important. Big data processing frameworks such as Apache Hadoop, Apache AsterixDB, and Apache Spark have been built to meet this demand. A…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-15 Avinash Kumar

With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analysis Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-17 Shanjiang Tang , Bingsheng He , Ce Yu , Yusen Li , Kun Li

To process data more efficiently, big data frameworks provide data abstractions to developers. However, due to the abstraction, there may be many challenges for developers to understand and debug the data processing code. To uncover the…

Software Engineering · Computer Science 2021-03-29 Zehao Wang
‹ Prev 1 2 3 10 Next ›