Related papers: Lifetime-Based Memory Management for Distributed D…

Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-08 Jonathan Will, Lauritz Thamsen, Dominik Scheinert, Odej Kao

Sparkle: Optimizing Spark for Large Memory Machines and Analytics

Spark is an in-memory analytics platform that targets commodity server environments today. It relies on the Hadoop Distributed File System (HDFS) to persist intermediate checkpoint states and final processing results. In Spark, immutable…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-22 Mijung Kim, Jun Li, Haris Volos, Manish Marwah, Alexander Ulanov, Kimberly Keeton, Joseph Tucek, Lucy Cherkasova, Le Xu, Pradeep Fernando

Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers

Data processing frameworks such as Apache Beam and Apache Spark are used for a wide range of applications, from log analysis to data preparation for DNN training. It is thus unsurprising that there has been a large amount of work on…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-07 Ubaid Ullah Hafeez, Martin Maas, Mustafa Uysal, Richard McDougall

Garbage Collection or Serialization? Between a Rock and a Hard Place!

Big data analytics frameworks, such as Spark and Giraph, need to process and cache massive amounts of data that do not always fit on the heap. Therefore, frameworks temporarily move long-lived objects outside the managed heap (off-heap) on…

Programming Languages · Computer Science 2023-01-10 Iacovos G. Kolokasis, Giannos Evdorou, Anastasios Papagiannis, Foivos Zakkak, Christos Kozanitis, Shoaib Akram, Polyvios Pratikakis, Angelos Bilas

A Survey on Spark Ecosystem for Big Data Processing

With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-17 Shanjiang Tang, Bingsheng He, Ce Yu, Yusen Li, Kun Li

Improving Spark Application Throughput Via Memory Aware Task Co-location: A Mixture of Experts Approach

Data analytic applications built upon big data processing frameworks such as Apache Spark are an important class of applications. Many of these applications are not latency-sensitive and thus can run as batch jobs in data centers. By…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-03 Vicent Sanz Marco, Ben Taylor, Barry Porter, Zheng Wang

Pangea: Monolithic Distributed Storage for Data Analytics

Storage and memory systems for modern data analytics are heavily layered, managing shared persistent data, cached data, and non-shared execution data in separate systems such as distributed file systems like HDFS, in-memory file systems like…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-12-18 Jia Zou, Arun Iyengar, Chris Jermaine

MIND: In-Network Memory Management for Disaggregated Data Centers

Memory-compute disaggregation promises transparent elasticity, high utilization and balanced usage for resources in data centers by physically separating memory and compute into network-attached resource "blades". However, existing designs…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-22 Seung-seob Lee, Yanpeng Yu, Yupeng Tang, Anurag Khandelwal, Lin Zhong, Abhishek Bhattacharjee

Intermediate Data Caching Optimization for Multi-Stage and Parallel Big Data Frameworks

In the era of big data and cloud computing, large amounts of data are generated from user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such…

Performance · Computer Science 2018-05-09 Zhengyu Yang, Danlin Jia, Stratis Ioannidis, Ningfang Mi, Bo Sheng

MURS: Mitigating Memory Pressure in Service-oriented Data Processing Systems

Although a data processing system often works as a batch processing system, many enterprises deploy such a system as a service, which we call the service-oriented data processing system. It has been shown that in-memory data processing…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-30 Xuanhua Shi, Xiong Zhang, Ligang He, Hai Jin, Zhixiang Ke, Song Wu

In-Memory Indexed Caching for Distributed Data Processing

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-09 Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, Peter Boncz

Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud

Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs in both type and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-14 Jonathan Will, Onur Arslan, Jonathan Bader, Dominik Scheinert, Lauritz Thamsen

GainSight: A Unified Framework for Data Lifetime Profiling and Heterogeneous Memory Composition

As AI workloads drive increasing memory requirements, domain-specific accelerators need higher-density on-chip memory beyond what current SRAM scaling trends can provide. Simultaneously, the vast amounts of short-lived data in these…

Hardware Architecture · Computer Science 2025-08-06 Peijing Li, Matthew Hung, Yiming Tan, Konstantin Hoßfeld, Jake Cheng Jiajun, Shuhan Liu, Lixian Yan, Xinxin Wang, Philip Levis, H.-S. Philip Wong, Thierry Tambe

Cache-Conscious Run-time Decomposition of Data Parallel Computations

Multi-core architectures feature an intricate hierarchy of cache memories, with multiple levels and sizes. To adequately decompose an application according to the traits of a particular memory hierarchy is a cumbersome task that may be…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-20 Hervé Paulino, Nuno Delgado

Experimentally Evaluating the Resource Efficiency of Big Data Autoscaling

Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs is often challenging. For efficient execution, individual…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-27 Jonathan Will, Nico Treide, Lauritz Thamsen, Odej Kao

Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-11 Jonathan Will, Lauritz Thamsen, Jonathan Bader, Dominik Scheinert, Odej Kao

Learning Scheduling Algorithms for Data Processing Clusters

Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload characteristics, since developing and tuning a…

Machine Learning · Computer Science 2019-08-23 Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, Mohammad Alizadeh

InfiniCache: Exploiting Ephemeral Serverless Functions to Build a Cost-Effective Memory Cache

Internet-scale web applications are becoming increasingly storage-intensive and rely heavily on in-memory object caching to attain required I/O performance. We argue that the emerging serverless computing paradigm provides a well-suited,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-29 Ao Wang, Jingyuan Zhang, Xiaolong Ma, Ali Anwar, Lukas Rupprecht, Dimitrios Skourtis, Vasily Tarasov, Feng Yan, Yue Cheng

Exploiting Data Longevity for Enhancing the Lifetime of Flash-based Storage Class Memory

Storage-class memory (SCM) combines the benefits of solid-state memory, such as high performance and robustness, with the archival capabilities and low cost of conventional hard-disk magnetic storage. Among candidate solid-state…

Hardware Architecture · Computer Science 2017-04-19 Wonil Choi, Mohammad Arjomand, Myoungsoo Jung, Mahmut Kandemir

Differential Approximation and Sprinting for Multi-Priority Big Data Engines

Today's big data clusters based on the MapReduce paradigm are capable of executing analysis jobs with multiple priorities, providing differential latency guarantees. Traces from production systems show that the latency advantage of…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-17 Robert Birke, Isabelly Rocha, Juan Perez, Valerio Schiavoni, Pascal Felber, Lydia Y. Chen