English
Related papers

Related papers: Lifetime-Based Memory Management for Distributed D…

200 papers

Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-08 Jonathan Will , Lauritz Thamsen , Dominik Scheinert , Odej Kao

Spark is an in-memory analytics platform that targets commodity server environments today. It relies on the Hadoop Distributed File System (HDFS) to persist intermediate checkpoint states and final processing results. In Spark, immutable…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-08-22 Mijung Kim , Jun Li , Haris Volos , Manish Marwah , Alexander Ulanov , Kimberly Keeton , Joseph Tucek , Lucy Cherkasova , Le Xu , Pradeep Fernando

Data processing frameworks such as Apache Beam and Apache Spark are used for a wide range of applications, from logs analysis to data preparation for DNN training. It is thus unsurprising that there has been a large amount of work on…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-07 Ubaid Ullah Hafeez , Martin Maas , Mustafa Uysal , Richard McDougall

Big data analytics frameworks, such as Spark and Giraph, need to process and cache massive amounts of data that do not always fit on the heap. Therefore, frameworks temporarily move long-lived objects outside the managed heap (off-heap) on…

With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analysis Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-17 Shanjiang Tang , Bingsheng He , Ce Yu , Yusen Li , Kun Li

Data analytic applications built upon big data processing frameworks such as Apache Spark are an important class of applications. Many of these applications are not latency-sensitive and thus can run as batch jobs in data centers. By…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-03 Vicent Sanz Marco , Ben Taylor , Barry Porter , Zheng Wang

Storage and memory systems for modern data analytics are heavily layered, managing shared persistent data, cached data, and non-shared execution data in separate systems such as distributed file system like HDFS, in-memory file system like…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-12-18 Jia Zou , Arun Iyengar , Chris Jermaine

Memory-compute disaggregation promises transparent elasticity, high utilization and balanced usage for resources in data centers by physically separating memory and compute into network-attached resource "blades". However, existing designs…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-22 Seung-seob Lee , Yanpeng Yu , Yupeng Tang , Anurag Khandelwal , Lin Zhong , Abhishek Bhattacharjee

In the era of big data and cloud computing, large amounts of data are generated from user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such…

Performance · Computer Science 2018-05-09 Zhengyu Yang , Danlin Jia , Stratis Ioannidis , Ningfang Mi , Bo Sheng

Although a data processing system often works as a batch processing system, many enterprises deploy such a system as a service, which we call the service-oriented data processing system. It has been shown that in-memory data processing…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-03-30 Xuanhua Shi , Xiong Zhang , Ligang He , Hai Jin , Zhixiang Ke , Song Wu

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-09 Alexandru Uta , Bogdan Ghit , Ankur Dave , Jan Rellermeyer , Peter Boncz

Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs in both type and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-14 Jonathan Will , Onur Arslan , Jonathan Bader , Dominik Scheinert , Lauritz Thamsen

As AI workloads drive increasing memory requirements, domain-specific accelerators need higher-density on-chip memory beyond what current SRAM scaling trends can provide. Simultaneously, the vast amounts of short-lived data in these…

Multi-core architectures feature an intricate hierarchy of cache memories, with multiple levels and sizes. To adequately decompose an application according to the traits of a particular memory hierarchy is a cumbersome task that may be…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-20 Hervé Paulino , Nuno Delgado

Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs is often challenging. For efficient execution, individual…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-27 Jonathan Will , Nico Treide , Lauritz Thamsen , Odej Kao

Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs -- that neither lead to bottlenecks nor to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-11 Jonathan Will , Lauritz Thamsen , Jonathan Bader , Dominik Scheinert , Odej Kao

Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload characteristics, since developing and tuning a…

Machine Learning · Computer Science 2019-08-23 Hongzi Mao , Malte Schwarzkopf , Shaileshh Bojja Venkatakrishnan , Zili Meng , Mohammad Alizadeh

Internet-scale web applications are becoming increasingly storage-intensive and rely heavily on in-memory object caching to attain required I/O performance. We argue that the emerging serverless computing paradigm provides a well-suited,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-29 Ao Wang , Jingyuan Zhang , Xiaolong Ma , Ali Anwar , Lukas Rupprecht , Dimitrios Skourtis , Vasily Tarasov , Feng Yan , Yue Cheng

Storage-class memory (SCM) combines the benefits of a solid-state memory, such as high-performance and robustness, with the archival capabilities and low cost of conventional hard-disk magnetic storage. Among candidate solid-state…

Hardware Architecture · Computer Science 2017-04-19 Wonil Choi , Mohammad Arjomand , Myoungsoo Jung , Mahmut Kandemir

Today's big data clusters based on the MapReduce paradigm are capable of executing analysis jobs with multiple priorities, providing differential latency guarantees. Traces from production systems show that the latency advantage of…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-17 Robert Birke , Isabelly Rocha , Juan Perez , Valerio Schiavoni , Pascal Felber , Lydia Y. Chen
‹ Prev 1 2 3 10 Next ›