Related papers: Algorithm-Based Checkpoint-Recovery for the Conjug…

200 papers

We study algorithmic approaches for recovering from the failure of several compute nodes in the parallel preconditioned conjugate gradient (PCG) solver on large-scale parallel computers. In particular, we analyze and extend an exact state…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-08-21 · Carlos Pachajoa, Markus Levonyak, Wilfried N. Gansterer, Jesper Larsson Träff

HPC systems are a critical resource for scientific research. The increased demand for computational power and memory ushers in the exascale era, in which supercomputers are designed to provide enormous computing power to meet these needs.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2022-08-11 · Yehonatan Fridman, Yaniv Snir, Harel Levin, Danny Hendler, Hagit Attiya, Gal Oren

Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP'13, pp. 167–176] has shown how to combine such a verification mechanism (a stability test checking the…

Data Structures and Algorithms · Computer Science · 2015-11-17 · Massimiliano Fasi, Julien Langou, Yves Robert, Bora Ucar

The observed and expected continued growth in the number of nodes in large-scale parallel computers gives rise to two major challenges: global communication operations are becoming major bottlenecks due to their limited scalability, and the…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-02-03 · Markus Levonyak, Christina Pacher, Wilfried N. Gansterer

Recurrent neural networks (RNNs) are valued for their computational efficiency and reduced memory requirements on tasks involving long sequence lengths but require high memory-processor bandwidth to train. Checkpointing techniques can…

Neural and Evolutionary Computing · Computer Science · 2024-12-17 · Wadjih Bencheikh, Jan Finkbeiner, Emre Neftci

Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-07-29 · Diego Montezanti, Enzo Rucci, Armando De Giusti, Marcelo Naiouf, Dolores Rexachs, Emilio Luque

The Preconditioned Conjugate Gradient method is often employed for the solution of linear systems of equations arising in numerical simulations of physical phenomena. While being widely used, the solver is also known for its lack of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-05-18 · Roman Iakymchuk, Maria Barreda, Stef Graillat, Jose I. Aliaga, Enrique S. Quintana-Orti

Dealing with hardware and software faults is an important problem as parallel and distributed systems scale to millions of processing cores and wide area networks. Traditional methods for dealing with faults include checkpoint-restart,…

Numerical Analysis · Computer Science · 2014-12-24 · David F. Gleich, Ananth Grama, Yao Zhu

Task-free online continual learning aims to alleviate catastrophic forgetting of the learner on a non-iid data stream. Experience Replay (ER) is a SOTA continual learning method, which is broadly used as the backbone algorithm for other…

Machine Learning · Computer Science · 2021-08-24 · Zhiyi Chen, Tong Lin

The paper proposes and optimizes a partial recovery training system, CPR, for recommendation models. CPR relaxes the consistency requirement by enabling non-failed nodes to proceed without loading checkpoints when a node fails during…

As we stride toward the exascale era, due to increasing complexity of supercomputers, hard and soft errors are causing more and more problems in high-performance scientific and engineering computation. In order to improve reliability…

Numerical Analysis · Mathematics · 2013-09-03 · Tao Cui, Jinchao Xu, Chen-Song Zhang

The preconditioned conjugate gradient (PCG) algorithm is one of the most popular algorithms for solving large-scale linear systems Ax = b, where A is a symmetric positive definite matrix. Rather than computing residuals directly, it updates…

Numerical Analysis · Mathematics · 2025-11-19 · Thomas Bake, Erin Carson, Yuxin Ma

In container terminal yards, the Container Rehandling Problem (CRP) involves rearranging containers between stacks under specific operational rules, and it is a pivotal optimization challenge in intelligent container scheduling systems.…

Artificial Intelligence · Computer Science · 2025-04-22 · Ruoqi Wang, Jiawei Li

This paper presents a novel reinforcement learning (RL) framework for dynamically optimizing numerical precision in the preconditioned conjugate gradient (CG) method. By modeling precision selection as a Markov Decision Process (MDP), we…

Machine Learning · Computer Science · 2025-06-09 · Xinye Chen

One of the major challenges in using extreme scale systems efficiently is to mitigate the impact of faults. Application-level checkpoint/restart (CR) methods provide the best trade-off between productivity, robustness, and performance.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-07-02 · Marcos Maroñas, Sergi Mateo, Kai Keller, Leonardo Bautista-Gomez, Eduard Ayguadé, Vicenç Beltran

Even though iterative solvers like the Conjugate Gradients method (CG) have been studied for over fifty years, fault tolerance for such solvers has seen much attention in recent years. For iterative solvers, two major reliable strategies of…

Distributed, Parallel, and Cluster Computing · Computer Science · 2016-05-17 · Kiril Dichev, Dimitrios S. Nikolopoulos

We introduce two new stochastic conjugate frameworks for a class of nonconvex and possibly also nonsmooth optimization problems. These frameworks are built upon Stochastic Recursive Gradient Algorithm (SARAH) and we thus refer to them as…

Optimization and Control · Mathematics · 2023-10-23 · Jiangshan Wang, Zheng Peng

In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-08-08 · Faisal Shahzad, Jonas Thies, Moritz Kreutzer, Thomas Zeiser, Georg Hager, Gerhard Wellein

Current Retrieval-Augmented Generation (RAG) systems typically employ a traditional two-stage pipeline: an embedding model for initial retrieval followed by a reranker for refinement. However, this paradigm suffers from significant…

Computation and Language · Computer Science · 2026-01-14 · Haowen Hou, Jie Yang

Stochastic nested optimization, including stochastic compositional, min-max and bilevel optimization, is gaining popularity in many machine learning applications. While the three problems share the nested structure, existing works often…

Machine Learning · Statistics · 2021-06-28 · Tianyi Chen, Yuejiao Sun, Wotao Yin