English
Related papers

Related papers: Warp-Level Parallelism: Enabling Multiple Replicat…

200 papers

The definition of a Neural Network architecture is one of the most critical and challenging tasks to perform. In this paper, we propose ParallelMLPs. ParallelMLPs is a procedure to enable the training of several independent Multilayer…

Machine Learning · Computer Science 2022-06-20 Felipe Costa Farias , Teresa Bernarda Ludermir , Carmelo Jose Albanez Bastos-Filho

Contemporary GPUs are designed to handle long-latency operations effectively; however, challenges such as core occupancy (number of warps in a core) and pipeline width can impede their latency management. This is particularly evident in…

Hardware Architecture · Computer Science 2024-04-10 Diya Joseph , Juan Luis Aragón , Joan-Manuel Parcerisa , Antonio Gonzalez

Simulators are a primary tool in computer architecture research but are extremely computationally intensive. Simulating modern architectures with increased core counts and recent workloads can be challenging, even on modern hardware. This…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-27 Rodrigo Huerta , Antonio González

Standard gradient-based iteration algorithms for optimization, such as gradient descent and its various proximal-based extensions to nonsmooth problems, are known to converge slowly for ill-conditioned problems, sometimes requiring many…

Numerical Analysis · Mathematics 2026-03-24 G. H. M. Araújo , O. A. Krzysik , H. De Sterck

Domain-specific languages that execute image processing pipelineson GPUs, such as Halide and Forma, operate by 1) dividing the image into overlapped tiles, and 2) fusing loops to improve memory locality. However, current approaches have…

Programming Languages · Computer Science 2020-09-09 Abhinav Jangda , Arjun Guha

Large-scale atomistic simulations are essential to bridge computational materials and chemistry to realistic materials and drug discovery applications. In the past few years, rapid developments of machine learning interatomic potentials…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-03 Kevin Han , Bowen Deng , Amir Barati Farimani , Gerbrand Ceder

Witnessing the advancing scale and complexity of chip design and benefiting from high-performance computation technologies, the simulation of Very Large Scale Integration (VLSI) Circuits imposes an increasing requirement for acceleration…

Data Structures and Algorithms · Computer Science 2023-04-27 Weijie Fang , Yanggeng Fu , Jiaquan Gao , Longkun Guo , Gregory Gutin , Xiaoyan Zhang

Modern out-of-order processors have increased capacity to exploit instruction level parallelism (ILP) and memory level parallelism (MLP), e.g., by using wide superscalar pipelines and vector execution units, as well as deep buffers for…

Programming Languages · Computer Science 2018-07-05 Vladimir Kiriansky , Haoran Xu , Martin Rinard , Saman Amarasinghe

Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context…

Machine Learning · Computer Science 2025-04-22 Zhouyang Li , Yuliang Liu , Wei Zhang , Tailing Yuan , Bin Chen , Chengru Song , Di Zhang

We propose XPipe, an efficient asynchronous pipeline model parallelism approach for multi-GPU DNN training. XPipe is designed to use multiple GPUs to concurrently and continuously train different parts of a DNN model. To improve GPU…

Machine Learning · Computer Science 2020-11-10 Lei Guan , Wotao Yin , Dongsheng Li , Xicheng Lu

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari , Mudabir Qamar Ansari

Among the many possible approaches for the parallelization of self-organizing networks, and in particular of growing self-organizing networks, perhaps the most common one is producing an optimized, parallel implementation of the standard…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-03-31 Giacomo Parigi , Angelo Stramieri , Danilo Pau , Marco Piastra

State-machine replication, a fundamental approach to designing fault-tolerant services, requires commands to be executed in the same order by all replicas. Moreover, command execution must be deterministic: each replica must produce the…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-11-26 Parisa Jalili Marandi , Carlos Eduardo Bezerra , Fernando Pedone

The rapid adoption of large language models (LLMs) has shifted a substantial portion of inference workloads into throughput-oriented offline regimes, where fully utilizing GPU compute requires large batch sizes. However, existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-28 Alan Zhao , Cyril Y. He

As CPU clock speeds have stagnated and high performance computers continue to have ever higher core counts, increased parallelism is needed to take advantage of these new architectures. Traditional serial time-marching schemes can be a…

Numerical Analysis · Mathematics 2022-08-29 David A. Vargas , Robert D. Falgout , Stefanie Günther , Jacob B. Schroder

Multi-core machines are ubiquitous. However, most inductive logic programming (ILP) approaches use only a single core, which severely limits their scalability. To address this limitation, we introduce parallel techniques based on…

Artificial Intelligence · Computer Science 2021-09-16 Andrew Cropper , Oghenejokpeme Orhobor , Cristian Dinu , Rolf Morel

Artificial neural networks are a popular and effective machine learning technique. Great progress has been made parallelizing the expensive training phase of an individual network, leading to highly specialized pieces of hardware, many…

Numerical Analysis · Computer Science 2017-10-03 Jacob B. Schroder

To achieve high performance on modern computers, it is vital to map algorithmic parallelism to that inherent in the hardware. From an application developer's perspective, it is also important that code can be maintained in a portable manner…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-20 Alan Gray , Kevin Stratford

Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-14 Fabian Knorr , Philip Salzmann , Peter Thoman , Thomas Fahringer

Graphics Processing Units (GPUs) are widely used by various applications in a broad variety of fields to accelerate their computation but remain susceptible to transient hardware faults (soft errors) that can easily compromise application…

Software Engineering · Computer Science 2021-03-30 Lishan Yang , Bin Nie , Adwait Jog , Evgenia Smirni
‹ Prev 1 2 3 10 Next ›