Related papers: Warp-Level Parallelism: Enabling Multiple Replicat…
The definition of a Neural Network architecture is one of the most critical and challenging tasks to perform. In this paper, we propose ParallelMLPs. ParallelMLPs is a procedure to enable the training of several independent Multilayer…
Contemporary GPUs are designed to handle long-latency operations effectively; however, challenges such as core occupancy (number of warps in a core) and pipeline width can impede their latency management. This is particularly evident in…
Simulators are a primary tool in computer architecture research but are extremely computationally intensive. Simulating modern architectures with increased core counts and recent workloads can be challenging, even on modern hardware. This…
Standard gradient-based iteration algorithms for optimization, such as gradient descent and its various proximal-based extensions to nonsmooth problems, are known to converge slowly for ill-conditioned problems, sometimes requiring many…
Domain-specific languages that execute image processing pipelineson GPUs, such as Halide and Forma, operate by 1) dividing the image into overlapped tiles, and 2) fusing loops to improve memory locality. However, current approaches have…
Large-scale atomistic simulations are essential to bridge computational materials and chemistry to realistic materials and drug discovery applications. In the past few years, rapid developments of machine learning interatomic potentials…
Witnessing the advancing scale and complexity of chip design and benefiting from high-performance computation technologies, the simulation of Very Large Scale Integration (VLSI) Circuits imposes an increasing requirement for acceleration…
Modern out-of-order processors have increased capacity to exploit instruction level parallelism (ILP) and memory level parallelism (MLP), e.g., by using wide superscalar pipelines and vector execution units, as well as deep buffers for…
Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context…
We propose XPipe, an efficient asynchronous pipeline model parallelism approach for multi-GPU DNN training. XPipe is designed to use multiple GPUs to concurrently and continuously train different parts of a DNN model. To improve GPU…
Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…
Among the many possible approaches for the parallelization of self-organizing networks, and in particular of growing self-organizing networks, perhaps the most common one is producing an optimized, parallel implementation of the standard…
State-machine replication, a fundamental approach to designing fault-tolerant services, requires commands to be executed in the same order by all replicas. Moreover, command execution must be deterministic: each replica must produce the…
The rapid adoption of large language models (LLMs) has shifted a substantial portion of inference workloads into throughput-oriented offline regimes, where fully utilizing GPU compute requires large batch sizes. However, existing…
As CPU clock speeds have stagnated and high performance computers continue to have ever higher core counts, increased parallelism is needed to take advantage of these new architectures. Traditional serial time-marching schemes can be a…
Multi-core machines are ubiquitous. However, most inductive logic programming (ILP) approaches use only a single core, which severely limits their scalability. To address this limitation, we introduce parallel techniques based on…
Artificial neural networks are a popular and effective machine learning technique. Great progress has been made parallelizing the expensive training phase of an individual network, leading to highly specialized pieces of hardware, many…
To achieve high performance on modern computers, it is vital to map algorithmic parallelism to that inherent in the hardware. From an application developer's perspective, it is also important that code can be maintained in a portable manner…
Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations,…
Graphics Processing Units (GPUs) are widely used by various applications in a broad variety of fields to accelerate their computation but remain susceptible to transient hardware faults (soft errors) that can easily compromise application…