Related papers: Optimizing Large-Scale ODE Simulations
The task of integrating a large number of independent ODE systems arises in various scientific and engineering areas. For nonstiff systems, common explicit integration algorithms can be used on GPUs, where individual GPU threads…
Quantum circuit simulation is crucial for the development of quantum algorithms, particularly given the high cost and noise limitations of physical quantum hardware. While full-state quantum circuit simulation is commonly employed for…
Efficient ordinary differential equation solvers for chemical kinetics must take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical…
Mixed-precision methods combine low and high precision arithmetics to exploit low precision computational speed and high precision accuracy. Large ODE systems that contain many heterogeneous interactions lead to a high computational cost…
The emergence of Big Data in recent years has resulted in a growing need for efficient data processing solutions. While infrastructures with sufficient compute power are available, the I/O bottleneck remains. The Linux page cache is an…
Quantum computing has garnered attention for its potential to solve complex computational problems with considerable speedup. Despite notable advancements in the field, achieving meaningful scalability and noise control in quantum hardware…
The classical simulation of quantum algorithms is a crucial tool for circuit development, testing, and validation. Although acceleration using GPUs significantly reduces simulation time, most high-performance simulators rely on…
Two factors, which affect simulation quality are the amount of computing power and implementation. The Streaming SIMD (single instruction multiple data) extensions (SSE) present a technique for influencing both by exploiting the processor's…
A general purpose, modular program package for the integration of large number of independent ordinary differential equation systems capable of using professional graphics cards is presented. The available numerical schemes are the explicit…
We propose an online auto-tuning approach for computing kernels. Differently from existing online auto-tuners, which regenerate code with long compilation chains from the source to the binary code, our approach consists on deploying…
CUDA kernel optimization has become a critical bottleneck for AI performance, as deep learning training and inference efficiency directly depends on highly optimized GPU kernels. Despite the promise of Large Language Models (LLMs) for…
With the growing number of data-intensive workloads, GPU, which is the state-of-the-art single-instruction-multiple-thread (SIMT) processor, is hindered by the memory bandwidth wall. To alleviate this bottleneck, previously proposed…
Quantum computers are becoming practical for computing numerous applications. However, simulating quantum computing on classical computers is still demanding yet useful because current quantum computers are limited because of computer…
There exist many Runge-Kutta methods (explicit or implicit), more or less adapted to specific problems. Some of them have interesting properties, such as stability for stiff problems or symplectic capability for problems with energy…
The supercomputing platforms available for high performance computing based research evolve at a great rate. However, this rapid development of novel technologies requires constant adaptations and optimizations of the existing codes for…
We focus on implementing and optimizing a sixth-order finite-difference solver for simulating compressible fluids on a GPU using third-order Runge-Kutta integration. Since graphics processing units perform well in data-parallel tasks, this…
Simulation code for conventional supercomputers serves as a reference for neuromorphic computing systems. The present bottleneck of distributed large-scale spiking neuronal network simulations is the communication between compute nodes.…
One of the main purposes of discrete event simulators such as OMNeT++ is to test new algorithms or protocols in realistic environments. These often need to be benchmarked against optimal/theoretical results obtained by running commercial…
Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query…
As recurrent neural networks become larger and deeper, training times for single networks are rising into weeks or even months. As such there is a significant incentive to improve the performance and scalability of these networks. While…