Related papers: Optimizing Large-Scale ODE Simulations

GPU-Based Parallel Integration of Large Numbers of Independent ODE Systems

The task of integrating a large number of independent ODE systems arises in various scientific and engineering areas. For nonstiff systems, common explicit integration algorithms can be used on GPUs, where individual GPU threads…

Mathematical Software · Computer Science 2016-11-09 Kyle E Niemeyer, Chih-Jen Sung

Large-Scale Quantum Circuit Simulation on HPC Cluster via Cache Blocking, Boosting, and Gate Fusion Optimization

Quantum circuit simulation is crucial for the development of quantum algorithms, particularly given the high cost and noise limitations of physical quantum hardware. While full-state quantum circuit simulation is commonly employed for…

Quantum Physics · Physics 2026-04-15 Chuan-Chi Wang, Yan-Jie Wang, Chia-Heng Tu, Shih-Hao Hung

Accelerating finite-rate chemical kinetics with coprocessors: comparing vectorization methods on GPUs, MICs, and CPUs

Efficient ordinary differential equation solvers for chemical kinetics must take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical…

Computational Physics · Physics 2018-03-28 Christopher P. Stone, Andrew T. Alferman, Kyle E. Niemeyer

Mixed-Precision in adaptive Runge-Kutta method for large ODE systems

Mixed-precision methods combine low and high precision arithmetics to exploit low precision computational speed and high precision accuracy. Large ODE systems that contain many heterogeneous interactions lead to a high computational cost…

Numerical Analysis · Mathematics 2026-05-25 Mouhamad Al-Sayed, Samuel Bernard, Arsène Marzorati, Jonathan Rouzaud-Cornabas

Modeling the Linux page cache for accurate simulation of data-intensive applications

The emergence of Big Data in recent years has resulted in a growing need for efficient data processing solutions. While infrastructures with sufficient compute power are available, the I/O bottleneck remains. The Linux page cache is an…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-06 Hoang-Dung Do, Valerie Hayot-Sasson, Rafael Ferreira da Silva, Christopher Steele, Henri Casanova, Tristan Glatard

A Time Optimization Framework for the Implementation of Robust and Low-latency Quantum Circuits

Quantum computing has garnered attention for its potential to solve complex computational problems with considerable speedup. Despite notable advancements in the field, achieving meaningful scalability and noise control in quantum hardware…

Quantum Physics · Physics 2025-05-12 Eduardo Willwock Lussi, Rafael de Santiago, Eduardo Inacio Duzzioni

Accelerating State-Vector Quantum Simulation on Integrated GPUs via Cache Locality Optimization: A Cross-Architecture Evaluation

The classical simulation of quantum algorithms is a crucial tool for circuit development, testing, and validation. Although acceleration using GPUs significantly reduces simulation time, most high-performance simulators rely on…

Quantum Physics · Physics 2026-05-15 Gabriel Fernandes Thomaz, Jerusa Marchi, Eduarda Rodrigues Monteiro, Fernando Augusto Caletti de Barros, Evandro Chagas Ribeiro da Rosa

Optimizing the performance of Lattice Gauge Theory simulations with Streaming SIMD extensions

Two factors that affect simulation quality are the amount of computing power and the implementation. The Streaming SIMD (single instruction multiple data) extensions (SSE) present a technique for influencing both by exploiting the processor's…

Computational Engineering, Finance, and Science · Computer Science 2013-09-04 Shyam Srinivasan

Modular, general purpose ODE integration package to solve large number of independent ODE systems on GPUs

A general purpose, modular program package for the integration of large number of independent ordinary differential equation systems capable of using professional graphics cards is presented. The available numerical schemes are the explicit…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-10-10 Ferenc Hegedűs

Pushing the Limits of Online Auto-tuning: Machine Code Optimization in Short-Running Kernels

We propose an online auto-tuning approach for computing kernels. Unlike existing online auto-tuners, which regenerate code with long compilation chains from the source to the binary code, our approach consists of deploying…

Performance · Computer Science 2017-07-17 Fernando Endo, Damien Couroussé, Henri-Pierre Charles

EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models

CUDA kernel optimization has become a critical bottleneck for AI performance, as deep learning training and inference efficiency directly depends on highly optimized GPU kernels. Despite the promise of Large Language Models (LLMs) for…

Machine Learning · Computer Science 2025-10-07 Ping Guo, Chenyu Zhu, Siyuan Chen, Fei Liu, Xi Lin, Zhichao Lu, Qingfu Zhang

MPU: Towards Bandwidth-abundant SIMT Processor via Near-bank Computing

With the growing number of data-intensive workloads, GPU, which is the state-of-the-art single-instruction-multiple-thread (SIMT) processor, is hindered by the memory bandwidth wall. To alleviate this bottleneck, previously proposed…

Hardware Architecture · Computer Science 2021-03-12 Xinfeng Xie, Peng Gu, Yufei Ding, Dimin Niu, Hongzhong Zheng, Yuan Xie

Efficient techniques to GPU Accelerations of Multi-Shot Quantum Computing Simulations

Quantum computers are becoming practical for computing numerous applications. However, simulating quantum computing on classical computers is still demanding yet useful because current quantum computers are limited because of computer…

Quantum Physics · Physics 2023-08-08 Jun Doi, Hiroshi Horii, Christopher Wood

Runge-Kutta Theory and Constraint Programming

There exist many Runge-Kutta methods (explicit or implicit), more or less adapted to specific problems. Some of them have interesting properties, such as stability for stiff problems or symplectic capability for problems with energy…

Numerical Analysis · Mathematics 2018-04-16 Julien Alexandre dit Sandretto

Platform independent profiling of a QCD code

The supercomputing platforms available for high performance computing based research evolve at a great rate. However, this rapid development of novel technologies requires constant adaptations and optimizations of the existing codes for…

High Energy Physics - Lattice · Physics 2017-02-23 Marina Krstic Marinkovic, Luka Stanisic

Methods for compressible fluid simulation on GPUs using high-order finite differences

We focus on implementing and optimizing a sixth-order finite-difference solver for simulating compressible fluids on a GPU using third-order Runge-Kutta integration. Since graphics processing units perform well in data-parallel tasks, this…

Computational Physics · Physics 2017-07-28 Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Petri J. Käpylä, Omer Anjum

Exploiting network topology in brain-scale simulations of spiking neural networks

Simulation code for conventional supercomputers serves as a reference for neuromorphic computing systems. The present bottleneck of distributed large-scale spiking neuronal network simulations is the communication between compute nodes.…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-27 Melissa Lober, Markus Diesmann, Susanne Kunkel

Optimization in the Loop: Implementing and Testing Scheduling Algorithms with SimuLTE

One of the main purposes of discrete event simulators such as OMNeT++ is to test new algorithms or protocols in realistic environments. These often need to be benchmarked against optimal/theoretical results obtained by running commercial…

Networking and Internet Architecture · Computer Science 2015-09-14 Antonio Virdis

Strategic Scaling of Test-Time Compute: A Bandit Learning Approach

Scaling test-time compute has emerged as an effective strategy for improving the performance of large language models. However, existing methods typically allocate compute uniformly across all queries, overlooking variation in query…

Artificial Intelligence · Computer Science 2026-04-24 Bowen Zuo, Yinglun Zhu

Optimizing Performance of Recurrent Neural Networks on GPUs

As recurrent neural networks become larger and deeper, training times for single networks are rising into weeks or even months. As such there is a significant incentive to improve the performance and scalability of these networks. While…

Machine Learning · Computer Science 2016-04-08 Jeremy Appleyard, Tomas Kocisky, Phil Blunsom