Related papers: Parallelized sequential composition, pipelines, an…
Modern processors deploy a variety of weak memory models, which for efficiency reasons may execute instructions in an order different to that specified by the program text. The consequences of instruction reordering can be complex and…
Sequential computation is well understood but does not scale well with current technology. Within the next decade, systems will contain large numbers of processors with potentially thousands of processors per chip. Despite this, many…
Modern processors deploy a variety of weak memory models, which for efficiency reasons may (appear to) execute instructions in an order different to that specified by the program text. The consequences of instruction reordering can be…
Memory consistency models define the order in which accesses to shared memory in a concurrent system may be observed to occur. Such models are a necessity since program order is not a reliable indicator of execution order, due to…
The concurrent refinement algebra has been developed to support rely/guarantee reasoning about concurrent programs. The algebra supports atomic commands and defines parallel composition as a synchronous operation, as in Milner's SCCS. In…
As quantum computers continue to improve and support larger, more complex computations, smart control hardware and compilers are needed to efficiently leverage the capabilities of these systems. This paper introduces a novel approach to…
We consider a parallel computational model that consists of $P$ processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault with bounded…
The memory consistency model is a fundamental system property characterizing a multiprocessor. The relative merits of strict versus relaxed memory models have been widely debated in terms of their impact on performance, hardware complexity…
Modern shared memory multiprocessors permit reordering of memory operations for performance reasons. These reorderings are often a source of subtle bugs in programs written for such architectures. Traditional approaches to verify weak…
We develop a novel parallel resampling algorithm for fully parallelized particle filters, which is designed with GPUs (graphics processing units) or similar parallel computing devices in mind. With our new algorithm, a full cycle of…
The unknown parameters of simulation models often need to be calibrated using observed data. When simulation models are expensive, calibration is usually carried out with an emulator. The effectiveness of the calibration process can be…
Side-channel attacks are a security exploit that take advantage of information leakage. They use measurement and analysis of physical parameters to reverse engineer and extract secrets from a system. Power analysis attacks in particular,…
To improve efficiency, nearly all parallel processing units (CPUs and GPUs) implement relaxed memory models in which memory operations may be re-ordered, i.e., executed out-of-order. Prior testing work in this area found that memory…
There are billions of lines of sequential code inside nowadays' software which do not benefit from the parallelism available in modern multicore architectures. Automatically parallelizing sequential code, to promote an efficient use of the…
Memory consistency models are notorious for being difficult to define precisely, to reason about, and to verify. More than a decade of effort has gone into nailing down the definitions of the ARM and IBM Power memory models, and yet there…
Second order stationary models in time series analysis are based on the analysis of essential statistics whose computations follow a common pattern. In particular, with a map-reduce nomenclature, most of these operations can be modeled as…
Pipeline parallelism is widely used to scale the training of transformer-based large language models, various works have been done to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the…
Pipeline parallelism has been widely explored, but most existing schedules lack a systematic methodology. In this paper, we propose a framework to decompose pipeline schedules as repeating a building block, and show that the lifespan of the…
The processor accelerators are effective because they are working not (completely) on principles of stored program computers. They use some kind of parallelism, and it is rather hard to program them effectively: a parallel architecture by…
Based on the two observations that diverse applications perform better on different multicore architectures, and that different phases of an application may have vastly different resource requirements, Pal et al. proposed a novel…