Related papers: The spatial computer: A model for energy-efficient…
Contemporary accelerator designs exhibit a high degree of spatial localization, wherein two-dimensional physical distance determines communication costs between processing elements. This situation presents considerable algorithmic…
In the research area of parallel computation, the communication cost has been extensively studied, while the IO cost has been neglected. For big data computation, the assumption that the data fits in main memory no longer holds, and…
Large-scale graph processing has drawn great attention in recent years. Most modern datacenter workloads, such as MapReduce, can be represented in the form of graph processing. Consequently, a lot of designs for Domain-Specific…
Parallel applications are often unable to take full advantage of emerging parallel architectures due to scaling limitations, which arise due to inter-process communication. Performance models are used to analyze the sources of communication…
In this paper we study the tradeoff between parallelism and communication cost in a map-reduce computation. For any problem that is not "embarrassingly parallel," the finer we partition the work of the reducers so that more parallelism can…
Energy efficiency is a key requirement in the design of wireless sensor networks. While most theoretical studies only account for the energy requirements of communication, the sensing process, which includes measurements and compression,…
The cost of data movement on parallel systems varies greatly with machine architecture, job partition, and nearby jobs. Performance models that accurately capture the cost of data movement provide a tool for analysis, allowing for…
We consider a number of fundamental statistical and graph problems in the message-passing model, where we have $k$ machines (sites), each holding a piece of data, and the machines want to jointly solve a problem defined on the union of the…
We present a computational algorithm for computing short range forces between particles. The algorithm has two distinguishing features. First, it is optimized for multi-processor computers, and will use as many processors as are available.…
Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for accelerator architectures, for parallel workloads such as Deep Learning (DL), because of its capability to achieve massive data parallelism at a low area overhead and…
The simulation of large ensembles of particles is usually parallelized by partitioning the domain spatially and using message passing to communicate between the processes handling neighboring subdomains. The particles are represented as…
Scaling neural network models has delivered dramatic quality gains across ML problems. However, this scaling has increased the reliance on efficient distributed training techniques. Accordingly, as with other distributed computing…
Spatial networks are a powerful framework for studying a large variety of systems belonging to a broad diversity of contexts: from transportation to biology, from epidemiology to communications, and migrations, to cite a few. Spatial…
Due to rapid data growth, statistical analysis of massive datasets often has to be carried out in a distributed fashion, either because several datasets stored in separate physical locations are all relevant to a given problem, or simply to…
Sequential computation is well understood but does not scale well with current technology. Within the next decade, systems will contain large numbers of processors with potentially thousands of processors per chip. Despite this, many…
Power management is an expensive and important issue for large computational infrastructures such as datacenters, large clusters, and computational grids. However, measuring energy consumption of scalable systems may be impractical due to…
We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes…
As vision-based robots navigate larger environments, their spatial memory grows without bound, eventually exhausting computational resources, particularly on embedded platforms (8-16 GB shared memory, $<30$ W) where adding hardware is not an…
The energy footprint of global data movement has surpassed 100 terawatt hours, costing more than 20 billion US dollars to the world economy. Depending on the number of switches, routers, and hubs between the source and destination nodes,…
The sparse matrix-vector multiply (SpMV) operation is a key computational kernel in many simulations and linear solvers. The large communication requirements associated with a reference implementation of a parallel SpMV result in poor…