Related papers: MAPA: Multi-Accelerator Pattern Allocation Policy …
In this paper, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of…
To leverage data and computation capabilities of mobile devices, machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models, resulting in the new paradigm of edge learning. In this paper,…
Despite the increasing adoption of Field-Programmable Gate Arrays (FPGAs) in compute clouds, there remains a significant gap in programming tools and abstractions which can leverage network-connected, cloud-scale, multi-die FPGAs to…
Large-scale computing systems are increasingly using accelerators such as GPUs to enable peta- and exa-scale levels of compute to meet the needs of Machine Learning (ML) and scientific computing applications. Given the widespread and…
A large proportion of jobs submitted to modern computing clusters and data centers are parallelizable and capable of running on a flexible number of computing cores or servers. Although allocating more servers to such a job results in a…
Attention-based Transformers have revolutionized natural language processing (NLP) and shown strong performance in computer vision (CV) tasks. However, as the input sequence varies, the computational bottlenecks in Transformer models…
Along with the fast evolution of deep neural networks, the hardware system is also developing rapidly. As a promising solution achieving high scalability and low manufacturing cost, multi-accelerator systems widely exist in data centers,…
Multi-agent path planning is a challenging problem with numerous real-life applications. Running a centralized search such as A* in the combined state space of all units is complete and cost-optimal, but scales poorly, as the state space…
Image coding for multi-task applications, catering to both human perception and machine vision, has been extensively investigated. Existing methods often rely on multiple task-specific encoder-decoder pairs, leading to high overhead of…
We propose a distributed planning method with asynchronous execution for multi-agent pickup and delivery (MAPD) problems for environments with occasional delays in agents' activities and flexible endpoints. MAPD is a crucial problem…
The increasing complexity of transformer models in artificial intelligence expands their computational costs, memory usage, and energy consumption. Hardware acceleration tackles the ensuing challenges by designing processors and…
Nowadays the number of available processing cores within computing nodes which are used in recent clustered environments, are growing up with a rapid rate. Despite this trend, the number of available network interfaces in such computing…
As the number of resources on chip multiprocessors (CMPs) increases, the complexity of how to best allocate these resources increases drastically. Because the higher number of applications makes the interaction and impacts of various memory…
To address the rising demand for strong packet delivery guarantees in networking, we study a novel way to perform graph resource allocation. We first introduce allocation graphs, in which nodes can independently set local resource limits…
Communication and topology aware process mapping is a powerful approach to reduce communication time in parallel applications with known communication patterns on large, distributed memory systems. We address the problem as a quadratic…
Process mapping asks to assign vertices of a task graph to processing elements of a supercomputer such that the computational workload is balanced while the communication cost is minimized. Motivated by the recent success of GPU-based graph…
Many HPC applications suffer from a bottleneck in the shared caches, instruction execution units, I/O or memory bandwidth, even though the remaining resources may be underutilized. It is hard for developers and runtime systems to ensure…
To keep up with increasing dataset sizes and model complexity, distributed training has become a necessity for large machine learning tasks. Parameter servers ease the implementation of distributed parameter management---a key concern in…
Recent innovations in Transformer-based large language models have significantly advanced the field of general-purpose neural language understanding and generation. With billions of trainable parameters, deployment of these large models…
Multi-agent navigation in dynamic environments is of great industrial value when deploying a large scale fleet of robot to real-world applications. This paper proposes a decentralized partially observable multi-agent path planning with…