Related papers: Efficient Data-Plane Memory Scheduling for In-Netw…

Scaling Distributed Machine Learning with In-Network Aggregation

Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-01 Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, Peter Richtárik

Optimizing Layer-Fused Scheduling of Transformer Networks on Multi-accelerator Platforms

The impact of transformer networks is booming, yet, they come with significant computational complexity. It is therefore essential to understand how to optimally map and execute these networks on modern neural processor hardware. So far,…

Hardware Architecture · Computer Science 2024-06-17 Steven Colleman, Arne Symons, Victor J. B. Jung, Marian Verhelst

SwitchDelta: Asynchronous Metadata Updating for Distributed Storage with In-Network Data Visibility

Distributed storage systems typically maintain strong consistency between data nodes and metadata nodes by adopting ordered writes: 1) first installing data; 2) then updating metadata to make data visible. We propose SwitchDelta to…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-26 Junru Li, Qing Wang, Zhe Yang, Shuo Liu, Jiwu Shu, Youyou Lu

MIND: In-Network Memory Management for Disaggregated Data Centers

Memory-compute disaggregation promises transparent elasticity, high utilization and balanced usage for resources in data centers by physically separating memory and compute into network-attached resource "blades". However, existing designs…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-22 Seung-seob Lee, Yanpeng Yu, Yupeng Tang, Anurag Khandelwal, Lin Zhong, Abhishek Bhattacharjee

SwitchAgg: A Further Step Towards In-Network Computation

Many distributed applications adopt a partition/aggregation pattern to achieve high performance and scalability. The aggregation process, which usually takes a large portion of the overall execution time, incurs large amount of network…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-09 Fan Yang, Zhan Wang, Xiaoxiao Ma, Guojun Yuan, Xuejun An

EASM: Efficiency-Aware Switch Migration for Balancing Controller Loads in Software-Defined Networking

Distributed multi-controller deployment is a promising method to achieve a scalable and reliable control plane of Software-Defined Networking (SDN). However, it brings a new challenge for balancing loads on the distributed controllers as…

Networking and Internet Architecture · Computer Science 2018-01-29 Tao Hu, Julong Lan, Jianhui Zhang, Wei Zhao

Memory-Based Set Point Modulation for Improved Transient Response of Distributed Energy Resources

As the composition of the power grid evolves to integrate more renewable generation, its reliance on distributed energy resources (DER) is increasing. Existing DERs are often controlled with proportional integral (PI) controllers that, if…

Systems and Control · Electrical Eng. & Systems 2024-05-14 Milad Beikbabaei, Brady Alexander, Ashwin Venkataramanan, Ali Mehrizi-Sani

Mesa: A Memory-saving Training Framework for Transformers

There has been an explosion of interest in designing high-performance Transformers. While Transformers have delivered significant performance improvements, training such networks is extremely memory intensive owing to storing all…

Computer Vision and Pattern Recognition · Computer Science 2022-08-30 Zizheng Pan, Peng Chen, Haoyu He, Jing Liu, Jianfei Cai, Bohan Zhuang

Improved Latency-Communication Trade-Off for Map-Shuffle-Reduce Systems with Stragglers

In a distributed computing system operating according to the map-shuffle-reduce framework, coding data prior to storage can be useful both to reduce the latency caused by straggling servers and to decrease the inter-server communication…

Information Theory · Computer Science 2018-08-22 Jingjing Zhang, Osvaldo Simeone

In-network Computation for Large-scale Federated Learning over Wireless Edge Networks

Most conventional Federated Learning (FL) models are using a star network topology where all users aggregate their local models at a single server (e.g., a cloud server). That causes significant overhead in terms of both communications and…

Information Theory · Computer Science 2022-06-30 Thinh Quang Dinh, Diep N. Nguyen, Dinh Thai Hoang, Pham Tran Vu, Eryk Dutkiewicz

Learning to Schedule: A Supervised Learning Framework for Network-Aware Scheduling of Data-Intensive Workloads

Distributed cloud environments hosting data-intensive applications often experience slowdowns due to network congestion, asymmetric bandwidth, and inter-node data shuffling. These factors are typically not captured by traditional host-level…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-21 Sankalpa Timilsina, Susmit Shannigrahi

Neutralizing Token Aggregation via Information Augmentation for Efficient Test-Time Adaptation

Test-Time Adaptation (TTA) has emerged as an effective solution for adapting Vision Transformers (ViT) to distribution shifts without additional training data. However, existing TTA methods often incur substantial computational overhead,…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Yizhe Xiong, Zihan Zhou, Yiwen Liang, Hui Chen, Zijia Lin, Tianxiang Hao, Fan Zhang, Jungong Han, Guiguang Ding

Communication-Efficient Separable Neural Network for Distributed Inference on Edge Devices

The inference of Neural Networks is usually restricted by the resources (e.g., computing power, memory, bandwidth) on edge devices. In addition to improving the hardware design and deploying efficient models, it is possible to aggregate the…

Machine Learning · Computer Science 2021-11-05 Jun-Liang Lin, Sheng-De Wang

Constrained In-network Computing with Low Congestion in Datacenter Networks

Distributed computing has become a common practice nowadays, where the recent focus has been given to the usage of smart networking devices with in-network computing capabilities. State-of-the-art switches with near-line rate computing and…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-13 Raz Segal, Chen Avin, Gabriel Scalosub

ESG: Pipeline-Conscious Efficient Scheduling of DNN Workflows on Serverless Platforms with Shareable GPUs

Recent years have witnessed increasing interest in machine learning inferences on serverless computing for its auto-scaling and cost effective properties. Existing serverless computing, however, lacks effective job scheduling methods to…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-26 Xinning Hui, Yuanchao Xu, Zhishan Guo, Xipeng Shen

Preparation Meets Opportunity: Enhancing Data Preprocessing for ML Training With Seneca

Input data preprocessing is a common bottleneck when concurrently training multimedia machine learning (ML) models in modern systems. To alleviate these bottlenecks and reduce the training time for concurrent jobs, we present Seneca, a data…

Operating Systems · Computer Science 2025-11-19 Omkar Desai, Ziyang Jiao, Shuyi Pei, Janki Bhimani, Bryan S. Kim

SATA: Sparsity-Aware Scheduling for Selective Token Attention

Transformers have become the foundation of numerous state-of-the-art AI models across diverse domains, thanks to their powerful attention mechanism for modeling long-range dependencies. However, the quadratic scaling complexity of attention…

Hardware Architecture · Computer Science 2026-01-29 Zhenkun Fan, Zishen Wan, Che-Kai Liu, Ashwin Sanjay Lele, Win-San Khwa, Bo Zhang, Meng-Fan Chang, Arijit Raychowdhury

Scalable and Efficient Intra- and Inter-node Interconnection Networks for Post-Exascale Supercomputers and Data centers

The rapid growth of data-intensive applications such as generative AI, scientific simulations, and large-scale analytics is driving modern supercomputers and data centers toward increasingly heterogeneous and tightly integrated…

Hardware Architecture · Computer Science 2025-11-07 Joaquin Tarraga-Moreno, Daniel Barley, Francisco J. Andujar Munoz, Jesus Escudero-Sahuquillo, Holger Froning, Pedro Javier Garcia, Francisco J. Quiles, Jose Duato

Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures

The increasing complexity of transformer models in artificial intelligence expands their computational costs, memory usage, and energy consumption. Hardware acceleration tackles the ensuing challenges by designing processors and…

Hardware Architecture · Computer Science 2023-12-21 Alireza Amirshahi, Giovanni Ansaloni, David Atienza

Utility Optimal Scheduling in Energy Harvesting Networks

In this paper, we show how to achieve close-to-optimal utility performance in energy harvesting networks with only finite capacity energy storage devices. In these networks, nodes are capable of harvesting energy from the environment. The…

Optimization and Control · Mathematics 2010-12-10 Longbo Huang, Michael J. Neely