Related papers: HyperParallel: A Supernode-Affinity AI Framework

HyperOffload: Graph-Driven Hierarchical Memory Management for Large Language Models on SuperNode Architectures

The rapid evolution of Large Language Models (LLMs) towards long-context reasoning and sparse architectures has pushed memory requirements far beyond the capacity of individual device HBM. While emerging supernode architectures offer…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-04 Fangxin Liu, Qinghua Zhang, Hanjing Shen, Zhibo Liang, Li Jiang, Haibing Guan, Chong Bao, Xuefeng Jin

Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach

As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and…

Machine Learning · Computer Science 2025-03-13 Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong

Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models

Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures, such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), that excel at accelerating parallel…

Neural and Evolutionary Computing · Computer Science 2023-11-09 Jan Finkbeiner, Thomas Gmeinder, Mark Pupilli, Alexander Titterton, Emre Neftci

HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

Modern Mixture-of-Experts (MoE) models increasingly rely on large-scale AI accelerator clusters for efficient training. Ascend NPUs expose heterogeneous on-chip compute resources, including matrix-oriented AIC units and vector-oriented AIV…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-25 Zewen Jin, Congkun Ai, Guangpeng Zhang, Hanbo Zhang, Haoran Wang, Shihan Xiao, Da Lei, Xuefeng Jin, Teng Su, Cheng Li

Hardware-Adaptive and Superlinear-Capacity Memristor-based Associative Memory

Brain-inspired computing aims to mimic cognitive functions like associative memory, the ability to recall complete patterns from partial cues. Memristor technology offers promising hardware for such neuromorphic systems due to its potential…

Machine Learning · Computer Science 2025-05-20 Chengping He, Mingrui Jiang, Keyi Shan, Szu-Hao Yang, Zefan Li, Shengbo Wang, Giacomo Pedretti, Jim Ignowski, Can Li

AMPNet: Asynchronous Model-Parallel Training for Dynamic Neural Networks

New types of machine learning hardware in development and entering the market hold the promise of revolutionizing deep learning in a manner as profound as GPUs. However, existing software frameworks and training algorithms for deep learning…

Machine Learning · Computer Science 2017-06-23 Alexander L. Gaunt, Matthew A. Johnson, Maik Riechert, Daniel Tarlow, Ryota Tomioka, Dimitrios Vytiniotis, Sam Webster

Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks

The employment of high-performance servers and GPU accelerators for training deep neural network models has greatly accelerated recent advances in deep learning (DL). DL frameworks, such as TensorFlow, MXNet, and Caffe2, have emerged to…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-11 Soojeong Kim, Gyeong-In Yu, Hojin Park, Sungwoo Cho, Eunji Jeong, Hyeonmin Ha, Sanha Lee, Joo Seong Jeong, Byung-Gon Chun

Hardware Acceleration for Neural Networks: A Comprehensive Survey

Neural networks have become dominant computational workloads across cloud and edge platforms, but their rapid growth in model size and deployment diversity has exposed hardware bottlenecks increasingly dominated by memory movement,…

Systems and Control · Electrical Eng. & Systems 2026-01-16 Bin Xu, Ayan Banerjee, Sandeep Gupta

SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips

The emergence of Superchips represents a significant advancement in next-generation AI hardware. These Superchips employ a tightly coupled heterogeneous architecture that integrates GPU and CPU on the same package, which offers…

Machine Learning · Computer Science 2025-09-26 Xinyu Lian, Masahiro Tanaka, Olatunji Ruwase, Minjia Zhang

Hybrid-Parallel: Achieving High Performance and Energy Efficient Distributed Inference on Robots

The rapid advancements in machine learning techniques have led to significant achievements in various real-world robotic tasks. These tasks heavily rely on fast and energy-efficient inference of deep neural network (DNN) models when…

Robotics · Computer Science 2024-05-30 Zekai Sun, Xiuxian Guan, Junming Wang, Haoze Song, Yuhao Qing, Tianxiang Shen, Dong Huang, Fangming Liu, Heming Cui

An Easy-to-use Scalable Framework for Parallel Recursive Backtracking

Supercomputers are equipped with an increasingly large number of cores to use computational power as a way of solving problems that are otherwise intractable. Unfortunately, getting serial algorithms to run in parallel to take advantage of…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-12-31 Faisal N. Abu-Khzam, Khuzaima Daudjee, Amer E. Mouawad, Naomi Nishimura

A Highly Parallel FPGA Implementation of Sparse Neural Network Training

We demonstrate an FPGA implementation of a parallel and reconfigurable architecture for sparse neural networks, capable of on-chip training and inference. The network connectivity uses pre-determined, structured sparsity to significantly…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-29 Sourya Dey, Diandian Chen, Zongyang Li, Souvik Kundu, Kuan-Wen Huang, Keith M. Chugg, Peter A. Beerel

Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

The rapid growth in machine learning models, especially in natural language processing and computer vision, has led to challenges when running these models on hardware with limited resources. This paper introduces Superpipeline, a new…

Machine Learning · Computer Science 2024-10-14 Reza Abbasi, Sernam Lim

Overmind NSA: A Unified Neuro-Symbolic Computing Architecture with Approximate Nonlinear Activations and Preemptive Memory Bypass

Neuro-symbolic AI is gaining traction in domains such as large language models, scientific discovery, and autonomous systems due to its ability to combine perception with structured reasoning. However, its deployment is often constrained by…

Hardware Architecture · Computer Science 2026-04-20 Weilun Wang, Zirui Wang, Wantong Li

A Hardware-Aware Framework for Accelerating Neural Architecture Search Across Modalities

Recent advances in Neural Architecture Search (NAS) such as one-shot NAS offer the ability to extract specialized hardware-aware sub-network configurations from a task-specific super-network. While considerable effort has been employed…

Machine Learning · Computer Science 2022-05-24 Daniel Cummings, Anthony Sarah, Sharath Nittur Sridhar, Maciej Szankin, Juan Pablo Munoz, Sairam Sundaresan

Model-Parallel Model Selection for Deep Learning Systems

As deep learning becomes more expensive, both in terms of time and compute, inefficiencies in machine learning (ML) training prevent practical usage of state-of-the-art models for most users. The newest model architectures are simply too…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-15 Kabir Nagrecha

Pre-Defined Sparse Neural Networks with Hardware Acceleration

Neural networks have proven to be extremely powerful tools for modern artificial intelligence applications, but computational and storage complexity remain limiting factors. This paper presents two compatible contributions towards reducing…

Machine Learning · Computer Science 2024-10-30 Sourya Dey, Kuan-Wen Huang, Peter A. Beerel, Keith M. Chugg

Hardware-friendly Neural Network Architecture for Neuromorphic Computing

The hardware-software co-optimization of neural network architectures is becoming a major stream of research especially due to the emergence of commercial neuromorphic chips such as the IBM TrueNorth and Intel Loihi. Development of specific…

Neural and Evolutionary Computing · Computer Science 2019-06-24 Roshan Gopalakrishnan, Yansong Chua, Ashish Jith Sreejith Kumar

ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs

The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm characterized by the sequential nature of next-token prediction. By…

Computation and Language · Computer Science 2025-08-15 Keyu Chen, Zhifeng Shen, Daohai Yu, Haoqian Wu, Wei Wen, Jianfeng He, Ruizhi Qiao, Xing Sun

Memory-Guided Unified Hardware Accelerator for Mixed-Precision Scientific Computing

Recent hardware acceleration advances have enabled powerful specialized accelerators for finite element computations, spiking neural network inference, and sparse tensor operations. However, existing approaches face fundamental limitations:…

Hardware Architecture · Computer Science 2026-01-09 Chuanzhen Wang, Leo Zhang, Eric Liu