Related papers: Exploring the Vision Processing Unit as Co-process…

HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing

The attention layer, a core component of Transformer-based LLMs, brings out inefficiencies in current GPU systems due to its low operational intensity and the substantial memory requirements of KV caches. We propose a High-bandwidth…

Hardware Architecture · Computer Science 2025-12-19 Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hoshik Kim

GPU coprocessors as a service for deep learning inference in high energy physics

In the next decade, the demands for computing in large scientific experiments are expected to grow tremendously. During the same time period, CPU performance increases will be limited. At the CERN Large Hadron Collider (LHC), these two…

Computational Physics · Physics 2021-04-26 Jeffrey Krupa, Kelvin Lin, Maria Acosta Flechas, Jack Dinsmore, Javier Duarte, Philip Harris, Scott Hauck, Burt Holzman, Shih-Chieh Hsu, Thomas Klijnsma, Mia Liu, Kevin Pedro, Dylan Rankin, Natchanon Suaysom, Matt Trahms, Nhan Tran

In-Datacenter Performance Analysis of a Tensor Processing Unit

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that…

Hardware Architecture · Computer Science 2017-04-18 Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon

The Recurrent Processing Unit: Hardware for High Speed Machine Learning

Machine learning applications are computationally demanding and power intensive. Hardware acceleration of these software tools is a natural step being explored using various technologies. A recurrent processing unit (RPU) is fast and…

Emerging Technologies · Computer Science 2019-12-17 Heidi Komkov, Alessandro Restelli, Brian Hunt, Liam Shaughnessy, Itamar Shani, Daniel P. Lathrop

Conceptual and Technical Challenges for High Performance Computing

High Performance Computing (HPC) aims at providing reasonably fast computing solutions to scientific and real life problems. The advent of multicore architectures is noticeable in the HPC history, because it has brought the underlying…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-07 Claude Tadonki

Fast Object Detection with a Machine Learning Edge Device

This machine learning study investigates a low-cost edge device integrated with an embedded system having computer vision and resulting in an improved performance in inferencing time and precision of object detection and classification. A…

Robotics · Computer Science 2024-10-08 Richard C. Rodriguez, Jonah Elijah P. Bardos

Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

Developing high performance embedded vision applications requires balancing run-time performance with energy constraints. Given the mix of hardware accelerators that exist for embedded computer vision (e.g. multi-core CPUs, GPUs, and…

Computer Vision and Pattern Recognition · Computer Science 2019-07-01 Murad Qasaimeh, Kristof Denolf, Jack Lo, Kees Vissers, Joseph Zambreno, Phillip H. Jones

Vis-TOP: Visual Transformer Overlay Processor

In recent years, Transformer has achieved good results in Natural Language Processing (NLP) and has also started to expand into Computer Vision (CV). Excellent models such as the Vision Transformer and Swin Transformer have emerged. At the…

Computer Vision and Pattern Recognition · Computer Science 2021-10-22 Wei Hu, Dian Xu, Zimeng Fan, Fang Liu, Yanxiang He

CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities

Recent research on vision backbone architectures has predominantly focused on optimizing efficiency for hardware platforms with high parallel processing capabilities. This category increasingly includes embedded systems such as mobile…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

Multi-user Co-inference with Batch Processing Capable Edge Server

Graphics processing units (GPUs) can improve deep neural network inference throughput via batch processing, where multiple tasks are concurrently processed. We focus on novel scenarios that the energy-constrained mobile devices offload…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-14 Wenqi Shi, Sheng Zhou, Zhisheng Niu, Miao Jiang, Lu Geng

A Performance Comparison of Different Graphics Processing Units Running Direct N-Body Simulations

Hybrid computational architectures based on the joint power of Central Processing Units and Graphic Processing Units (GPUs) are becoming popular and powerful hardware tools for a wide range of simulations in biology, chemistry, engineering,…

Instrumentation and Methods for Astrophysics · Physics 2015-06-15 Roberto Capuzzo-Dolcetta, Mario Spera

Towards High Performance Computing (Hpc) Through Parallel Programming Paradigms and Their Principles

Nowadays, we are to find out solutions to huge computing problems very rapidly. It brings the idea of parallel computing in which several machines or processors work cooperatively for computational tasks. In the past decades, there are a…

Programming Languages · Computer Science 2014-02-07 Brijender Kahanwal

IPU: Flexible Hardware Introspection Units

Modern chip designs are increasingly complex, making it difficult for developers to glean meaningful insights about hardware behavior while real workloads are running. Hardware introspection aims to solve this by enabling the hardware…

Hardware Architecture · Computer Science 2025-09-29 Ian McDougall, Shayne Wadle, Harish Batchu, Karthikeyan Sankaralingam

Visual Perception Engine: Fast and Flexible Multi-Head Inference for Robotic Vision Tasks

Deploying multiple machine learning models on resource-constrained robotic platforms for different perception tasks often results in redundant computations, large memory footprints, and complex integration challenges. In response, this work…

Robotics · Computer Science 2025-08-19 Jakub Łucki, Jonathan Becktor, Georgios Georgakis, Rob Royce, Shehryar Khattak

A Reconfigurable Vector Instruction Processor for Accelerating a Convection Parametrization Model on FPGAs

High Performance Computing (HPC) platforms allow scientists to model computationally intensive algorithms. HPC clusters increasingly use General-Purpose Graphics Processing Units (GPGPUs) as accelerators; FPGAs provide an attractive…

Hardware Architecture · Computer Science 2015-04-20 Syed Waqar Nabi, Saji N. Hameed, Wim Vanderbauwhede

High-Performance Computing with Quantum Processing Units

The prospects of quantum computing have driven efforts to realize fully functional quantum processing units (QPUs). Recent success in developing proof-of-principle QPUs has prompted the question of how to integrate these emerging processors…

Emerging Technologies · Computer Science 2015-12-10 Keith A. Britt, Travis S. Humble

Photonic tensor cores for machine learning

With an ongoing trend in computing hardware towards increased heterogeneity, domain-specific co-processors are emerging as alternatives to centralized paradigms. The tensor core unit (TPU) has shown to outperform graphic process units by…

Disordered Systems and Neural Networks · Physics 2020-11-24 Mario Miscuglio, Volker J. Sorger

On Performance Analysis of Graphcore IPUs: Analyzing Squared and Skewed Matrix Multiplication

In recent decades, High Performance Computing (HPC) has undergone significant enhancements, particularly in the realm of hardware platforms, aimed at delivering increased processing power while keeping power consumption within reasonable…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-03 S.-Kazem Shekofteh, Christian Alles, Nils Kochendörfer, Holger Fröning

Benchmarking Ultra-Low-Power μNPUs

Efficient on-device neural network (NN) inference offers predictable latency, improved privacy and reliability, and lower operating costs for vendors than cloud-based inference. This has sparked recent development of microcontroller-scale…

Machine Learning · Computer Science 2025-11-03 Josh Millar, Yushan Huang, Sarab Sethi, Hamed Haddadi, Anil Madhavapeddy

ZNNi - Maximizing the Inference Throughput of 3D Convolutional Networks on Multi-Core CPUs and GPUs

Sliding window convolutional networks (ConvNets) have become a popular approach to computer vision problems such as image segmentation, and object detection and localization. Here we consider the problem of inference, the application of a…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-21 Aleksandar Zlateski, Kisuk Lee, H. Sebastian Seung