Related papers: SMaLL: A Software Framework for portable Machine L…

DLL: A Blazing Fast Deep Neural Network Library

Deep Learning Library (DLL) is a new library for machine learning with deep neural networks that focuses on speed. It supports feed-forward neural networks such as fully-connected Artificial Neural Networks (ANNs) and Convolutional Neural…

Machine Learning · Computer Science 2018-04-15 Baptiste Wicht, Jean Hennebert, Andreas Fischer

MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices

Large language models (LLMs) have demonstrated exceptional performance across a variety of tasks. However, their substantial scale leads to significant computational resource consumption during inference, resulting in high costs.…

Machine Learning · Computer Science 2025-06-13 Zhaode Wang, Jingbang Yang, Xinyu Qian, Shiwen Xing, Xiaotang Jiang, Chengfei Lv, Shengyu Zhang

DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion

Deep Neural Networks (DNNs) have emerged as the core enabler of many major applications on mobile devices. To achieve high accuracy, DNN models have become increasingly deep with hundreds or even thousands of operator layers, leading to…

Machine Learning · Computer Science 2021-12-02 Wei Niu, Jiexiong Guan, Yanzhi Wang, Gagan Agrawal, Bin Ren

TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems

Deep learning inference on embedded devices is a burgeoning field with myriad applications because tiny embedded devices are omnipresent. But we must overcome major challenges before we can benefit from this opportunity. Embedded processors…

Machine Learning · Computer Science 2021-03-16 Robert David, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian Li, Nick Kreeger, Ian Nappier, Meghna Natraj, Shlomi Regev, Rocky Rhodes, Tiezhen Wang, Pete Warden

Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning

To break the bottlenecks of mainstream cloud-based machine learning (ML) paradigm, we adopt device-cloud collaborative ML and build the first end-to-end and general-purpose system, called Walle, as the foundation. Walle consists of a…

Machine Learning · Computer Science 2022-05-31 Chengfei Lv, Chaoyue Niu, Renjie Gu, Xiaotang Jiang, Zhaode Wang, Bin Liu, Ziqi Wu, Qiulin Yao, Congyu Huang, Panos Huang, Tao Huang, Hui Shu, Jinde Song, Bin Zou, Peng Lan, Guohuan Xu, Fei Wu, Shaojie Tang, Fan Wu, Guihai Chen

RAMAN: A Re-configurable and Sparse tinyML Accelerator for Inference on Edge

Deep Neural Network (DNN) based inference at the edge is challenging as these compute and data-intensive algorithms need to be implemented at low cost and low power while meeting the latency constraints of the target applications. Sparsity,…

Neural and Evolutionary Computing · Computer Science 2023-06-13 Adithya Krishna, Srikanth Rohit Nudurupati, Chandana D G, Pritesh Dwivedi, André van Schaik, Mahesh Mehendale, Chetan Singh Thakur

26ms Inference Time for ResNet-50: Towards Real-Time Execution of all DNNs on Smartphone

With the rapid emergence of a spectrum of high-end mobile devices, many applications that required desktop-level computation capability formerly can now run on these devices without any problem. However, without a careful optimization,…

Machine Learning · Computer Science 2019-05-03 Wei Niu, Xiaolong Ma, Yanzhi Wang, Bin Ren

Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks

Advancing research in the emerging field of deep graph learning requires new tools to support tensor computation over graphs. In this paper, we present the design principles and implementation of Deep Graph Library (DGL). DGL distills the…

Machine Learning · Computer Science 2020-08-26 Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, Zheng Zhang

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like CuBLAS and NCCL provide optimized primitives, they lack the flexibility required for…

Programming Languages · Computer Science 2026-05-06 Size Zheng, Xuegui Zheng, Hanshi Sun, Qi Hou, Wenlei Bao, Shiyu Li, Haojie Duanmu, Jin Fang, Chenli Xue, Chenhui Huang, Yuanqiang Liu, Renze Chen, Ningxin Zheng, Dongyang Wang, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, Xin Liu

ScaleDL: Towards Scalable and Efficient Runtime Prediction for Distributed Deep Learning Workloads

Deep neural networks (DNNs) form the cornerstone of modern AI services, supporting a wide range of applications, including autonomous driving, chatbots, and recommendation systems. As models increase in size and complexity, DNN workloads…

Machine Learning · Computer Science 2025-11-14 Xiaokai Wang, Shaoyuan Huang, Yuting Li, Xiaofei Wang

SCAN: A Scalable Neural Networks Framework Towards Compact and Efficient Models

Remarkable achievements have been attained by deep neural networks in various applications. However, the increasing depth and width of such models also lead to explosive growth in both storage and computation, which has restricted the…

Machine Learning · Computer Science 2019-06-11 Linfeng Zhang, Zhanhong Tan, Jiebo Song, Jingwei Chen, Chenglong Bao, Kaisheng Ma

SLIM: A Heterogeneous Accelerator for Edge Inference of Sparse Large Language Model via Adaptive Thresholding

Large language models (LLMs) have demonstrated exceptional proficiency in understanding and generating human language, but efficient inference on resource-constrained embedded devices remains challenging due to large model sizes and…

Hardware Architecture · Computer Science 2025-07-15 Weihong Xu, Haein Choi, Po-kai Hsu, Shimeng Yu, Tajana Rosing

DiviML: A Module-based Heuristic for Mapping Neural Networks onto Heterogeneous Platforms

Datacenters are increasingly becoming heterogeneous, and are starting to include specialized hardware for networking, video processing, and especially deep learning. To leverage the heterogeneous compute capability of modern datacenters, we…

Machine Learning · Computer Science 2023-08-03 Yassine Ghannane, Mohamed S. Abdelfattah

Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead

Currently, Machine Learning (ML) is becoming ubiquitous in everyday life. Deep Learning (DL) is already present in many applications ranging from computer vision for medicine to autonomous driving of modern cars as well as other sectors in…

Hardware Architecture · Computer Science 2020-12-22 Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique

ScaleHLS: A New Scalable High-Level Synthesis Framework on Multi-Level Intermediate Representation

High-level synthesis (HLS) has been widely adopted as it significantly improves the hardware design productivity and enables efficient design space exploration (DSE). Existing HLS tools are built using compiler infrastructures largely based…

Programming Languages · Computer Science 2021-12-23 Hanchen Ye, Cong Hao, Jianyi Cheng, Hyunmin Jeong, Jack Huang, Stephen Neuendorffer, Deming Chen

Multi-Agent Reinforcement Learning for Sample-Efficient Deep Neural Network Mapping

Mapping deep neural networks (DNNs) to hardware is critical for optimizing latency, energy consumption, and resource utilization, making it a cornerstone of high-performance accelerator design. Due to the vast and complex mapping space,…

Machine Learning · Computer Science 2025-07-23 Srivatsan Krishnan, Jason Jabbour, Dan Zhang, Natasha Jaques, Aleksandra Faust, Shayegan Omidshafiei, Vijay Janapa Reddi

An Open-Source Framework for Efficient Numerically-Tailored Computations

We present a versatile open-source framework designed to facilitate efficient, numerically-tailored Matrix-Matrix Multiplications (MMMs). The framework offers two primary contributions: first, a fine-tuned, automated pipeline for arithmetic…

Mathematical Software · Computer Science 2024-06-06 Louis Ledoux, Marc Casas

Compilation and Optimizations for Efficient Machine Learning on Embedded Systems

Deep Neural Networks (DNNs) have achieved great success in a variety of machine learning (ML) applications, delivering high-quality inferencing solutions in computer vision, natural language processing, and virtual reality, etc. However,…

Machine Learning · Computer Science 2022-08-29 Xiaofan Zhang, Yao Chen, Cong Hao, Sitao Huang, Yuhong Li, Deming Chen

Snap ML: A Hierarchical Framework for Machine Learning

We describe a new software framework for fast training of generalized linear models. The framework, named Snap Machine Learning (Snap ML), combines recent advances in machine learning systems and algorithms in a nested manner to reflect the…

Machine Learning · Computer Science 2018-11-30 Celestine Dünner, Thomas Parnell, Dimitrios Sarigiannis, Nikolas Ioannou, Andreea Anghel, Gummadi Ravi, Madhusudanan Kandasamy, Haralampos Pozidis

Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration

The research interest in specialized hardware accelerators for deep neural networks (DNN) spikes recently owing to their superior performance and efficiency. However, today's DNN accelerators primarily focus on accelerating specific…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-11 Cong Guo, Yangjie Zhou, Jingwen Leng, Yuhao Zhu, Zidong Du, Quan Chen, Chao Li, Bin Yao, Minyi Guo