Related papers: Deep Recommender Models Inference: Automatic Asymm…

DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud

Deep learning recommendation models (DLRM) rely on large embedding tables to manage categorical sparse features. Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU/CPU/memory usage.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-07-01 · Qinlong Wang, Tingfeng Lan, Yinghao Tang, Ziling Huang, Yiheng Du, Haitao Zhang, Jian Sha, Hui Lu, Yuanchun Zhou, Ke Zhang, Mingjie Tang
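The DLRover-RM abstract ties model quality to embedding-table size and ties table size to memory cost. The Python below is a toy sketch of that relationship and does not come from DLRover-RM. It sum-pools the embedding rows selected by one multi-hot categorical feature, and it estimates the footprint of a dense table. The class `EmbeddingTable`, the helper `table_bytes`, the fp32 element size, and all table sizes are hypothetical.

```python
import random
from typing import List, Sequence


class EmbeddingTable:
    """Toy dense embedding table: num_rows vectors of width dim (hypothetical sizes)."""

    def __init__(self, num_rows: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.dim = dim
        self.rows: List[List[float]] = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)
        ]

    def bag_lookup(self, ids: Sequence[int]) -> List[float]:
        """Sum-pool the rows selected by one sparse (multi-hot) categorical feature."""
        pooled = [0.0] * self.dim
        for row_id in ids:
            row = self.rows[row_id]  # one scattered memory access per id
            for k in range(self.dim):
                pooled[k] += row[k]
        return pooled


def table_bytes(num_rows: int, dim: int, bytes_per_elem: int = 4) -> int:
    """Memory footprint of a dense table, assuming fp32 (4-byte) elements by default."""
    return num_rows * dim * bytes_per_elem


if __name__ == "__main__":
    table = EmbeddingTable(num_rows=1000, dim=8)
    print(table.bag_lookup([3, 17, 256]))
    # 10 million rows x 128 dims x 4 bytes = 5.12 GB for a single table
    print(table_bytes(10_000_000, 128))  # 5120000000
```

The footprint scales linearly with both the row count and the embedding width. Doubling either one doubles `table_bytes`, and that is the cost the abstract weighs against the gain from larger tables.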

Understanding Capacity-Driven Scale-Out Neural Recommendation Inference

Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes--that load entire models to a single server--are unable to support this scale. One approach to support this scale is with distributed serving,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-11-13 · Michael Lui, Yavuz Yetim, Özgür Özkan, Zhuoran Zhao, Shin-Yeh Tsai, Carole-Jean Wu, Mark Hempstead

UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture

Deep Learning Recommendation Models (DLRMs) have gained popularity in recommendation systems due to their effectiveness in handling large-scale recommendation tasks. The embedding layers of DLRMs have become the performance bottleneck due…

Information Retrieval · Computer Science · 2024-10-10 · Sitian Chen, Haobin Tan, Amelie Chi Zhou, Yusen Li, Pavan Balaji

Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs

Personalized recommendation is a ubiquitous application on the internet, with many industries and hyperscalers extensively leveraging Deep Learning Recommendation Models (DLRMs) for their personalization needs (like ad serving or movie…

Hardware Architecture · Computer Science · 2024-10-30 · Rishabh Jain, Vivek M. Bhasi, Adwait Jog, Anand Sivasubramaniam, Mahmut T. Kandemir, Chita R. Das

Mem-Rec: Memory Efficient Recommendation System using Alternative Representation

Deep learning-based recommendation systems (e.g., DLRMs) are widely used AI models to provide high-quality personalized recommendations. Training data used for modern recommendation systems commonly includes categorical features taking on…

Information Retrieval · Computer Science · 2026-01-06 · Gopi Krishna Jha, Anthony Thomas, Nilesh Jain, Sameh Gobriel, Tajana Rosing, Ravi Iyer

Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training

The increasing complexity of deep learning recommendation models (DLRM) has led to a growing need for large-scale distributed systems that can efficiently train vast amounts of data. In DLRM, the sparse embedding table is a crucial…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-08-07 · Xin Zhang, Quanyu Zhu, Liangbei Xu, Zain Huda, Wang Zhou, Jin Fang, Dennis van der Staay, Yuxi Hu, Jade Nie, Jiyan Yang, Chunzhi Yang

Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures

During the last two years, the goal of many researchers has been to squeeze the last bit of performance out of HPC systems for AI tasks. Often this discussion is held in the context of how fast ResNet50 can be trained. Unfortunately,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-05-12 · Dhiraj Kalamkar, Evangelos Georganas, Sudarshan Srinivasan, Jianping Chen, Mikhail Shiryaev, Alexander Heinecke

A Frequency-aware Software Cache for Large Recommendation System Embeddings

Deep learning recommendation models (DLRMs) have been widely applied in Internet companies. The embedding tables of DLRMs are too large to fit entirely in GPU memory. We propose a GPU-based software cache approach to dynamically manage…

Information Retrieval · Computer Science · 2022-08-11 · Jiarui Fang, Geng Zhang, Jiatong Han, Shenggui Li, Zhengda Bian, Yongbin Li, Jin Liu, Yang You
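The software-cache abstract is cut off before it describes the design. The following is therefore a generic sketch of the idea named in the title and is not the authors' implementation. A small, fixed-capacity cache stands in for GPU memory and keeps the embedding rows with the highest observed access counts. Every other row is read from a larger host-side table. The class `FrequencyAwareCache` and its admit-only-if-hotter eviction rule are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


class FrequencyAwareCache:
    """Toy frequency-aware cache for embedding rows (hypothetical design).

    `host_rows` stands in for the full table in CPU memory; `cached` stands in
    for a small GPU-resident buffer holding at most `capacity` rows.
    """

    def __init__(self, host_rows: List[List[float]], capacity: int) -> None:
        self.host_rows = host_rows
        self.capacity = capacity
        self.cached: Dict[int, List[float]] = {}
        self.freq: Counter = Counter()  # access count per row id
        self.hits = 0
        self.misses = 0

    def lookup(self, row_id: int) -> List[float]:
        self.freq[row_id] += 1
        if row_id in self.cached:
            self.hits += 1
            return self.cached[row_id]

        self.misses += 1
        row = self.host_rows[row_id]  # slow path: read from host memory
        if len(self.cached) < self.capacity:
            self.cached[row_id] = row
        elif self.cached:
            # Replace the least-frequently-accessed cached row only if the
            # requested row has now been accessed more often than it.
            coldest = min(self.cached, key=lambda r: self.freq[r])
            if self.freq[row_id] > self.freq[coldest]:
                del self.cached[coldest]
                self.cached[row_id] = row
        return row


if __name__ == "__main__":
    host = [[float(i)] for i in range(100)]
    cache = FrequencyAwareCache(host, capacity=2)
    for r in [1, 1, 1, 2, 2, 3, 1, 2, 3, 4]:
        cache.lookup(r)
    print(cache.hits, cache.misses)  # 5 5
    print(sorted(cache.cached))  # [1, 2]
```

With this trace, the frequently accessed rows 1 and 2 stay resident, and the colder rows 3 and 4 are served from the host table. A policy like this only helps when accesses are skewed toward a small set of hot rows.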

MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions

Deep neural networks are widely used in personalized recommendation systems. Unlike regular DNN inference workloads, recommendation inference is memory-bound due to the many random memory accesses needed to look up the embedding tables. The…

Hardware Architecture · Computer Science · 2021-02-22 · Wenqi Jiang, Zhenhao He, Shuai Zhang, Thomas B. Preußer, Kai Zeng, Liang Feng, Jiansong Zhang, Tongxuan Liu, Yong Li, Jingren Zhou, Ce Zhang, Gustavo Alonso

Automated Deep Neural Network Inference Partitioning for Distributed Embedded Systems

Distributed systems can be found in various applications, e.g., in robotics or autonomous driving, to achieve higher flexibility and robustness. Thereby, data flow centric applications such as Deep Neural Network (DNN) inference benefit…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-10-14 · Fabian Kreß, El Mahdi El Annabi, Tim Hotfilter, Julian Hoefer, Tanja Harbaum, Juergen Becker

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-01-01 · Liu Ke, Udit Gupta, Carole-Jean Wu, Benjamin Youngjae Cho, Mark Hempstead, Brandon Reagen, Xuan Zhang, David Brooks, Vikas Chandra, Utku Diril, Amin Firoozshahian, Kim Hazelwood, Bill Jia, Hsien-Hsin S. Lee, Meng Li, Bert Maher, Dheevatsa Mudigere, Maxim Naumov, Martin Schatz, Mikhail Smelyanskiy, Xiaodong Wang

Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments

The deployment of large-scale models, such as large language models (LLMs), incurs substantial costs due to their computational demands. To mitigate these costs and address challenges related to scalability and data security, there is a…

Machine Learning · Computer Science · 2025-08-14 · Yipeng Du, Zihao Wang, Ahmad Farhan, Claudio Angione, Harry Yang, Fielding Johnston, James P. Buban, Patrick Colangelo, Yue Zhao, Yuzhe Yang

Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems

With the rapid growth of Internet services, recommendation systems play a central role in delivering personalized content. Faced with massive user requests and complex model architectures, the key challenge for real-time recommendation…

Information Retrieval · Computer Science · 2025-08-14 · Junli Shao, Jing Dong, Dingzhou Wang, Kowei Shih, Dannier Li, Chengrui Zhou

HE-LRM: Efficient Private Embedding Lookups for Neural Inference Using Fully Homomorphic Encryption

Fully Homomorphic Encryption (FHE) allows for computation directly on encrypted data and enables privacy-preserving neural inference in the cloud. Prior work has focused on models with dense inputs (e.g., CNNs), with less attention given to…

Cryptography and Security · Computer Science · 2026-02-23 · Karthik Garimella, Austin Ebel, Gabrielle De Micheli, Brandon Reagen

DIPPM: a Deep Learning Inference Performance Predictive Model using Graph Neural Networks

Deep Learning (DL) has developed to become a corner-stone in many everyday applications that we are now relying on. However, making sure that the DL model uses the underlying hardware efficiently takes a lot of effort. Knowledge about…

Performance · Computer Science · 2023-03-22 · Karthick Panner Selvam, Mats Brorsson

Deep Learning Inference Frameworks Benchmark

Deep learning (DL) has been widely adopted in recent years, but it is a computing-intensive method. Therefore, scientists have proposed diverse optimizations to accelerate their predictions for end-user applications. However, no single inference…

Machine Learning · Computer Science · 2022-10-11 · Pierrick Pochelu

Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments

The deployment of large-scale models, such as large language models (LLMs) and sophisticated image generation systems, incurs substantial costs due to their computational demands. To mitigate these costs and address challenges related to…

Machine Learning · Computer Science · 2024-10-30 · Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo

Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf DLRM Model: 1000× Compression and 3.1× Faster Inference

Deep learning for recommendation data is one of the most pervasive and challenging AI workloads in recent times. State-of-the-art recommendation models are among the largest models, matching the likes of GPT-3 and Switch Transformer.…

Information Retrieval · Computer Science · 2022-01-25 · Aditya Desai, Li Chou, Anshumali Shrivastava

Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators

DL inference queries play an important role in diverse internet services and a large fraction of datacenter cycles are spent on processing DL inference queries. Specifically, the matrix-matrix multiplication (GEMM) operations of…

Hardware Architecture · Computer Science · 2020-12-02 · Benjamin Y. Cho, Jeageun Jung, Mattan Erez

Supporting Massive DLRM Inference Through Software Defined Memory

Deep Learning Recommendation Models (DLRM) are widespread, account for a considerable data center footprint, and grow by more than 1.5x per year. With model size soon to be in the terabyte range, leveraging Storage Class Memory (SCM) for…

Hardware Architecture · Computer Science · 2021-11-10 · Ehsan K. Ardestani, Changkyu Kim, Seung Jae Lee, Luoshang Pan, Valmiki Rampersad, Jens Axboe, Banit Agrawal, Fuxun Yu, Ansha Yu, Trung Le, Hector Yuen, Shishir Juluri, Akshat Nanda, Manoj Wodekar, Dheevatsa Mudigere, Krishnakumar Nair, Maxim Naumov, Chris Peterson, Mikhail Smelyanskiy, Vijay Rao