Related papers: Efficient Multi-stage Inference on Tabular Data

MPC-Minimized Secure LLM Inference

Many inference services based on large language models (LLMs) pose a privacy concern, either revealing user prompts to the service or the proprietary weights to the user. Secure inference offers a solution to this problem through secure…

Cryptography and Security · Computer Science 2024-08-08 Deevashwer Rathee, Dacheng Li, Ion Stoica, Hao Zhang, Raluca Popa

Towards an Efficient ML System: Unveiling a Trade-off between Task Accuracy and Engineering Efficiency in a Large-scale Car Sharing Platform

Building on the strong performance of supervised deep neural networks, conventional procedures for developing ML systems are task-centric, aiming to maximize task accuracy. However, we scrutinized this task-centric…

Computer Vision and Pattern Recognition · Computer Science 2022-10-14 Kyung Ho Park, Hyunhee Chung, Soonwoo Kwon

Hierarchical adaptive control for real-time dynamic inference at the edge

Industrial systems increasingly depend on Machine Learning (ML), and operate on heterogeneous nodes that must satisfy tight latency, energy, and memory constraints. Dynamic ML models, which reconfigure their computational footprint at…

Machine Learning · Computer Science 2026-04-30 Francesco Daghero, Mahyar Tourchi Moghaddam, Mikkel Baun Kjærgaard

FluidML: Fast and Memory Efficient Inference Optimization

Machine learning models deployed on edge devices have enabled numerous exciting new applications, such as humanoid robots, AR glasses, and autonomous vehicles. However, the computing resources available on these edge devices are not…

Machine Learning · Computer Science 2024-11-15 Jinjie Liu, Hang Qiu

MPC-Pipe: an Efficient Pipeline Scheme for Secure Multi-party Machine Learning Inference

Multi-party computing (MPC) has been gaining popularity as a secure computing model over the past few years. However, prior works have demonstrated that MPC protocols still pay substantial performance penalties compared to plaintext,…

Cryptography and Security · Computer Science 2024-08-28 Yongqin Wang, Rachit Rajat, Murali Annavaram

Efficient Tabular Data Preprocessing of ML Pipelines

Data preprocessing pipelines, which include data decoding, cleaning, and transforming, are a crucial component of Machine Learning (ML) training. They are computationally intensive and often become a major bottleneck, due to the increasing…

Hardware Architecture · Computer Science 2024-09-24 Yu Zhu, Wenqi Jiang, Gustavo Alonso

AdaMTL: Adaptive Input-dependent Inference for Efficient Multi-Task Learning

Modern augmented reality applications require performing multiple tasks on each input frame simultaneously. Multi-task learning (MTL) represents an effective approach where multiple tasks share an encoder to extract representative features…

Computer Vision and Pattern Recognition · Computer Science 2023-04-19 Marina Neseem, Ahmed Agiza, Sherief Reda

Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing

Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks, yet their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and…

Machine Learning · Computer Science 2025-11-07 Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim, Sangseok Yun, Jae-Mo Kang

Privacy-Preserving Hierarchical Model-Distributed Inference

This paper focuses on designing a privacy-preserving Machine Learning (ML) inference protocol for a hierarchical setup, where clients own/generate data, model owners (cloud servers) have a pre-trained ML model, and edge servers perform ML…

Cryptography and Security · Computer Science 2024-09-17 Fatemeh Jafarian Dehkordi, Yasaman Keshtkarjahromi, Hulya Seferoglu

Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation

As large language models (LLMs) have shown great success in many tasks, they are used in various applications. While many works have focused on the efficiency of single-LLM applications (e.g., offloading, request scheduling, parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-24 Jingzhi Fang, Yanyan Shen, Yue Wang, Lei Chen

Communication-Computation Efficient Device-Edge Co-Inference via AutoML

Device-edge co-inference, which partitions a deep neural network between a resource-constrained mobile device and an edge server, has recently emerged as a promising paradigm to support intelligent mobile applications. To accelerate the…

Machine Learning · Computer Science 2021-09-01 Xinjie Zhang, Jiawei Shao, Yuyi Mao, Jun Zhang

LRD-MPC: Efficient MPC Inference through Low-rank Decomposition

Secure Multi-party Computation (MPC) enables untrusted parties to jointly compute a function without revealing their inputs. Its application to machine learning (ML) has gained significant attention, particularly for secure inference…

Cryptography and Security · Computer Science 2026-02-17 Tingting Tang, Yongqin Wang, Murali Annavaram

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems with compute and communication spread between CPUs, GPUs, and network devices is not only the key to optimization and…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-27 Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning

As machine learning techniques are applied to a widening range of applications, high throughput machine learning (ML) inference servers have become critical for online service applications. Such ML inference servers pose two challenges:…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-09-06 Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, Jaehyuk Huh

Efficient Inference Using Large Language Models with Limited Human Data: Fine-Tuning then Rectification

Driven by recent advances in artificial intelligence (AI), a growing literature has demonstrated the potential for using large language models (LLMs) as scalable surrogates to generate human-like responses in many business applications. Two…

Machine Learning · Computer Science 2025-12-30 Lei Wang, Zikun Ye, Jinglong Zhao

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks. However, the inference process for LLMs comes with significant computational costs. In this paper, we propose an…

Computation and Language · Computer Science 2023-05-30 Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, Yang You

Training and Serving Machine Learning Models at Scale

In recent years, Web services have become more and more intelligent (e.g., in understanding user preferences) thanks to the integration of components that rely on Machine Learning (ML). Before users can interact (inference phase) with an…

Software Engineering · Computer Science 2022-11-11 Luciano Baresi, Giovanni Quattrocchi

SampleLLM: Optimizing Tabular Data Synthesis in Recommendations

Tabular data synthesis is crucial in machine learning, yet existing general methods, primarily based on statistical or deep learning models, are highly data-dependent and often fall short in recommender systems. This limitation arises from…

Information Retrieval · Computer Science 2025-02-12 Jingtong Gao, Zhaocheng Du, Xiaopeng Li, Yichao Wang, Xiangyang Li, Huifeng Guo, Ruiming Tang, Xiangyu Zhao

Inference Performance Optimization for Large Language Models on CPUs

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry.…

Artificial Intelligence · Computer Science 2024-07-11 Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

Latency and Token-Aware Test-Time Compute

Inference-time scaling has emerged as a powerful way to improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. However, existing work on dynamic allocation for test-time compute…

Machine Learning · Computer Science 2025-09-15 Jenny Y. Huang, Mehul Damani, Yousef El-Kurdi, Ramon Astudillo, Wei Sun