Related papers: QStore: Quantization-Aware Compressed Model Storag…

NeurStore: Efficient In-database Deep Learning Model Management System

With the prevalence of in-database AI-powered analytics, there is an increasing demand for database systems to efficiently manage the ever-expanding number and size of deep learning models. However, existing database systems typically store…

Databases · Computer Science · 2025-09-16 · Siqi Xiang, Sheng Wang, Xiaokui Xiao, Cong Yue, Zhanhao Zhao, Beng Chin Ooi

FLStore: Efficient Federated Learning Storage for non-training workloads

Federated Learning (FL) is an approach for privacy-preserving Machine Learning (ML), enabling model training across multiple clients without centralized data collection. With an aggregator server coordinating training, aggregating model…

Machine Learning · Computer Science · 2025-03-04 · Ahmad Faraz Khan, Samuel Fountain, Ahmed M. Abdelmoniem, Ali R. Butt, Ali Anwar

MorphStore: Analytical Query Engine with a Holistic Compression-Enabled Processing Model

In this paper, we present MorphStore, an open-source in-memory columnar analytical query engine with a novel holistic compression-enabled processing model. Basically, compression using lightweight integer compression algorithms already…

Databases · Computer Science · 2020-04-21 · Patrick Damme, Annett Ungethüm, Johannes Pietrzyk, Alexander Krause, Dirk Habich, Wolfgang Lehner

FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference

Quantization is a powerful tool to improve large language model (LLM) inference efficiency by utilizing more energy-efficient low-precision datapaths and reducing memory footprint. However, accurately quantizing LLM weights and activations…

Hardware Architecture · Computer Science · 2025-04-22 · Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Sophia Shao, Brucek Khailany

Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance

We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference. We leverage weight normalization as a means of constraining parameters during…

Machine Learning · Computer Science · 2023-02-01 · Ian Colbert, Alessandro Pappalardo, Jakoba Petri-Koenig

QuantFace: Towards Lightweight Face Recognition by Synthetic Data Low-bit Quantization

Deep learning-based face recognition models follow the common trend in deep neural networks by utilizing full-precision floating-point networks with high computational costs. Deploying such networks in use-cases constrained by computational…

Computer Vision and Pattern Recognition · Computer Science · 2022-06-22 · Fadi Boutros, Naser Damer, Arjan Kuijper

QCore: Data-Efficient, On-Device Continual Calibration for Quantized Models -- Extended Version

We are witnessing an increasing availability of streaming data that may contain valuable information on the underlying processes. It is thus attractive to be able to deploy machine learning models on edge devices near sensors such that…

Machine Learning · Computer Science · 2024-10-22 · David Campos, Bin Yang, Tung Kieu, Miao Zhang, Chenjuan Guo, Christian S. Jensen

EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices

Large Language Models (LLMs) achieve strong performance across tasks, but face storage and compute challenges on edge devices. We propose EntroLLM, a compression framework combining mixed quantization and entropy coding to reduce storage…

Machine Learning · Computer Science · 2026-05-05 · Arnab Sanyal, Gourav Datta, Prithwish Mukherjee, Sandeep P. Chinchali, Michael Orshansky

QPART: Adaptive Model Quantization and Dynamic Workload Balancing for Accuracy-aware Edge Inference

As machine learning inferences increasingly move to edge devices, adapting to diverse computational capabilities, hardware, and memory constraints becomes more critical. Instead of relying on a pre-trained model fixed for all future…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-07-01 · Xiangchen Li, Saeid Ghafouri, Bo Ji, Hans Vandierendonck, Deepu John, Dimitrios S. Nikolopoulos

MF-QAT: Multi-Format Quantization-Aware Training for Elastic Inference

Quantization-aware training (QAT) is typically performed for a single target numeric format, while practical deployments often need to choose numerical precision at inference time based on hardware support or runtime constraints. We study…

Machine Learning · Computer Science · 2026-04-02 · Zifei Xu, Sayeh Sharify, Hesham Mostafa

Quantization-aware Matrix Factorization for Low Bit Rate Image Compression

Lossy image compression is essential for efficient transmission and storage. Traditional compression methods mainly rely on discrete cosine transform (DCT) or singular value decomposition (SVD), both of which represent image data in…

Image and Video Processing · Electrical Eng. & Systems · 2025-03-28 · Pooya Ashtari, Pourya Behmandpoor, Fateme Nateghi Haredasht, Jonathan H. Chen, Panagiotis Patrinos, Sabine Van Huffel

QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources

Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pretrained models on downstream datasets provides further significant performance gains; however,…

Computation and Language · Computer Science · 2026-03-19 · Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt Keutzer

Lossless and Near-Lossless Compression for Foundation Models

With the growth of model sizes and scale of their deployment, their sheer size burdens the infrastructure requiring more network and more storage to accommodate these. While there is a vast literature about reducing model sizes, we…

Machine Learning · Computer Science · 2024-04-24 · Moshik Hershcovitch, Leshem Choshen, Andrew Wood, Ilias Enmouri, Peter Chin, Swaminathan Sundararaman, Danny Harnik

When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models

Catastrophic forgetting poses a fundamental challenge in continual learning, particularly when models are quantized for deployment efficiency. We systematically investigate the interplay between quantization precision (FP16, INT8, INT4) and…

Machine Learning · Computer Science · 2025-12-23 · Michael S. Zhang, Rishi A. Ruia, Arnav Kewalram, Saathvik Dharmapuram, Utkarsh Sharma, Kevin Zhu

QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models

The rise of large language models (LLMs) has significantly advanced various natural language processing (NLP) tasks. However, the resource demands of these models pose substantial challenges. Structured pruning is an effective approach to…

Machine Learning · Computer Science · 2024-12-17 · Changhai Zhou, Yuhua Zhou, Shijie Han, Qian Qiao, Hongguang Li

SkyStore: Cost-Optimized Object Storage Across Regions and Clouds

Modern applications span multiple clouds to reduce costs, avoid vendor lock-in, and leverage low-availability resources in another cloud. However, standard object stores operate within a single cloud, forcing users to manually manage data…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-03-03 · Shu Liu, Xiangxi Mo, Moshik Hershcovitch, Henric Zhang, Audrey Cheng, Guy Girmonsky, Gil Vernik, Michael Factor, Tiemo Bang, Soujanya Ponnapalli, Natacha Crooks, Joseph E. Gonzalez, Danny Harnik, Ion Stoica

ZipLLM: Efficient LLM Storage via Model-Aware Synergistic Data Deduplication and Compression

Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques -- such as deduplication and…

Databases · Computer Science · 2025-11-11 · Zirui Wang, Tingfeng Lan, Zhaoyuan Su, Juncheng Yang, Yue Cheng

FineQ: Software-Hardware Co-Design for Low-Bit Fine-Grained Mixed-Precision Quantization of LLMs

Large language models (LLMs) have significantly advanced the natural language processing paradigm but impose substantial demands on memory and computational resources. Quantization is one of the most effective ways to reduce memory…

Machine Learning · Computer Science · 2025-04-29 · Xilong Xie, Liang Wang, Limin Xiao, Meng Han, Lin Sun, Shuai Zheng, Xiangrong Xu

QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models

Language Models (LLMs) are often quantized to lower precision to reduce the memory cost and latency in inference. However, quantization often degrades model performance, thus fine-tuning is required for various down-stream tasks.…

Machine Learning · Computer Science · 2025-02-19 · Jiajun Zhou, Yifan Yang, Kai Zhen, Ziyue Liu, Yequan Zhao, Ershad Banijamali, Athanasios Mouchtaris, Ngai Wong, Zheng Zhang

QuantFace: Efficient Quantization for Face Restoration

Diffusion models have been achieving remarkable performance in face restoration. However, the heavy computations hamper the widespread adoption of these models. In this work, we propose QuantFace, a novel low-bit quantization framework for…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-24 · Jiatong Li, Libo Zhu, Haotong Qin, Jingkai Wang, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang