Related papers: APT-LLM: Exploiting Arbitrary-Precision Tensor Cor…

200 papers

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor…

Machine Learning · Computer Science · 2025-03-14 · Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

Large Language Models (LLMs) have revolutionized natural language processing tasks. However, their practical application is constrained by substantial memory and computational demands. Post-training quantization (PTQ) is considered an…

Machine Learning · Computer Science · 2025-07-29 · Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei

The rapid development of large language models (LLMs) has greatly enhanced everyday applications. While many FPGA-based accelerators, with flexibility for fine-grained data control, exhibit superior speed and energy efficiency compared to…

Hardware Architecture · Computer Science · 2026-03-24 · Zifan He, Shengyu Ye, Rui Ma, Yang Wang, Jason Cong

The substantial memory bandwidth and computational demands of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine neural processing…

Hardware Architecture · Computer Science · 2026-05-05 · Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah

Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. The majority of existing approaches rely on temporal architectures that…

Machine Learning · Computer Science · 2024-04-09 · Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang

Recently, large language models (LLMs) have achieved huge success in the natural language processing (NLP) field, driving a growing demand to extend their deployment from the cloud to edge devices. However, deploying LLMs on…

Hardware Architecture · Computer Science · 2025-05-08 · Yanbiao Liang, Huihong Shi, Haikuo Shao, Zhongfeng Wang

The demand for efficient large language model (LLM) inference has propelled the development of dedicated accelerators. As accelerators are vulnerable to hardware faults due to aging, variation, etc., existing accelerator designs often…

Hardware Architecture · Computer Science · 2025-04-08 · Tong Xie, Jiawang Zhao, Zishen Wan, Zuodong Zhang, Yuan Wang, Runsheng Wang, Ru Huang, Meng Li

Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by limited precision support on…

Distributed, Parallel, and Cluster Computing · Computer Science · 2021-11-18 · Boyuan Feng, Yuke Wang, Tong Geng, Ang Li, Yufei Ding

With the widespread adoption of Large Language Models (LLMs), the demand for high-performance LLM inference services continues to grow. To meet this demand, a growing number of AI accelerators have been proposed, such as Google TPU, Huawei…

Hardware Architecture · Computer Science · 2025-10-08 · Tianhao Zhu, Dahu Feng, Erhu Feng, Yubin Xia

The rapid advancements in artificial intelligence (AI), particularly large language models (LLMs), have profoundly affected our daily work and forms of communication. However, it is still a challenge to deploy LLMs on resource-constrained…

Hardware Architecture · Computer Science · 2025-03-03 · Mingqiang Huang, Ao Shen, Kai Li, Haoxiang Peng, Boyu Li, Yupeng Su, Hao Yu

Large Language Model (LLM) inference generates one token at a time autoregressively, which exhibits notably lower operational intensity compared to earlier Machine Learning (ML) models such as encoder-only transformers and…

Hardware Architecture · Computer Science · 2025-05-06 · Yufeng Gu, Alireza Khadem, Sumanth Umesh, Ning Liang, Xavier Servot, Onur Mutlu, Ravi Iyer, Reetuparna Das

Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…

Machine Learning · Computer Science · 2024-03-05 · Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu

Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and…

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost…

Machine Learning · Computer Science · 2024-04-17 · Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cut the memory…

Machine Learning · Computer Science · 2022-11-11 · Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical amount of model parameters, which…

Machine Learning · Computer Science · 2023-12-08 · Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng

Fine-tuning and inference with large language models (LMs) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve…

Computation and Language · Computer Science · 2024-06-05 · Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao

Processing-in-DRAM (DRAM-PIM) has emerged as a promising technology for accelerating memory-intensive operations in modern applications, such as Large Language Models (LLMs). Despite its potential, current software stacks for DRAM-PIM face…

Hardware Architecture · Computer Science · 2025-06-03 · Yongwon Shin, Dookyung Kang, Hyojin Sung

The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained…

Artificial Intelligence · Computer Science · 2026-01-27 · Evangelos Georganas, Dhiraj Kalamkar, Alexander Heinecke

Artificial intelligence (AI) methods have become critical in scientific applications to help accelerate scientific discovery. Large language models (LLMs) are being considered as a promising approach to address some of the challenging…