Related papers: Bitnet.cpp: Efficient Edge Inference for Ternary L…

1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs

Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs in terms of speed and energy consumption. These developments also enable local LLM…

Computation and Language · Computer Science 2024-10-24 Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei

TENET: An Efficient Sparsity-Aware LUT-Centric Architecture for Ternary LLM Inference On Edge

Ternary quantization has emerged as a powerful technique for reducing both computational and memory footprint of large language models (LLM), enabling efficient real-time inference deployment without significantly compromising model…

Hardware Architecture · Computer Science 2025-09-18 Zhirui Huang, Rui Ma, Shijie Cao, Ran Shu, Ian Wang, Ting Cao, Chixiao Chen, Yongqiang Xiong

TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs

With the emergence of wearable devices and other embedded systems, deploying large language models (LLMs) on edge platforms has become an urgent need. However, this is challenging because of their high computational and memory demands.…

Hardware Architecture · Computer Science 2025-10-22 Ye Qiao, Zhiheng Chen, Yifan Zhang, Yian Wang, Sitao Huang

An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks

Despite their tremendous success and versatility, Deep Neural Networks (DNNs) such as Large Language Models (LLMs) suffer from inference inefficiency and rely on advanced computational infrastructure. To address these challenges and make…

Machine Learning · Computer Science 2025-05-05 Mohsen Dehghankar, Mahdi Erfanian, Abolfazl Asudeh

T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

The deployment of Large Language Models (LLMs) on edge devices is increasingly important to enhance on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-26 Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang

VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

Deploying Large Language Models (LLMs) on resource-constrained edge devices faces critical bottlenecks in memory bandwidth and power consumption. While ternary quantization (e.g., BitNet b1.58) significantly reduces model size, its direct…

Hardware Architecture · Computer Science 2026-05-05 Zi-Wei Lin, Tian-Sheuan Chang

TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs

Deploying large language models (LLMs) on edge platforms is challenged by their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as little as 1.58 bits with…

Hardware Architecture · Computer Science 2025-04-28 Ye Qiao, Zhiheng Chen, Yifan Zhang, Yian Wang, Sitao Huang

MCBP: A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness

Large language models (LLMs) face significant inference latency due to inefficiencies in GEMM operations, weight access, and KV cache access, especially in real-time scenarios. This highlights the need for a versatile compute-memory…

Hardware Architecture · Computer Science 2025-09-15 Huizheng Wang, Zichuan Wang, Zhiheng Yue, Yousheng Long, Taiquan Wei, Jianxun Yang, Yang Wang, Chao Li, Shaojun Wei, Yang Hu, Shouyi Yin

VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices

We present VitaLLM, a mixed precision accelerator that enables ternary weight large language models to run efficiently on edge devices. The design combines two compute cores, a multiplier free TINT core for ternary-INT projections and a…

Hardware Architecture · Computer Science 2026-05-04 Zi-Wei Lin, Tian-Sheuan Chang

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary…

Computation and Language · Computer Science 2024-02-28 Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei

DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference

Deploying Large Language Models (LLMs) on edge devices remains challenging due to their quadratically increasing computations with the sequence length. Existing studies for dynamic attention pruning are designed for hardware with massively…

Artificial Intelligence · Computer Science 2025-07-29 Jiawen Qi, Chang Gao, Zhaochun Ren, Qinyu Chen

Hardware Generation and Exploration of Lookup Table-Based Accelerators for 1.58-bit LLM Inference

Ternary weight quantization (e.g., BitNet b1.58) offers a promising path to mitigate the memory bandwidth bottleneck in Large Language Model (LLM) inference. However, conventional compute platforms lack native support for ternary-weight…

Hardware Architecture · Computer Science 2026-04-29 Robin Geens, Joran Heldens, Joren Dumoulin, Marian Verhelst

FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction

As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits the practical deployment. While Multi-Token Prediction (MTP) has…

Machine Learning · Computer Science 2025-09-24 Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, Xi Chen

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

Trajectory prediction is a fundamental task for autonomous systems, requiring complex reasoning about multi-agent interactions and intents. Large language models (LLMs) have recently been adopted for this task, as they provide strong…

Artificial Intelligence · Computer Science 2026-05-29 Mincheol Kang, Hyunjin Lim, Bomin Kang, Daehee Park

LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference

Large Language Model (LLM) inference becomes resource-intensive, prompting a shift toward low-bit model weights to reduce the memory footprint and improve efficiency. Such low-bit LLMs necessitate the mixed-precision matrix multiplication…

Hardware Architecture · Computer Science 2025-07-29 Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang

BitNet a4.8: 4-bit Activations for 1-bit LLMs

Recent research on the 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling…

Computation and Language · Computer Science 2024-11-08 Hongyu Wang, Shuming Ma, Furu Wei

LLMEmbed: Rethinking Lightweight LLM's Genuine Function in Text Classification

With the booming of Large Language Models (LLMs), prompt-learning has become a promising method mainly researched in various research areas. Recently, many attempts based on prompt-learning have been made to improve the performance of text…

Computation and Language · Computer Science 2024-06-07 Chun Liu, Hongguang Zhang, Kainan Zhao, Xinghai Ju, Lin Yang

ComplexityNet: Increasing LLM Inference Efficiency by Learning Task Complexity

We present ComplexityNet, a streamlined language model designed for assessing task complexity. This model predicts the likelihood of accurate output by various language models, each with different capabilities. Our initial application of…

Computation and Language · Computer Science 2024-10-16 Henry Bae, Aghyad Deeb, Alex Fleury, Kehang Zhu

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor…

Machine Learning · Computer Science 2025-03-14 Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

CHIME: Chiplet-based Heterogeneous Near-Memory Acceleration for Edge Multimodal LLM Inference

The proliferation of large language models (LLMs) is accelerating the integration of multimodal assistants into edge devices, where inference is executed under stringent latency and energy constraints, often exacerbated by intermittent…

Hardware Architecture · Computer Science 2026-01-29 Yanru Chen, Runyang Tian, Yue Pan, Zheyu Li, Weihong Xu, Tajana Rosing