Related papers: Bitnet.cpp: Efficient Edge Inference for Ternary L…

200 papers

Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs in terms of speed and energy consumption. These developments also enable local LLM…

Computation and Language · Computer Science 2024-10-24 Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei

Ternary quantization has emerged as a powerful technique for reducing both computational and memory footprint of large language models (LLM), enabling efficient real-time inference deployment without significantly compromising model…

Hardware Architecture · Computer Science 2025-09-18 Zhirui Huang, Rui Ma, Shijie Cao, Ran Shu, Ian Wang, Ting Cao, Chixiao Chen, Yongqiang Xiong

With the emergence of wearable devices and other embedded systems, deploying large language models (LLMs) on edge platforms has become an urgent need. However, this is challenging because of their high computational and memory demands.…

Hardware Architecture · Computer Science 2025-10-22 Ye Qiao, Zhiheng Chen, Yifan Zhang, Yian Wang, Sitao Huang

Despite their tremendous success and versatility, Deep Neural Networks (DNNs) such as Large Language Models (LLMs) suffer from inference inefficiency and rely on advanced computational infrastructure. To address these challenges and make…

Machine Learning · Computer Science 2025-05-05 Mohsen Dehghankar, Mahdi Erfanian, Abolfazl Asudeh

The deployment of Large Language Models (LLMs) on edge devices is increasingly important to enhance on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-26 Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang

Deploying Large Language Models (LLMs) on resource-constrained edge devices faces critical bottlenecks in memory bandwidth and power consumption. While ternary quantization (e.g., BitNet b1.58) significantly reduces model size, its direct…

Hardware Architecture · Computer Science 2026-05-05 Zi-Wei Lin, Tian-Sheuan Chang

Deploying large language models (LLMs) on edge platforms is challenged by their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as little as 1.58 bits with…

Hardware Architecture · Computer Science 2025-04-28 Ye Qiao, Zhiheng Chen, Yifan Zhang, Yian Wang, Sitao Huang

Large language models (LLMs) face significant inference latency due to inefficiencies in GEMM operations, weight access, and KV cache access, especially in real-time scenarios. This highlights the need for a versatile compute-memory…

Hardware Architecture · Computer Science 2025-09-15 Huizheng Wang, Zichuan Wang, Zhiheng Yue, Yousheng Long, Taiquan Wei, Jianxun Yang, Yang Wang, Chao Li, Shaojun Wei, Yang Hu, Shouyi Yin

We present VitaLLM, a mixed precision accelerator that enables ternary weight large language models to run efficiently on edge devices. The design combines two compute cores, a multiplier free TINT core for ternary-INT projections and a…

Hardware Architecture · Computer Science 2026-05-04 Zi-Wei Lin, Tian-Sheuan Chang

Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary…

Computation and Language · Computer Science 2024-02-28 Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei

Deploying Large Language Models (LLMs) on edge devices remains challenging due to their quadratically increasing computations with the sequence length. Existing studies for dynamic attention pruning are designed for hardware with massively…

Artificial Intelligence · Computer Science 2025-07-29 Jiawen Qi, Chang Gao, Zhaochun Ren, Qinyu Chen

Ternary weight quantization (e.g., BitNet b1.58) offers a promising path to mitigate the memory bandwidth bottleneck in Large Language Model (LLM) inference. However, conventional compute platforms lack native support for ternary-weight…

Hardware Architecture · Computer Science 2026-04-29 Robin Geens, Joran Heldens, Joren Dumoulin, Marian Verhelst

As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits the practical deployment. While Multi-Token Prediction (MTP) has…

Machine Learning · Computer Science 2025-09-24 Yuxuan Cai, Xiaozhuan Liang, Xinghua Wang, Jin Ma, Haijin Liang, Jinwen Luo, Xinyu Zuo, Lisheng Duan, Yuyang Yin, Xi Chen

Trajectory prediction is a fundamental task for autonomous systems, requiring complex reasoning about multi-agent interactions and intents. Large language models (LLMs) have recently been adopted for this task, as they provide strong…

Artificial Intelligence · Computer Science 2026-05-29 Mincheol Kang, Hyunjin Lim, Bomin Kang, Daehee Park

Large Language Model (LLM) inference becomes resource-intensive, prompting a shift toward low-bit model weights to reduce the memory footprint and improve efficiency. Such low-bit LLMs necessitate the mixed-precision matrix multiplication…

Hardware Architecture · Computer Science 2025-07-29 Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang

Recent research on the 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling…

Computation and Language · Computer Science 2024-11-08 Hongyu Wang, Shuming Ma, Furu Wei

With the booming of Large Language Models (LLMs), prompt-learning has become a promising method mainly researched in various research areas. Recently, many attempts based on prompt-learning have been made to improve the performance of text…

Computation and Language · Computer Science 2024-06-07 Chun Liu, Hongguang Zhang, Kainan Zhao, Xinghai Ju, Lin Yang

We present ComplexityNet, a streamlined language model designed for assessing task complexity. This model predicts the likelihood of accurate output by various language models, each with different capabilities. Our initial application of…

Computation and Language · Computer Science 2024-10-16 Henry Bae, Aghyad Deeb, Alex Fleury, Kehang Zhu

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor…

Machine Learning · Computer Science 2025-03-14 Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

The proliferation of large language models (LLMs) is accelerating the integration of multimodal assistants into edge devices, where inference is executed under stringent latency and energy constraints, often exacerbated by intermittent…

Hardware Architecture · Computer Science 2026-01-29 Yanru Chen, Runyang Tian, Yue Pan, Zheyu Li, Weihong Xu, Tajana Rosing