MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers

Zebin Yang; Renze Chen; Taiqiang Wu; Ngai Wong; Yun Liang; Runsheng Wang; Ru Huang; Meng Li

doi:10.1145/3676536.3676747

Machine Learning 2024-10-24 v1 Artificial Intelligence

Abstract

In this paper, we propose MCUBERT to enable language models like BERT on tiny microcontroller units (MCUs) through network and scheduling co-optimization. We observe that the embedding table is the major storage bottleneck for tiny BERT models. Hence, at the network level, we propose an MCU-aware two-stage neural architecture search algorithm based on clustered low-rank approximation for embedding compression. To reduce the inference memory requirements, we further propose a novel fine-grained MCU-friendly scheduling strategy. Through careful computation tiling and re-ordering as well as kernel design, we drastically increase the input sequence lengths supported on MCUs without any latency or accuracy penalty. MCUBERT reduces the parameter size of BERT-tiny and BERT-mini by 5.7$\times$ and 3.0$\times$ and the execution memory by 3.5$\times$ and 4.3$\times$, respectively. MCUBERT also achieves a 1.5$\times$ latency reduction. For the first time, MCUBERT enables lightweight BERT models to run on commodity MCUs and to process more than 512 tokens with less than 256KB of memory.
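To see why a clustered low-rank approximation can shrink the embedding table, the following is a minimal parameter-accounting sketch, not the authors' search algorithm. It assumes the vocabulary is split into clusters and each cluster's block of the table is stored as two factors of a chosen rank. The vocabulary size (30522) and hidden size (128) are the standard BERT-tiny values; the cluster sizes and ranks are hypothetical and chosen only for illustration, as are the function names.

```python
def dense_params(vocab_size, hidden_size):
    """Parameters in an uncompressed embedding table (vocab_size x hidden_size)."""
    return vocab_size * hidden_size


def clustered_params(clusters, hidden_size):
    """Parameters when each cluster's block is stored as two low-rank factors.

    clusters: list of (num_tokens, rank) pairs. A block of shape
    (num_tokens x hidden_size) is replaced by factors of shape
    (num_tokens x rank) and (rank x hidden_size).
    """
    return sum(n * r + r * hidden_size for n, r in clusters)


# Standard BERT-tiny dimensions.
VOCAB_SIZE, HIDDEN_SIZE = 30522, 128

# Hypothetical clustering: a small cluster kept at a higher rank,
# and larger clusters kept at lower ranks.
clusters = [(2000, 64), (8000, 32), (20522, 8)]
assert sum(n for n, _ in clusters) == VOCAB_SIZE

dense = dense_params(VOCAB_SIZE, HIDDEN_SIZE)
compressed = clustered_params(clusters, HIDDEN_SIZE)
print(f"dense: {dense}, clustered low-rank: {compressed}, "
      f"ratio: {dense / compressed:.2f}x")
```

The ratio printed here depends entirely on the assumed cluster sizes and ranks and is not a result from the paper; in MCUBERT these choices are the output of the MCU-aware architecture search.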

Keywords

BERT; processing-in-memory; memory hierarchy

Cite

@article{arxiv.2410.17957,
  title  = {MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers},
  author = {Zebin Yang and Renze Chen and Taiqiang Wu and Ngai Wong and Yun Liang and Runsheng Wang and Ru Huang and Meng Li},
  journal= {arXiv preprint arXiv:2410.17957},
  year   = {2024}
}

Comments

ICCAD 2024
