
MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers

Machine Learning 2024-10-24 v1 Artificial Intelligence

Abstract

In this paper, we propose MCUBERT to enable language models like BERT on tiny microcontroller units (MCUs) through network and scheduling co-optimization. We observe that the embedding table is the major storage bottleneck for tiny BERT models. Hence, at the network level, we propose an MCU-aware two-stage neural architecture search algorithm based on clustered low-rank approximation for embedding compression. To reduce the inference memory requirements, we further propose a novel fine-grained MCU-friendly scheduling strategy. Through careful computation tiling and re-ordering as well as kernel design, we drastically increase the input sequence lengths supported on MCUs without any latency or accuracy penalty. MCUBERT reduces the parameter size of BERT-tiny and BERT-mini by 5.7× and 3.0× and the execution memory by 3.5× and 4.3×, respectively. MCUBERT also achieves a 1.5× latency reduction. For the first time, MCUBERT enables lightweight BERT models to run on commodity MCUs and to process more than 512 tokens with less than 256KB of memory.
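To make the storage argument concrete, the following is a minimal sketch of the parameter arithmetic behind a clustered low-rank embedding, not the authors' implementation. It assumes the vocabulary is split into clusters and that each cluster's n × hidden block is replaced by an n × r factor and an r × hidden factor. The vocabulary size and hidden width follow the commonly published BERT-tiny configuration (30522 and 128), while the cluster sizes and ranks are hypothetical values chosen only for illustration. The abstract does not state how clusters or ranks are chosen, and the numbers produced here are not the paper's results.

```python
def dense_params(vocab_size, hidden):
    """Parameters in an uncompressed vocab_size x hidden embedding table."""
    return vocab_size * hidden


def clustered_low_rank_params(cluster_sizes, ranks, hidden):
    """Parameters when each cluster of rows is factorized separately.

    A cluster with n rows and rank r stores an (n x r) factor plus an
    (r x hidden) factor, i.e. n * r + r * hidden parameters.
    """
    if len(cluster_sizes) != len(ranks):
        raise ValueError("need exactly one rank per cluster")
    return sum(n * r + r * hidden for n, r in zip(cluster_sizes, ranks))


# Assumed BERT-tiny-like configuration (not stated in the abstract).
VOCAB_SIZE, HIDDEN = 30522, 128

# Hypothetical clustering: a few important tokens keep a higher rank,
# the long tail of the vocabulary gets a lower rank.
cluster_sizes = [2000, 8000, 20522]
ranks = [64, 32, 8]
assert sum(cluster_sizes) == VOCAB_SIZE

dense = dense_params(VOCAB_SIZE, HIDDEN)
compressed = clustered_low_rank_params(cluster_sizes, ranks, HIDDEN)
print(f"dense: {dense}, clustered low-rank: {compressed}, "
      f"ratio: {dense / compressed:.2f}x")
```

The per-cluster ranks are the design knob in this sketch: raising a cluster's rank trades storage for reconstruction fidelity, which is the kind of trade-off an architecture search could explore.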


Cite

@article{arxiv.2410.17957,
  title  = {MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers},
  author = {Zebin Yang and Renze Chen and Taiqiang Wu and Ngai Wong and Yun Liang and Runsheng Wang and Ru Huang and Meng Li},
  journal= {arXiv preprint arXiv:2410.17957},
  year   = {2024}
}

Comments

ICCAD 2024
