In this paper, we propose MCUBERT to enable language models like BERT on tiny microcontroller units (MCUs) through network and scheduling co-optimization. We observe that the embedding table is the major storage bottleneck for tiny BERT models. Hence, at the network level, we propose an MCU-aware two-stage neural architecture search algorithm based on clustered low-rank approximation for embedding compression. To reduce the inference memory requirements, we further propose a novel fine-grained, MCU-friendly scheduling strategy. Through careful computation tiling and re-ordering as well as kernel design, we drastically increase the input sequence lengths supported on MCUs without any latency or accuracy penalty. MCUBERT reduces the parameter size of BERT-tiny and BERT-mini by 5.7× and 3.0× and the execution memory by 3.5× and 4.3×, respectively. MCUBERT also achieves a 1.5× latency reduction. For the first time, MCUBERT enables lightweight BERT models to run on commodity MCUs and to process more than 512 tokens with less than 256KB of memory.
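To make the embedding bottleneck concrete, the sketch below counts parameters for a dense embedding table and for a clustered low-rank factorization, in which each cluster of vocabulary rows is stored as an (n x r) matrix times an (r x hidden) matrix. It is only an illustration under stated assumptions: the vocabulary size (30522) and hidden size (128) are those of the public BERT-tiny checkpoint, while the cluster sizes, the ranks, and the function names are hypothetical and are not taken from the paper or its search algorithm.

```python
# Illustrative parameter count for a clustered low-rank embedding table.
# Cluster sizes and ranks are hypothetical, NOT values reported by MCUBERT.


def dense_params(vocab_size: int, hidden: int) -> int:
    """Parameters in an uncompressed vocab_size x hidden embedding table."""
    return vocab_size * hidden


def clustered_low_rank_params(clusters: list[tuple[int, int]], hidden: int) -> int:
    """Parameters when each cluster is factored as (n x r) @ (r x hidden).

    `clusters` holds (rows_in_cluster, rank) pairs.
    """
    return sum(n * r + r * hidden for n, r in clusters)


VOCAB, HIDDEN = 30522, 128  # BERT-tiny WordPiece vocabulary and hidden size

# Hypothetical split: a small cluster kept at high rank,
# and progressively larger clusters kept at lower rank.
CLUSTERS = [(2000, 64), (8000, 16), (20522, 4)]
assert sum(n for n, _ in CLUSTERS) == VOCAB  # every token is in one cluster

dense = dense_params(VOCAB, HIDDEN)
compressed = clustered_low_rank_params(CLUSTERS, HIDDEN)

print(f"dense embedding params:      {dense}")
print(f"clustered low-rank params:   {compressed}")
print(f"embedding compression ratio: {dense / compressed:.1f}x")
```

With these made-up clusters the ratio applies to the embedding table alone; the 5.7× and 3.0× figures in the abstract refer to whole-model parameter size and come from the paper's own searched configuration, which this sketch does not reproduce.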
@article{arxiv.2410.17957,
title = {MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers},
author = {Zebin Yang and Renze Chen and Taiqiang Wu and Ngai Wong and Yun Liang and Runsheng Wang and Ru Huang and Meng Li},
journal = {arXiv preprint arXiv:2410.17957},
year = {2024}
}