Related papers: LFM2 Technical Report

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2…

Computation and Language · Computer Science 2026-03-20 Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale Deployment

Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware…

Machine Learning · Computer Science 2026-04-29 Hanxian Huang, Igor Fedorov, Andrey Gromov, Bernard Beckerman, Naveen Suda, David Eriksson, Maximilian Balandat, Rylan Conway, Patrick Huber, Chinnadhurai Sankar, Ayushi Dalmia, Zechun Liu, Lemeng Wu, Tarek Elgamal, Adithya Sagar, Vikas Chandra, Raghuraman Krishnamoorthi

PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing

While scaling laws have been continuously validated in large language models (LLMs) with increasing model parameters, the inherent tension between the inference demands of LLMs and the limited resources of edge devices poses a critical…

Computation and Language · Computer Science 2025-03-20 Cheng Deng, Luoyang Sun, Jiwen Jiang, Yongcheng Zeng, Xinjian Wu, Wenxin Zhao, Qingfa Xiao, Jiachuan Wang, Haoyang Li, Lei Chen, Lionel M. Ni, Haifeng Zhang, Jun Wang

MiniCPM4: Ultra-Efficient LLMs on End Devices

This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data,…

Computation and Language · Computer Science 2025-09-05 MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengda Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Baoxi Ji, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Xin Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Lushi Pu, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fangzheng Wang, Feng Wang, Shuo Wang, Yudong Wang, Zheng Wang, Yesai Wu, Zhenyu Xiao, Jie Xie, Zihao Xie, Xiaoyue Xu, Yukun Yan, Jiarui Yuan, Jinqian Zhang, Kaihuo Zhang, Lei Zhang, Linyue Zhang, Xueren Zhang, Yudi Zhang, Hengyu Zhao, Weilin Zhao, Weilun Zhao, Yuanqian Zhao, Zhi Zheng, Chuyue Zhou, Ge Zhou, Jie Zhou, Wei Zhou, Yanghao Zhou, Zihan Zhou, Zixuan Zhou, Zhiyuan Liu, Guoyang Zeng, Chao Jia, Dahai Li, Maosong Sun

Fast On-device LLM Inference with NPUs

On-device inference for Large Language Models (LLMs), driven by increasing privacy concerns and advancements of mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) encounter unacceptably…

Artificial Intelligence · Computer Science 2024-12-17 Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu

F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining,…

Computation and Language · Computer Science 2025-10-03 Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

EdgeProfiler: A Fast Profiling Framework for Lightweight LLMs on Edge Using Analytical Model

This paper introduces EdgeProfiler, a fast profiling framework designed for evaluating lightweight Large Language Models (LLMs) on edge systems. While LLMs offer remarkable capabilities in natural language understanding and generation,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-18 Alyssa Pinnock, Shakya Jayakody, Kawsher A Roxy, Md Rubel Ahmed

lm-Meter: Unveiling Runtime Inference Latency for On-Device Language Models

Large Language Models (LLMs) are increasingly integrated into everyday applications, but their prevalent cloud-based deployment raises growing concerns around data privacy and long-term sustainability. Running LLMs locally on mobile and…

Machine Learning · Computer Science 2025-10-08 Haoxin Wang, Xiaolong Tu, Hongyu Ke, Huirong Chai, Dawei Chen, Kyungtae Han

TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs

With the emergence of wearable devices and other embedded systems, deploying large language models (LLMs) on edge platforms has become an urgent need. However, this is challenging because of their high computational and memory demands.…

Hardware Architecture · Computer Science 2025-10-22 Ye Qiao, Zhiheng Chen, Yifan Zhang, Yian Wang, Sitao Huang

TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs

Deploying large language models (LLMs) on edge platforms is challenged by their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as little as 1.58 bits with…

Hardware Architecture · Computer Science 2025-04-28 Ye Qiao, Zhiheng Chen, Yifan Zhang, Yian Wang, Sitao Huang

SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment

While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed -…

Machine Learning · Computer Science 2025-07-31 Yixin Song, Zhenliang Xue, Dongliang Wei, Feiyang Chen, Jianxiang Gao, Junchen Liu, Hangyu Liang, Guangshuo Qin, Chengrong Tian, Bo Wen, Longyu Zhao, Xinrui Zheng, Zeyu Mi, Haibo Chen

MobileLLM-Pro Technical Report

Efficient on-device language models around 1 billion parameters are essential for powering low-latency AI applications on mobile and wearable devices. However, achieving strong performance in this model class, while supporting long context…

Machine Learning · Computer Science 2025-11-11 Patrick Huber, Ernie Chang, Wei Wen, Igor Fedorov, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, Alex Gladkov, Kai Sheng Tai, Abdelrahman Elogeel, Tarek Hefny, Vikas Chandra, Ahmed Aly, Anuj Kumar, Raghuraman Krishnamoorthi, Adithya Sagar

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal…

Machine Learning · Computer Science 2026-05-19 Xinting Jiang, Junyi Luo, Ruichen Qi, Kauna Lei, Ben Laurie, Gregory Kielian, Mehdi Saligane

Tele-FLM Technical Report

Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on…

Computation and Language · Computer Science 2024-04-26 Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Chao Wang, Xinzhang Liu, Zihan Wang, Yu Zhao, Xin Wang, Yuyao Huang, Shuangyong Song, Yongxiang Li, Zheng Zhang, Bo Zhao, Aixin Sun, Yequan Wang, Zhongjiang He, Zhongyuan Wang, Xuelong Li, Tiejun Huang

Efficient LLM inference solution on Intel GPU

Transformer-based Large Language Models (LLMs) have been widely used in many fields, and the efficiency of LLM inference has become a hot topic in real applications. However, LLMs usually have a complicated model structure with…

Hardware Architecture · Computer Science 2024-06-25 Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Peng Zhao

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

The Large Language Model (LLM) is widely employed for tasks such as intelligent assistants, text summarization, translation, and multi-modality on mobile phones. However, the current methods for on-device LLM deployment maintain slow…

Computation and Language · Computer Science 2024-07-08 Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie

LIME: Accelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices

Large language models (LLMs) have emerged as a powerful foundation for intelligent reasoning and decision-making, demonstrating substantial impact across a wide range of domains and applications. However, their massive parameter scales and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-29 Mingyu Sun, Xiao Zhang, Shen Qu, Yan Li, Mengbai Xiao, Yuan Yuan, Dongxiao Yu

2 OLMo 2 Furious

We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data,…

Computation and Language · Computer Science 2025-10-09 Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Jake Poznanski, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

Extremely low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2 bits and even at 4 bits (e.g., MXFP4). We present SignRoundV2, a post-training…

Computation and Language · Computer Science 2026-05-19 Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen, Zaner Ma

Lightweight Transformer Architectures for Edge Devices in Real-Time Applications

The deployment of transformer-based models on resource-constrained edge devices represents a critical challenge in enabling real-time artificial intelligence applications. This comprehensive survey examines lightweight transformer…

Machine Learning · Computer Science 2026-01-08 Hema Hariharan Samson