Related papers: CITER: Collaborative Inference for Efficient Large…

SATER: A Self-Aware and Token-Efficient Approach to Routing and Cascading

Large language models (LLMs) demonstrate remarkable performance across diverse tasks, yet their effectiveness frequently depends on costly commercial APIs or cloud services. Model selection thus entails a critical trade-off between…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-08 Yuanzhe Shen, Yide Liu, Zisu Huang, Ruicheng Yin, Xiaoqing Zheng, Xuanjing Huang

RelayLLM: Efficient Reasoning via Collaborative Decoding

Using Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative…

Computation and Language · Computer Science 2026-01-09 Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang

Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance

Large language models (LLMs) are known for their exceptional performance across a range of natural language processing tasks, but their deployment comes at a high computational and financial cost. On the other hand, smaller language models…

Computation and Language · Computer Science 2024-09-24 Adarsh MS, Jithin VG, Ditto PS

Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference

The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively…

Computation and Language · Computer Science 2025-10-17 Chao Han, Yijuan Liang, Zihao Xuan, Daokuan Wu, Wei Zhang, Xiaoyu Shen

Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models

The complementary potential of Large Language Models (LLM) assumes off-the-shelf LLMs have heterogeneous expertise in a wide range of domains and tasks so that an ensemble of LLMs can achieve consistently better performance. Existing…

Computation and Language · Computer Science 2023-11-16 Keming Lu, Hongyi Yuan, Runji Lin, Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou

LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification

Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token…

Computation and Language · Computer Science 2026-05-11 Xuan Li, Yining Wang, Yuchen Liu, Guanjun Liu, Delai Qiu, Shengping Liu, Jiaen Liang, Wei Huang, Jun Yu, Junnan Zhu

Token Level Routing Inference System for Edge Devices

The computational complexity of large language model (LLM) inference significantly constrains their deployment efficiency on edge devices. In contrast, small language models offer faster decoding and lower resource consumption but often…

Computation and Language · Computer Science 2025-04-11 Jianshu She, Wenhao Zheng, Zhengzhong Liu, Hongyi Wang, Eric Xing, Huaxiu Yao, Qirong Ho

Fast Thinking for Large Language Models

Reasoning-oriented Large Language Models (LLMs) often rely on generating explicit tokens step by step, and their effectiveness typically hinges on large-scale supervised fine-tuning or reinforcement learning. While Chain-of-Thought (CoT)…

Computation and Language · Computer Science 2025-09-30 Haoyu Zheng, Zhuonan Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Zheqi Lv, Juncheng Li, Siliang Tang, Yueting Zhuang, Hongyang He

Confidence-Guided Stepwise Model Routing for Cost-Efficient Reasoning

Recent advances in Large Language Models (LLMs) - particularly model scaling and test-time techniques - have greatly enhanced the reasoning capabilities of language models at the expense of higher inference costs. To lower inference costs,…

Computation and Language · Computer Science 2025-11-21 Sangmook Lee, Dohyung Kim, Hyukhun Koh, Nakyeong Yang, Kyomin Jung

Efficient Inference for Large Reasoning Models: A Survey

Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models (LLMs) by learning to reason, exhibiting promising performance in solving complex tasks. However, their deliberative reasoning process leads…

Computation and Language · Computer Science 2025-08-14 Yue Liu, Jiaying Wu, Yufei He, Ruihan Gong, Jun Xia, Liang Li, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, Bryan Hooi, Stan Z. Li, Keqin Li

Beyond Test-Time Compute Strategies: Advocating Energy-per-Token in LLM Inference

Large Language Models (LLMs) demonstrate exceptional performance across diverse tasks but come with substantial energy and computational costs, particularly in request-heavy scenarios. In many real-world applications, the full scale and…

Computation and Language · Computer Science 2026-03-24 Patrick Wilhelm, Thorsten Wittkopp, Odej Kao

ICL-Router: In-Context Learned Model Representations for LLM Routing

Large language models (LLMs) often exhibit complementary strengths. Model routing harnesses these strengths by dynamically directing each query to the most suitable model, given a candidate model pool. However, routing performance relies on…

Machine Learning · Computer Science 2025-11-17 Chenxu Wang, Hao Li, Yiqun Zhang, Linyao Chen, Jianhao Chen, Ping Jian, Peng Ye, Qiaosheng Zhang, Shuyue Hu

RACER: Risk-Aware Calibrated Efficient Routing for Large Language Models

Efficiently routing queries to the optimal large language model (LLM) is crucial for optimizing the cost-performance trade-off in multi-model systems. However, most existing routers rely on single-model selection, making them susceptible to…

Machine Learning · Computer Science 2026-03-10 Sai Hao, Hao Zeng, Hongxin Wei, Bingyi Jing

RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment

Large Language Models (LLMs) have achieved remarkable performance in Machine Translation (MT), but deploying them at scale remains prohibitively expensive. A widely adopted remedy is the hybrid system paradigm, which balances cost and…

Computation and Language · Computer Science 2026-04-27 Yingfeng Luo, Hongyu Liu, Dingyang Lin, Kaiyan Chang, Chenglong Wang, Bei Li, Quan Du, Tong Xiao, Jingbo Zhu

HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning

Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose…

Computation and Language · Computer Science 2025-11-14 Nikunj Gupta, Bill Guo, Rajgopal Kannan, Viktor K. Prasanna

Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing

Large language models (LLMs) excel in most NLP tasks but also require expensive cloud servers for deployment due to their size, while smaller models that can be deployed on lower cost (e.g., edge) devices, tend to lag behind in terms of…

Machine Learning · Computer Science 2024-04-24 Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah

TensorOpera Router: A Multi-Model Router for Efficient LLM Inference

With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each possessing domain-specific expertise. This proliferation has highlighted the need for quick, high-quality, and cost-effective…

Artificial Intelligence · Computer Science 2024-10-25 Dimitris Stripelis, Zijian Hu, Jipeng Zhang, Zhaozhuo Xu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Salman Avestimehr, Chaoyang He

ConsRoute: Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models

Large language models (LLMs) deliver impressive capabilities but incur substantial inference latency and cost, which hinders their deployment in latency-sensitive and resource-constrained scenarios. Cloud-edge-device collaborative inference…

Artificial Intelligence · Computer Science 2026-03-24 Haoyu Qiao, Hao Zhang, Shanwen Mao, Siyao Cheng, Jie Liu

When to Reason: Semantic Router for vLLM

Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token…

Emerging Technologies · Computer Science 2025-10-13 Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen

SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models

Large Language Models (LLMs) present a critical trade-off between inference quality and computational cost: larger models offer superior capabilities but incur significant latency, while smaller models are faster but less powerful. Existing…

Machine Learning · Computer Science 2025-05-13 Hang Wu, Jianian Zhu, Yinghui Li, Haojie Wang, Biao Hou, Jidong Zhai