Related papers: SimpleTool: Parallel Decoding for Real-Time LLM Fu…

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response…

Artificial Intelligence · Computer Science 2026-05-29 Kou Shi, Ziao Zhang, Shiting Huang, Avery Nie, Zhen Fang, Qiuchen Wang, Lin Chen, Huaian Chen, Zehui Chen, Feng Zhao

RelayLLM: Efficient Reasoning via Collaborative Decoding

Using Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative…

Computation and Language · Computer Science 2026-01-09 Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang

An LLM-Tool Compiler for Fused Parallel Function Calling

State-of-the-art sequential reasoning in Large Language Models (LLMs) has expanded the capabilities of Copilots beyond conversational tasks to complex function calling, managing thousands of API calls. However, the tendency of compositional…

Programming Languages · Computer Science 2024-05-29 Simranjit Singh, Andreas Karatzas, Michael Fore, Iraklis Anagnostopoulos, Dimitrios Stamoulis

Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes,…

Computation and Language · Computer Science 2026-05-15 Guangyu Feng, Huanzhi Mao, Prabal Dutta, Joseph E. Gonzalez

An LLM Compiler for Parallel Function Calling

The reasoning capabilities of the recent LLMs enable them to execute external function calls to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has…

Computation and Language · Computer Science 2024-06-06 Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

ToolGen: Unified Tool Retrieval and Calling via Generation

As large language models (LLMs) advance, their inability to autonomously execute tasks by directly interacting with external tools remains a critical limitation. Traditional methods rely on inputting tool descriptions as context, which is…

Computation and Language · Computer Science 2025-04-01 Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, Haonan Li

SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

While Large Language Models (LLMs) have achieved remarkable success in various fields, the efficiency of training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass…

Computation and Language · Computer Science 2024-08-26 Quandong Wang, Yuxuan Yuan, Xiaoyu Yang, Ruike Zhang, Kang Zhao, Wei Liu, Jian Luan, Daniel Povey, Bin Wang

Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference

The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these…

Machine Learning · Computer Science 2025-10-01 Hao Mark Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan

dParallel: Learnable Parallel Decoding for dLLMs

Diffusion large language models (dLLMs) have recently drawn considerable attention within the research community as a promising alternative to autoregressive generation, offering parallel token prediction and lower inference latency. Yet,…

Computation and Language · Computer Science 2025-10-01 Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang

TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication

Efficient parallelization of Large Language Models (LLMs) with long sequences is essential but challenging due to their significant computational and memory demands, particularly stemming from communication bottlenecks in attention…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-31 Zongwu Wang, Fangxin Liu, Mingshuai Li, Li Jiang

FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees

Finetuning large language models (LLMs) is essential for task adaptation, yet today's serving stacks isolate inference and finetuning on separate GPU clusters -- wasting resources and under-utilizing hardware. We introduce FlexLLM, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-27 Gabriele Oliaro, Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, Ruohan Gao, Yingyi Huang, Remi Delacourt, April Yang, Yingcheng Wang, Colin Unger, Zhihao Jia

From Prompts to Performance: Evaluating LLMs for Task-based Parallel Code Generation

Large Language Models (LLMs) show strong abilities in code generation, but their skill in creating efficient parallel programs is less studied. This paper explores how LLMs generate task-based parallel code from three kinds of input prompts:…

Programming Languages · Computer Science 2026-02-27 Linus Bantel, Moritz Strack, Alexander Strack, Dirk Pflüger

SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference…

Computation and Language · Computer Science 2025-06-03 Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang

SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding

Low-latency decoding for large language models (LLMs) is crucial for applications like chatbots and code assistants, yet generating long outputs remains slow in single-query settings. Prior work on speculative decoding (which combines a…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-16 Ziyi Zhang, Ziheng Jiang, Chengquan Jiang, Menghan Yu, Size Zheng, Haibin Lin, Henry Hoffmann, Xin Liu

Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning

Large Language Models (LLMs) have exhibited significant potential in performing diverse tasks, including the ability to call functions or use external tools to enhance their performance. While current research on function calling by LLMs…

Computation and Language · Computer Science 2025-03-04 Mingyang Chen, Haoze Sun, Tianpeng Li, Fan Yang, Hao Liang, Keer Lu, Bin Cui, Wentao Zhang, Zenan Zhou, Weipeng Chen

Generalizable End-to-End Tool-Use RL with Synthetic CodeGym

Tool-augmented large language models (LLMs), hereafter LLM agents, leverage external tools to solve diverse tasks and interface with the real world. However, current training practices largely rely on supervised fine-tuning (SFT) over…

Machine Learning · Computer Science 2026-03-18 Weihua Du, Hailei Gong, Zhan Ling, Kang Liu, Lingfeng Shen, Xuesong Yao, Yufei Xu, Dingyuan Shi, Yiming Yang, Jiecao Chen

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science 2024-04-19 Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

Tandem Transformers for Inference Efficient LLMs

The autoregressive nature of conventional large language models (LLMs) inherently limits inference speed, as tokens are generated sequentially. While speculative and parallel decoding techniques attempt to mitigate this, they face…

Artificial Intelligence · Computer Science 2024-10-22 Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks…

Artificial Intelligence · Computer Science 2026-05-26 Yijuan Liang, Xinghao Chen, Yifan Ge, Ziyi Wu, Hao Wu, Changyu Zeng, Wei Xing, Xiaoyu Shen

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

The cost of serving large language models (LLMs) is high, but expensive and scarce GPUs are used inefficiently when generating tokens sequentially unless the batch of sequences is enlarged. However, the batch size is limited by some…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-19 Jiaao He, Jidong Zhai