Related papers: VELO: A Vector Database-Assisted Cloud-Edge Collab…

Queue Management for SLO-Oriented Large Language Model Serving

Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, with tight latency SLO requirements.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-02-26 · Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer

A Hybrid Swarm Intelligence Approach for Optimizing Multimodal Large Language Models Deployment in Edge-Cloud-based Federated Learning Environments

The combination of Federated Learning (FL), Multimodal Large Language Models (MLLMs), and edge-cloud computing enables distributed and real-time data processing while preserving privacy across edge devices and cloud infrastructure. However,…

Neural and Evolutionary Computing · Computer Science · 2025-02-19 · Gaith Rjouba, Hanae Elmekki, Saidul Islam, Jamal Bentahar, Rachida Dssouli

Orchestration for Domain-specific Edge-Cloud Language Models

The remarkable performance of Large Language Models (LLMs) has inspired many applications, which often necessitate edge-cloud collaboration due to connectivity, privacy, and cost considerations. Traditional methods primarily focus on…

Databases · Computer Science · 2025-07-15 · Prasoon Patidar, Alex Crown, Kevin Hsieh, Yifei Xu, Tusher Chakraborty, Ranveer Chandra, Yuvraj Agarwal

EACO-RAG: Towards Distributed Tiered LLM Deployment using Edge-Assisted and Collaborative RAG with Adaptive Knowledge Update

Large language models (LLMs) have demonstrated impressive capabilities in language tasks, but they require high computing power and rely on static knowledge. To overcome these limitations, Retrieval-Augmented Generation (RAG) incorporates…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-02-17 · Jiaxing Li, Chi Xu, Lianchen Jia, Feng Wang, Cong Zhang, Jiangchuan Liu

Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing

Running Large Language Models (LLMs) on edge devices is crucial for reducing latency, improving real-time processing, and enhancing privacy. By performing inference directly on the device, data does not need to be sent to the cloud,…

Hardware Architecture · Computer Science · 2025-10-21 · Tianhua Xia, Sai Qian Zhang

VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator

Large Language Models (LLMs) excel in natural language processing tasks but pose significant computational and memory challenges for edge deployment due to their intensive resource demands. This work addresses the efficiency of LLM…

Hardware Architecture · Computer Science · 2025-07-02 · Zhican Wang, Hongxiang Fan, Haroon Waris, Gang Wang, Zhenyu Li, Jianfei Jiang, Yanan Sun, Guanghui He

CE-LSLM: Efficient Large-Small Language Model Inference and Communication via Cloud-Edge Collaboration

Emerging intelligent service scenarios in 6G communication impose stringent requirements for low latency, high reliability, and privacy preservation. Generative large language models (LLMs) are gradually becoming key enablers for the…

Networking and Internet Architecture · Computer Science · 2025-05-21 · Pengyan Zhu, Tingting Yang

Quality-of-Service Aware LLM Routing for Edge Computing with Multiple Experts

Large Language Models (LLMs) have demonstrated remarkable capabilities, leading to a significant increase in user demand for LLM services. However, cloud-based LLM services often suffer from high latency, unstable responsiveness, and…

Networking and Internet Architecture · Computer Science · 2025-08-04 · Jin Yang, Qiong Wu, Zhiying Feng, Zhi Zhou, Deke Guo, Xu Chen

CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration

Large Language Models (LLMs) exhibit remarkable human-like predictive capabilities. However, it is challenging to deploy LLMs to provide efficient and adaptive inference services at the edge. This paper proposes a novel Cloud-Edge…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-06-10 · Hongpeng Jin, Yanzhao Wu

Efficient Memory Management for Large Language Model Serving with PagedAttention

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks…

Machine Learning · Computer Science · 2023-09-13 · Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica

Can Large Language Models Be Trusted as Evolutionary Optimizers for Network-Structured Combinatorial Problems?

Large Language Models (LLMs) have shown strong capabilities in language understanding and reasoning across diverse domains. Recently, there has been increasing interest in utilizing LLMs not merely as assistants in optimization tasks, but…

Neural and Evolutionary Computing · Computer Science · 2025-10-10 · Jie Zhao, Tao Wen, Kang Hao Cheong

Vision-Language Models for Edge Networks: A Comprehensive Survey

Vision Large Language Models (VLMs) combine visual understanding with natural language processing, enabling tasks like image captioning, visual question answering, and video analysis. While VLMs show impressive capabilities across domains…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-18 · Ahmed Sharshar, Latif U. Khan, Waseem Ullah, Mohsen Guizani

A Survey on Large Language Model Acceleration based on KV Cache Management

Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the…

Artificial Intelligence · Computer Science · 2025-07-31 · Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen

Mobile Edge Intelligence for Large Language Models: A Contemporary Survey

On-device large language models (LLMs), referring to running LLMs on edge devices, have raised considerable interest since they are more cost-effective, latency-efficient, and privacy-preserving compared with the cloud paradigm.…

Networking and Internet Architecture · Computer Science · 2025-03-21 · Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang

Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits

Large Language Models (LLMs) enable various applications on edge devices such as smartphones, wearables, and embodied robots. However, their deployment often depends on expensive cloud-based APIs, creating high operational costs, which…

Robotics · Computer Science · 2025-05-29 · Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda

ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput

The growing adoption of Large Language Models (LLMs) across various domains has driven the demand for efficient and scalable AI-serving solutions. Deploying LLMs requires optimizations to manage their significant computational and data…

Hardware Architecture · Computer Science · 2025-03-07 · Junsoo Kim, Hunjong Lee, Geonwoo Ko, Gyubin Choi, Seri Ham, Seongmin Hong, Joo-Young Kim

vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving

Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. This surge in demand poses significant challenges in optimizing throughput and latency while keeping costs manageable. The Key-Value…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-07-23 · Jiale Xu, Rui Zhang, Cong Guo, Weiming Hu, Zihan Liu, Feiyang Wu, Yu Feng, Shixuan Sun, Changxu Shao, Yuhong Guo, Junping Zhao, Ke Zhang, Minyi Guo, Jingwen Leng

Batch Query Processing and Optimization for Agentic Workflows

Large Language Models (LLMs) in agentic workflows combine multi-step reasoning, heterogeneous tool use, and collaboration across multiple specialized agents. Existing LLM serving engines optimize individual calls in isolation, while…

Databases · Computer Science · 2026-01-21 · Junyi Shen, Noppanat Wadlom, Yao Lu

CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands

Distributed prefix caching has become a core technique for efficient LLM serving. However, for long-context requests with high cache hit ratios, retrieving reusable KVCache blocks from remote servers has emerged as a new performance…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-03-24 · Weiye Wang, Chen Chen, Junxue Zhang, Zhusheng Wang, Hui Yuan, Zixuan Guan, Xiaolong Zheng, Qizhen Weng, Yin Chen, Minyi Guo

Can Large Language Models Be Query Optimizer for Relational Databases?

Query optimization, which finds the optimized execution plan for a given query, is a complex planning and decision-making problem within the exponentially growing plan space in database management systems (DBMS). Traditional optimizers…

Databases · Computer Science · 2025-02-11 · Jie Tan, Kangfei Zhao, Rui Li, Jeffrey Xu Yu, Chengzhi Piao, Hong Cheng, Helen Meng, Deli Zhao, Yu Rong