Related papers: LMStream: When Distributed Micro-Batch Stream Proc…

Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching

The increasing adoption of large language models (LLMs) necessitates inference serving systems that can deliver both high throughput and low latency. Deploying LLMs with hundreds of billions of parameters on memory-constrained GPUs exposes…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-03-10 · Bowen Pang, Kai Li, Feifan Wang

Benchmarking Distributed Stream Data Processing Systems

The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to…

Databases · Computer Science · 2019-06-27 · Jeyhun Karimov, Tilmann Rabl, Asterios Katsifodimos, Roman Samarev, Henri Heiskanen, Volker Markl

The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution

While ML model training and inference are both GPU-intensive, CPU-based data processing is often the bottleneck. Distributed data processing systems based on the batch or stream processing models assume homogeneous resource requirements.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-10-23 · Frank Sifei Luan, Ron Yifeng Wang, Yile Gu, Ziming Mao, Charlotte Lin, Amog Kamsetty, Hao Chen, Cheng Su, Balaji Veeramani, Scott Lee, SangBin Cho, Clark Zinzow, Eric Liang, Ion Stoica, Stephanie Wang

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving

Efficient LLM serving must balance throughput and latency across diverse, bursty workloads. We introduce StreamServe, a disaggregated prefill decode serving architecture that combines metric aware routing across compute lanes with adaptive…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-04-14 · Satyam Kumar, Arpit Singh Gautam, Kailash Talreja, Saurabh Jha

Streaming Hypergraph Partitioning Algorithms on Limited Memory Environments

Many well-known, real-world problems involve dynamic data which describe the relationship among the entities. Hypergraphs are powerful combinatorial structures that are frequently used to model such data. For many of today's data-centric…

Data Structures and Algorithms · Computer Science · 2021-03-10 · Fatih Taşyaran, Berkay Demireller, Kamer Kaya, Bora Uçar

TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling

Real-time LLM interactions demand streamed token generations, where text tokens are progressively generated and delivered to users while balancing two objectives: responsiveness (i.e., low time-to-first-token) and steady generation…

Machine Learning · Computer Science · 2025-10-06 · Junyi Chen, Chuheng Du, Renyuan Liu, Shuochao Yao, Dingtian Yan, Jiang Liao, Shengzhong Liu, Fan Wu, Guihai Chen

Towards a Multimodal Stream Processing System

In this paper, we present a vision for a new generation of multimodal streaming systems that embed MLLMs as first-class operators, enabling real-time query processing across multiple modalities. Achieving this is non-trivial: while recent…

Databases · Computer Science · 2025-11-12 · Uélison Jean Lopes dos Santos, Alessandro Ferri, Szilard Nistor, Riccardo Tommasini, Carsten Binnig, Manisha Luthra

StreamChat: Chatting with Streaming Video

This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on visual…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-01 · Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvarez

Scheduling of Intermittent Query Processing

Stream processing is usually done either on a tuple-by-tuple basis or in micro-batches. There are many applications where tuples over a predefined duration/window must be processed within certain deadlines. Processing such queries using…

Databases · Computer Science · 2024-09-23 · Saranya Chandrasekaran, S. Sudarshan

Exploring the Landscape of Distributed Graph Sketching

Recent work has initiated the study of dense graph processing using graph sketching methods, which drastically reduce space costs by lossily compressing information about the input graph. In this paper, we explore the strange and surprising…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-11-18 · David Tench, Evan T. West, Kenny Zhang, Michael Bender, Daniel DeLayo, Martin Farach-Colton, Gilvir Gill, Tyler Seip, Victor Zhang

Approximate Stream Analytics in Apache Flink and Apache Spark Streaming

Approximate computing aims for efficient execution of workflows where an approximate output is sufficient instead of the exact output. The idea behind approximate computing is to compute over a representative sample instead of the entire…

Distributed, Parallel, and Cluster Computing · Computer Science · 2017-09-12 · Do Le Quoc, Ruichuan Chen, Pramod Bhatotia, Christof Fetzer, Volker Hilt, Thorsten Strufe

Scaling Ordered Stream Processing on Shared-Memory Multicores

Many modern applications require real-time processing of large volumes of high-speed data. Such data processing needs can be modeled as a streaming computation. A streaming computation is specified as a dataflow graph that exposes multiple…

Databases · Computer Science · 2018-04-02 · Guna Prasaad, G. Ramalingam, Kaushik Rajan

SMDP-Based Dynamic Batching for Efficient Inference on GPU-Based Platforms

In up-to-date machine learning (ML) applications on cloud or edge computing platforms, batching is an important technique for providing efficient and economical services at scale. In particular, parallel computing resources on the…

Machine Learning · Computer Science · 2023-09-04 · Yaodan Xu, Jingzhou Sun, Sheng Zhou, Zhisheng Niu

Towards Concurrent Stateful Stream Processing on Multicore Processors (Technical Report)

Recent data stream processing systems (DSPSs) can achieve excellent performance when processing large volumes of data under tight latency constraints. However, they sacrifice support for concurrent state access that eases the burden of…

Databases · Computer Science · 2023-06-21 · Shuhao Zhang, Yingjun Wu, Feng Zhang, Bingsheng He

Streaming Graph Algorithms in the Massively Parallel Computation Model

We initiate the study of graph algorithms in the streaming setting on massive distributed and parallel systems inspired by practical data processing systems. The objective is to design algorithms that can efficiently process evolving graphs…

Data Structures and Algorithms · Computer Science · 2025-01-20 · Artur Czumaj, Gopinath Mishra, Anish Mukherjee

StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs

Efficient execution of deep learning workloads on dataflow architectures is crucial for overcoming memory bottlenecks and maximizing performance. While streaming intermediate results between computation kernels can significantly improve…

Hardware Architecture · Computer Science · 2025-09-24 · Hanchen Ye, Deming Chen

CStream: Parallel Data Stream Compression on Multicore Edge Devices

In the burgeoning realm of Internet of Things (IoT) applications on edge devices, data stream compression has become increasingly pertinent. The integration of added compression overhead and limited hardware resources on these devices calls…

Databases · Computer Science · 2024-06-18 · Xianzhi Zeng, Shuhao Zhang

ShuffleBench: A Benchmark for Large-Scale Data Shuffling Operations with Distributed Stream Processing Frameworks

Distributed stream processing frameworks help build scalable and reliable applications that perform transformations and aggregations on continuous data streams. This paper introduces ShuffleBench, a novel benchmark to evaluate the…

Software Engineering · Computer Science · 2024-03-08 · Sören Henning, Adriano Vogel, Michael Leichtfried, Otmar Ertl, Rick Rabiser

Resource- and Message Size-Aware Scheduling of Stream Processing at the Edge with application to Realtime Microscopy

Whilst computational resources at the cloud edge can be leveraged to improve latency and reduce the costs of cloud services for a wide variety of mobile, web, and IoT applications, such resources are naturally constrained. For distributed…

Distributed, Parallel, and Cluster Computing · Computer Science · 2019-12-20 · Ben Blamey, Ida-Maria Sintorn, Andreas Hellander, Salman Toor

StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

We introduce StreamDiffusion, a real-time diffusion pipeline designed for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction.…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-09 · Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Masayoshi Tomizuka, Kurt Keutzer