Related papers: Two-Chains: High Performance Framework for Functio…

UCX Programming Interface for Remote Function Injection and Invocation

Network library APIs have historically been developed with the emphasis on data movement, placement, and communication semantics. Many communication semantics are available across a large variety of network libraries, such as send-receive,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-02 Luis E. Peña, Wenbin Lu, Pavel Shamis, Steve Poole

Performance Models for a Two-tiered Storage System

This work describes the design, implementation and performance analysis of a distributed two-tiered storage software. The first tier functions as a distributed software cache implemented using solid-state devices (NVMes) and the second tier…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-13 Aparna Sasidharan, Xian-He, Jay Lofstead, Scott Klasky

Synch: A framework for concurrent data-structures and benchmarks

The recent advancements in multicore machines highlight the need to simplify concurrent programming in order to leverage their computational power. One way to achieve this is by designing efficient concurrent data structures (e.g. stacks,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-31 Nikolaos D. Kallimanis

Relaxing Concurrent Data-structure Semantics for Increasing Performance: A Multi-structure 2D Design Framework

There has been a significant amount of work in the literature proposing semantic relaxation of concurrent data structures for improving scalability and performance. By relaxing the semantics of a data structure, a bigger design space, that…

Data Structures and Algorithms · Computer Science 2025-11-11 Adones Rukundo, Aras Atalar, Philippas Tsigas

Bring the BitCODE -- Moving Compute and Data in Distributed Heterogeneous Systems

In this paper, we present a framework for moving compute and data between processing elements in a distributed heterogeneous system. The implementation of the framework is based on the LLVM compiler toolchain combined with the UCX…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-13 Wenbin Lu, Luis E. Peña, Pavel Shamis, Valentin Churavy, Barbara Chapman, Steve Poole

HEP-BNN: A Framework for Finding Low-Latency Execution Configurations of BNNs on Heterogeneous Multiprocessor Platforms

Binarized Neural Networks (BNNs) significantly reduce the computation and memory demands with binarized weights and activations compared to full-precision NNs. Executing a layer in a BNN on different devices of a heterogeneous…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-13 Leonard David Bereholschi, Ching-Chi Lin, Mikail Yayla, Jian-Jia Chen

TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication

Efficient parallelization of Large Language Models (LLMs) with long sequences is essential but challenging due to their significant computational and memory demands, particularly stemming from communication bottlenecks in attention…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-31 Zongwu Wang, Fangxin Liu, Mingshuai Li, Li Jiang

Towards Efficient Hash Maps in Functional Array Languages

We present a systematic derivation of a data-parallel implementation of two-level, static and collision-free hash maps, by giving a functional formulation of the Fredman et al. construction, and then flattening it. We discuss the challenges…

Programming Languages · Computer Science 2025-08-18 William Henrich Due, Martin Elsman, Troels Henriksen

Learning-based Dynamic Pinning of Parallelized Applications in Many-Core Systems

Motivated by the need for adaptive, secure and responsive scheduling in a great range of computing applications, including human-centered and time-critical applications, this paper proposes a scheduling framework that seamlessly adds…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-14 Georgios C. Chasparis, Vladimir Janjic, Michael Rossbory

Bilateral Network with Residual U-blocks and Dual-Guided Attention for Real-time Semantic Segmentation

When some application scenarios need to use semantic segmentation technology, like automatic driving, the primary concern comes to real-time performance rather than extremely high segmentation accuracy. To achieve a good trade-off between…

Computer Vision and Pattern Recognition · Computer Science 2023-11-01 Liang Liao, Liang Wan, Mingsheng Liu, Shusheng Li

Chunks and Tasks: a programming model for parallelization of dynamic algorithms

We propose Chunks and Tasks, a parallel programming model built on abstractions for both data and work. The application programmer specifies how data and work can be split into smaller pieces, chunks and tasks, respectively. The Chunks and…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-07-29 Emanuel H. Rubensson, Elias Rudberg

DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving

In LLM serving, reusing the KV cache of prompts across requests is critical for reducing TTFT and serving costs. Cache-affinity scheduling, which co-locates requests with the same prompt prefix to maximize KV cache reuse, often conflicts…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-09 Ying Yuan, Pengfei Zuo, Bo Wang, Zhangyu Chen, Zhipeng Tan, Zhou Yu

Comparing One with Many -- Solving Binary2source Function Matching Under Function Inlining

Binary2source function matching is a fundamental task for many security applications, including Software Component Analysis (SCA). The "1-to-1" mechanism has been applied in existing binary2source matching works, in which one binary…

Software Engineering · Computer Science 2022-10-28 Ang Jia, Ming Fan, Xi Xu, Wuxia Jin, Haijun Wang, Qiyi Tang, Sen Nie, Shi Wu, Ting Liu

Optimal Compression for Two-Field Entries in Fixed-Width Memories

Data compression is a well-studied (and well-solved) problem in the setup of long coding blocks. But important emerging applications need to compress data to memory words of small fixed widths. This new setup is the subject of this paper.…

Information Theory · Computer Science 2017-01-12 Ori Rottenstreich, Yuval Cassuto

Bind: a Partitioned Global Workflow Parallel Programming Model

High Performance Computing is notorious for its long and expensive software development cycle. To address this challenge, we present Bind: a "partitioned global workflow" parallel programming model for C++ applications that enables quick…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-16 Alex Kosenkov, Matthias Troyer

Exploring the Relation Between Two Levels of Scheduling Using a Novel Simulation Approach

Modern high performance computing (HPC) systems exhibit a rapid growth in size, both "horizontally" in the number of nodes, as well as "vertically" in the number of cores per node. As such, they offer additional levels of hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-06 Ahmed Eleliemy, Ali Mohammed, Florina M. Ciorba

Building Blocks for Network-Accelerated Distributed File Systems

High-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, applications are doomed to become I/O bound and waste compute cycles. To accelerate the data…

Networking and Internet Architecture · Computer Science 2022-06-22 Salvatore Di Girolamo, Daniele De Sensi, Konstantin Taranov, Milos Malesevic, Maciej Besta, Timo Schneider, Severin Kistler, Torsten Hoefler

Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture

This work elaborates on a High performance computing (HPC) architecture based on Simple Linux Utility for Resource Management (SLURM) [1] for deploying heterogeneous Large Language Models (LLMs) into a scalable inference engine. Dynamic…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-26 Anderson de Lima Luiz, Shubham Vijay Kurlekar, Munir Georges

Exploiting the Structure of Two Graphs with Graph Neural Networks

Graph neural networks (GNNs) have emerged as a promising solution to deal with unstructured data, outperforming traditional deep learning architectures. However, most of the current GNN models are designed to work with a single graph, which…

Machine Learning · Computer Science 2024-11-11 Victor M. Tenorio, Antonio G. Marques

High-performance symbolic-numerics via multiple dispatch

As mathematical computing becomes more democratized in high-level languages, high-performance symbolic-numeric systems are necessary for domain scientists and engineers to get the best performance out of their machine without deep knowledge…

Computation and Language · Computer Science 2022-02-08 Shashi Gowda, Yingbo Ma, Alessandro Cheli, Maja Gwozdz, Viral B. Shah, Alan Edelman, Christopher Rackauckas