Related papers: CPMA: An Efficient Batch-Parallel Compressed Set W…
Many modern programming languages are shifting toward a functional style for collection interfaces such as sets, maps, and sequences. Functional interfaces offer many advantages, including being safe for parallelism and providing simple and…
The ever-increasing demand for 3D modeling in the emerging immersive applications has made point clouds an essential class of data for 3D image and video processing. Tree based structures are commonly used for representing point clouds…
In an $(H,r)$ combination network, a single content library is delivered to ${H\choose r}$ users through deployed $H$ relays without cache memories, such that each user with local cache memories is simultaneously served by a different…
This paper focuses on reducing memory usage in enumerative model checking, while maintaining the multi-core scalability obtained in earlier work. We present a tree-based multi-core compression method, which works by leveraging sharing among…
Due to the dynamic nature of real-world graphs, there has been a growing interest in the graph-streaming setting where a continuous stream of graph updates is mixed with arbitrary graph queries. In principle, purely-functional trees are an…
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such "nested" functions is the…
Dynamic tree data structures maintain a forest while supporting insertion and deletion of edges and a broad set of queries in $O(\log n)$ time per operation. Such data structures are at the core of many modern algorithms. Recent work has…
The attention mechanism requires huge computational efforts to process unnecessary calculations, significantly limiting the system's performance. Researchers propose sparse attention to convert some DDMM operations to SDDMM and SpMM…
The growing volume of data in modern applications has led to significant computational costs in conventional processor-centric systems. Processing-in-memory (PIM) architectures alleviate these costs by moving computation closer to memory,…
Joint compression of point cloud geometry and attributes is essential for efficient 3D data representation. Existing methods often rely on post-hoc recoloring procedures and manually tuned bitrate allocation between geometry and attribute…
Existing point cloud completion methods struggle to balance high-quality reconstruction with computational efficiency. To address this, we propose PPC-MT, a novel parallel framework for point cloud completion leveraging a hybrid…
Memory disaggregation architecture physically separates CPU and memory into independent components, which are connected via high-speed RDMA networks, greatly improving resource utilization of databases. However, such an architecture poses…
Subgraph matching is a basic operation widely used in many applications. However, due to its NP-hardness and the explosive growth of graph data, it is challenging to compute subgraph matching, especially in large graphs. In this paper, we…
Carrier Sense Multiple Access (CSMA) is widely used as a Medium Access Control (MAC) in wireless networks due to its simplicity and distributed nature. This motivated researchers to find CSMA schemes that achieve throughput optimality. In…
Multi-headed Attention's (MHA) quadratic compute and linearly growing KV-cache make long-context transformers expensive to train and serve. Prior works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) shrink the cache,…
Coded caching scheme, which is an effective technique to increase the transmission efficiency during peak traffic times, has recently become quite popular among the coding community. Generally rate can be measured to the transmission in the…
Previous partial permutation synchronization (PPS) algorithms, which are commonly used for multi-object matching, often involve computation-intensive and memory-demanding matrix operations. These operations become intractable for large…
Caching is an efficient technique to reduce peak traffic by storing popular content in local caches. Placement delivery array (PDA) proposed by Yan et al. is a combinatorial structure to design coded caching schemes with uncoded placement…
We propose a parallel graph-based data clustering algorithm using CUDA GPU, based on exact clustering of the minimum spanning tree in terms of a minimum isoperimetric criteria. We also provide a comparative performance analysis of our…
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). The systolic array (SA) is a pipelined 2D array of processing elements…