Related papers: FlashSampling: Fast and Memory-Efficient Exact Sam…

200 papers

As large language models (LLMs) scale out with tensor parallelism (TP) and pipeline parallelism (PP) and production stacks have aggressively optimized the data plane (attention/GEMM and KV cache), sampling, the decision plane that turns…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-02 Bohan Zhao, Zane Cao, Yongchao He

Late-interaction retrieval (ColBERT, ColPali) scores a query against a document with the MaxSim operator: for every query token, the maximum similarity over the document tokens, summed over query tokens. The standard implementation…

Information Retrieval · Computer Science 2026-05-29 Roi Pony, Adi Raz Goldfarb, Idan Friedman, Daniel Ezer, Udi Barzelay

With the fast growth of parameter size, it becomes increasingly challenging to deploy large generative models as they typically require large GPU memory consumption and massive computation. Unstructured model pruning has been a common…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-20 Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song

Large Language Models (LLMs) are becoming increasingly important in various domains. However, the following challenges remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation…

Machine Learning · Computer Science 2024-01-08 Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang

Normalization layers are ubiquitous in large language models (LLMs) yet represent a compute bottleneck: on hardware with distinct vector and matrix execution units, the RMS calculation blocks the subsequent matrix multiplication, preventing…

Machine Learning · Computer Science 2026-04-28 Nils Graef, Filip Makraduli, Andrew Wasielewski, Matthew Clapp

Efficiently solving large-scale linear systems is a critical challenge in electromagnetic simulations, particularly when using the Crank-Nicolson Finite-Difference Time-Domain (CN-FDTD) method. Existing iterative solvers are commonly…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-24 Haoyuan Zhang, Yaqian Gao, Xinxin Zhang, Jialin Li, Runfeng Jin, Yidong Chen, Feng Zhang, Wu Yuan, Wenpeng Ma, Shan Liang, Jian Zhang, Zhonghua Lu

Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., slow response and large latency. Recent efforts are devoted to building tiny MLLMs for better efficiency,…

Computer Vision and Pattern Recognition · Computer Science 2024-12-06 Bo Tong, Bokai Lai, Yiyi Zhou, Gen Luo, Yunhang Shen, Ke Li, Xiaoshuai Sun, Rongrong Ji

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor…

Machine Learning · Computer Science 2025-03-14 Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound, resulting in high latency and significant waste of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding…

Machine Learning · Computer Science 2024-02-06 Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang

High-performance learned image compression codecs require flexible probability models to fit latent representations. Gaussian Mixture Models (GMMs) were proposed to satisfy this demand, but suffer from a significant runtime performance…

Image and Video Processing · Electrical Eng. & Systems 2025-09-24 Shimon Murai, Fangzheng Lin, Jiro Katto

The scaling of computation throughput continues to outpace improvements in memory bandwidth, making many deep learning workloads memory-bound. Kernel fusion is a key technique to alleviate this problem, but the fusion strategies of existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-16 Ziyu Huang, Yangjie Zhou, Zihan Liu, Xinhao Luo, Yijia Diao, Minyi Guo, Jidong Zhai, Yu Feng, Chen Zhang, Anbang Wu, Jingwen Leng

Sparse Matrix-matrix Multiplication (SpMM) and Sampled Dense-dense Matrix Multiplication (SDDMM) are important sparse operators in scientific computing and deep learning. Tensor Core Units (TCUs) enhance modern accelerators with superior…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-17 Jinliang Shi, Shigang Li, Youxuan Xu, Rongtian Fu, Xueying Wang, Tong Wu

Recent advancements in latent diffusion models (LDMs) have markedly enhanced text-to-audio generation, yet their iterative sampling processes impose substantial computational demands, limiting practical deployment. While recent methods…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-04 Huadai Liu, Jialei Wang, Rongjie Huang, Yang Liu, Heng Lu, Zhou Zhao, Wei Xue

Language models are increasingly adopting smaller architectures optimized for consumer devices. In this setting, inference efficiency is the primary constraint. Meanwhile, vocabulary sizes continue to grow rapidly, making the classification…

Machine Learning · Computer Science 2026-03-17 Wilhelm Tranheden, Shahnawaz Ahmed, Devdatt Dubhashi, Jonna Matthiesen, Hannes von Essen

Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the…

Computation and Language · Computer Science 2025-10-28 Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu

Generating chemically valid 3D molecular conformations is critical for computational drug discovery. Classical diffusion-based models like GeoLDM perform well but require hundreds of steps, making large-scale in silico screening…

Machine Learning · Computer Science 2026-05-11 Xinyuan Wei, Zian Li, Shaoheng Yan, Cai Zhou, Muhan Zhang

Large language models (LLMs) with long context windows have gained significant attention. However, the KV cache, stored to avoid re-computation, becomes a bottleneck. Various dynamic sparse or TopK-based attention approximation methods have…

Computation and Language · Computer Science 2024-12-19 Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen

Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream…

Computation and Language · Computer Science 2025-10-10 Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S. Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta

Singular Value Decomposition (SVD) has recently seen a surge of interest as a simple yet powerful tool for large language model (LLM) compression, with a growing number of works demonstrating 20-80% parameter reductions at minimal…

Machine Learning · Computer Science 2025-08-05 Zishan Shao, Yixiao Wang, Qinsi Wang, Ting Jiang, Zhixu Du, Hancheng Ye, Danyang Zhuo, Yiran Chen, Hai Li

Modern large language model workloads put increasing demands on parallel compute capability and on-chip memory capacity, while also stressing fine-grained data movement and synchronization. These trends motivate exploring and designing…

Hardware Architecture · Computer Science 2026-05-11 Yinrong Li, Zexin Fu, Yichao Zhang, Germain Haugou, Chi Zhang, Marco Bertuletti, Bowen Wang, Luca Benini