Related papers: FlashSampling: Fast and Memory-Efficient Exact Sam…

SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving

As large language models (LLMs) scale out with tensor parallelism (TP) and pipeline parallelism (PP) and production stacks have aggressively optimized the data plane (attention/GEMM and KV cache), sampling, the decision plane that turns…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-02 Bohan Zhao, Zane Cao, Yongchao He

FLASH-MAXSIM: IO-Aware Fused Kernels for Late-Interaction Scoring

Late-interaction retrieval (ColBERT, ColPali) scores a query against a document with the MaxSim operator: for every query token, the maximum similarity over the document tokens, summed over query tokens. The standard implementation…

Information Retrieval · Computer Science 2026-05-29 Roi Pony, Adi Raz Goldfarb, Idan Friedman, Daniel Ezer, Udi Barzelay

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

With the fast growth of parameter size, it becomes increasingly challenging to deploy large generative models as they typically require large GPU memory consumption and massive computation. Unstructured model pruning has been a common…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-20 Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song

FlashDecoding++: Faster Large Language Model Inference on GPUs

The Large Language Model (LLM) is becoming increasingly important in various domains. However, the following challenges still remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation…

Machine Learning · Computer Science 2024-01-08 Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang

FlashNorm: Fast Normalization for Transformers

Normalization layers are ubiquitous in large language models (LLMs) yet represent a compute bottleneck: on hardware with distinct vector and matrix execution units, the RMS calculation blocks the subsequent matrix multiplication, preventing…

Machine Learning · Computer Science 2026-04-28 Nils Graef, Filip Makraduli, Andrew Wasielewski, Matthew Clapp

FlashMP: Fast Discrete Transform-Based Solver for Preconditioning Maxwell's Equations on GPUs

Efficiently solving large-scale linear systems is a critical challenge in electromagnetic simulations, particularly when using the Crank-Nicolson Finite-Difference Time-Domain (CN-FDTD) method. Existing iterative solvers are commonly…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-24 Haoyuan Zhang, Yaqian Gao, Xinxin Zhang, Jialin Li, Runfeng Jin, Yidong Chen, Feng Zhang, Wu Yuan, Wenpeng Ma, Shan Liang, Jian Zhang, Zhonghua Lu

FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

Despite a big leap forward in capability, multimodal large language models (MLLMs) tend to behave like a sloth in practical use, i.e., slow response and large latency. Recent efforts are devoted to building tiny MLLMs for better efficiency,…

Computer Vision and Pattern Recognition · Computer Science 2024-12-06 Bo Tong, Bokai Lai, Yiyi Zhou, Gen Luo, Yunhang Shen, Ke Li, Xiaoshuai Sun, Rongrong Ji

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor…

Machine Learning · Computer Science 2025-03-14 Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound, resulting in high latency and significant waste of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding…

Machine Learning · Computer Science 2024-02-06 Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang

FlashGMM: Fast Gaussian Mixture Entropy Model for Learned Image Compression

High-performance learned image compression codecs require flexible probability models to fit latent representations. Gaussian Mixture Models (GMMs) were proposed to satisfy this demand, but suffer from a significant runtime performance…

Image and Video Processing · Electrical Eng. & Systems 2025-09-24 Shimon Murai, Fangzheng Lin, Jiro Katto

FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection

The scaling of computation throughput continues to outpace improvements in memory bandwidth, making many deep learning workloads memory-bound. Kernel fusion is a key technique to alleviate this problem, but the fusion strategies of existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-16 Ziyu Huang, Yangjie Zhou, Zihan Liu, Xinhao Luo, Yijia Diao, Minyi Guo, Jidong Zhai, Yu Feng, Chen Zhang, Anbang Wu, Jingwen Leng

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

Sparse Matrix-matrix Multiplication (SpMM) and Sampled Dense-dense Matrix Multiplication (SDDMM) are important sparse operators in scientific computing and deep learning. Tensor Core Units (TCUs) enhance modern accelerators with superior…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-12-17 Jinliang Shi, Shigang Li, Youxuan Xu, Rongtian Fu, Xueying Wang, Tong Wu

FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation

Recent advancements in latent diffusion models (LDMs) have markedly enhanced text-to-audio generation, yet their iterative sampling processes impose substantial computational demands, limiting practical deployment. While recent methods…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-04 Huadai Liu, Jialei Wang, Rongjie Huang, Yang Liu, Heng Lu, Zhou Zhao, Wei Xue

FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

Language models are increasingly adopting smaller architectures optimized for consumer devices. In this setting, inference efficiency is the primary constraint. Meanwhile, vocabulary sizes continue to grow rapidly, making the classification…

Machine Learning · Computer Science 2026-03-17 Wilhelm Tranheden, Shahnawaz Ahmed, Devdatt Dubhashi, Jonna Matthiesen, Hannes von Essen

Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference

Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the…

Computation and Language · Computer Science 2025-10-28 Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu

FlashMol: High-Quality Molecule Generation in as Few as Four Steps

Generating chemically valid 3D molecular conformations is critical for computational drug discovery. Classical diffusion-based models like GeoLDM perform well but require hundreds of steps, making large-scale in silico screening…

Machine Learning · Computer Science 2026-05-11 Xinyuan Wei, Zian Li, Shaoheng Yan, Cai Zhou, Muhan Zhang

MagicPIG: LSH Sampling for Efficient LLM Generation

Large language models (LLMs) with long context windows have gained significant attention. However, the KV cache, stored to avoid re-computation, becomes a bottleneck. Various dynamic sparse or TopK-based attention approximation methods have…

Computation and Language · Computer Science 2024-12-19 Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen

FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream…

Computation and Language · Computer Science 2025-10-10 Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S. Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta

FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models

Singular Value Decomposition (SVD) has recently seen a surge of interest as a simple yet powerful tool for large language model (LLM) compression, with a growing number of works demonstrating 20-80% parameter reductions at minimal…

Machine Learning · Computer Science 2025-08-05 Zishan Shao, Yixiao Wang, Qinsi Wang, Ting Jiang, Zhixu Du, Hancheng Ye, Danyang Zhuo, Yiran Chen, Hai Li

Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling

Modern large language model workloads put increasing demands on parallel compute capability and on-chip memory capacity, while also stressing fine-grained data movement and synchronization. These trends motivate exploring and designing…

Hardware Architecture · Computer Science 2026-05-11 Yinrong Li, Zexin Fu, Yichao Zhang, Germain Haugou, Chi Zhang, Marco Bertuletti, Bowen Wang, Luca Benini