Related papers: Parallel Loop Transformer for Efficient Test-Time …

Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization

Efficient large-scale inference of transformer-based large language models (LLMs) remains a fundamental systems challenge, frequently requiring multi-GPU parallelism to meet stringent latency and throughput targets. Conventional tensor…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-10 Chong Wang, Nan Du, Tom Gunter, Tao Lei, Kulin Seth, Senyu Tong, Jianyu Wang, Guoli Yin, Xiyou Zhou, Kelvin Zou, Ruoming Pang

Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads

Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency, however GPU communications…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-27 Mert Hidayetoglu, Aurick Qiao, Michael Wyatt, Jeff Rasley, Yuxiong He, Samyam Rajbhandari

Byte Latent Transformer: Patches Scale Better Than Tokens

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT…

Computation and Language · Computer Science 2024-12-16 Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by…

Computation and Language · Computer Science 2026-05-20 Victor Conchello Vendrell, Arnau Padres Masdemont, Niccolò Grillo, Jordi Ros-Giralt, Arash Behboodi, Fabio Valerio Massoli

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-17 Akrit Mudvari, Yuang Jiang, Leandros Tassiulas

Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks

Breakthroughs in the generative AI domain have fueled an explosion of large language model (LLM)-powered applications, whose workloads fundamentally consist of sequences of inferences through transformer architectures. Within this rapidly…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-09 Burak Topcu, Musa Oguzhan Cim, Poovaiah Palangappa, Meena Arunachalam, Mahmut Taylan Kandemir

Communication Compression for Tensor Parallel LLM Inference

Large Language Models (LLMs) have pushed the frontier of artificial intelligence but are comprised of hundreds of billions of parameters and operations. For faster inference latency, LLMs are deployed on multiple hardware accelerators…

Machine Learning · Computer Science 2026-01-07 Jan Hansen-Palmus, Michael Truong Le, Oliver Hausdörfer, Alok Verma

Parallel Recursive LSTM

Transformers have become the dominant architecture for sequence modeling by using self-attention to enable expressive and highly parallel processing. However, the resulting quadratic time and memory costs limit efficiency in long-context…

Machine Learning · Computer Science 2026-05-19 Tristan Gaudreault, Yongyi Mao

Fast Byte Latent Transformer

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the…

Computation and Language · Computer Science 2026-05-11 Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, Christopher Potts, Xiaochuang Han, Srinivasan Iyer

Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior

We study Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source…

Machine Learning · Computer Science 2026-05-27 Zeyi Huang, Xuehai He, LiLiang Ren, Yiping Wang, Baolin Peng, Hao Cheng, Shuohang Wang, Pengcheng He, Jianfeng Gao, Yong Jae Lee, Yelong Shen

A Dual-Path Architecture for Scaling Compute and Capacity in LLMs

Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We…

Computation and Language · Computer Science 2026-05-29 Markus Frey, Behzad Shomali, Joachim Koehler, Mehdi Ali

Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs?

Large Language Models (LLMs) often exhibit a gap between their internal knowledge and their explicit linguistic outputs. In this report, we empirically investigate whether Looped Transformers (LTs)--architectures that increase computational…

Computation and Language · Computer Science 2026-01-16 Guanxu Chen, Dongrui Liu, Jing Shao

LT2: Linear-Time Looped Transformers

Looped Transformers (LT) have emerged as a powerful architecture by iterating their layers multiple times before decoding the final token. However, pairing them with full attention retains quadratic complexity, making them computationally…

Machine Learning · Computer Science 2026-05-26 Chunyuan Deng, Yizhe Zhang, Rui-Jie Zhu, Yuanyuan Xu, Jiarui Liu, T. S. Eugene Ng, Hanjie Chen

On the Reasoning Abilities of Masked Diffusion Language Models

Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent in their…

Machine Learning · Computer Science 2026-04-28 Anej Svete, Ashish Sabharwal

Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation

As large language models (LLMs) have shown great success in many tasks, they are used in various applications. While a lot of works have focused on the efficiency of single-LLM application (e.g., offloading, request scheduling, parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-24 Jingzhi Fang, Yanyan Shen, Yue Wang, Lei Chen

Parallel Latent Reasoning for Sequential Recommendation

Capturing complex user preferences from sparse behavioral sequences remains a fundamental challenge in sequential recommendation. Recent latent reasoning methods have shown promise by extending test-time computation through multi-step…

Information Retrieval · Computer Science 2026-01-07 Jiakai Tang, Xu Chen, Wen Chen, Jian Wu, Yuning Jiang, Bo Zheng

Simultaneous Machine Translation with Large Language Models

Real-world simultaneous machine translation (SimulMT) systems face more challenges than just the quality-latency trade-off. They also need to address issues related to robustness with noisy input, processing long contexts, and flexibility…

Computation and Language · Computer Science 2025-11-18 Minghan Wang, Jinming Zhao, Thuy-Trang Vu, Fatemeh Shiri, Ehsan Shareghi, Gholamreza Haffari

Faster and Better LLMs via Latency-Aware Test-Time Scaling

Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a…

Computation and Language · Computer Science 2025-09-15 Zili Wang, Tianyu Zhang, Haoli Bai, Lu Hou, Xianzhi Yu, Wulong Liu, Shiming Xiang, Lei Zhu

Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

In recent years, Large Language Models (LLMs) have made significant strides towards Artificial General Intelligence. However, training these models from scratch requires substantial computational resources and vast amounts of text data. In…

Computation and Language · Computer Science 2024-10-03 Wenzhen Zheng, Wenbo Pan, Xu Xu, Libo Qin, Li Yue, Ming Zhou

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices…

Computation and Language · Computer Science 2024-08-01 Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar