Related papers: Parallel Loop Transformer for Efficient Test-Time …

200 papers

Efficient large-scale inference of transformer-based large language models (LLMs) remains a fundamental systems challenge, frequently requiring multi-GPU parallelism to meet stringent latency and throughput targets. Conventional tensor…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-02-10 · Chong Wang, Nan Du, Tom Gunter, Tao Lei, Kulin Seth, Senyu Tong, Jianyu Wang, Guoli Yin, Xiyou Zhou, Kelvin Zou, Ruoming Pang

Efficient parallelism is necessary for achieving low-latency, high-throughput inference with large language models (LLMs). Tensor parallelism (TP) is the state-of-the-art method for reducing LLM response latency; however, GPU communications…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-01-27 · Mert Hidayetoglu, Aurick Qiao, Michael Wyatt, Jeff Rasley, Yuxiong He, Samyam Rajbhandari

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT…

Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by…

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-10-17 · Akrit Mudvari, Yuang Jiang, Leandros Tassiulas

Breakthroughs in the generative AI domain have fueled an explosion of large language model (LLM)-powered applications, whose workloads fundamentally consist of sequences of inferences through transformer architectures. Within this rapidly…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-03-09 · Burak Topcu, Musa Oguzhan Cim, Poovaiah Palangappa, Meena Arunachalam, Mahmut Taylan Kandemir

Large Language Models (LLMs) have pushed the frontier of artificial intelligence but comprise hundreds of billions of parameters and operations. To reduce inference latency, LLMs are deployed on multiple hardware accelerators…

Machine Learning · Computer Science · 2026-01-07 · Jan Hansen-Palmus, Michael Truong Le, Oliver Hausdörfer, Alok Verma

Transformers have become the dominant architecture for sequence modeling by using self-attention to enable expressive and highly parallel processing. However, the resulting quadratic time and memory costs limit efficiency in long-context…

Machine Learning · Computer Science · 2026-05-19 · Tristan Gaudreault, Yongyi Mao

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the…

We study Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source…

Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We…

Computation and Language · Computer Science · 2026-05-29 · Markus Frey, Behzad Shomali, Joachim Koehler, Mehdi Ali

Large Language Models (LLMs) often exhibit a gap between their internal knowledge and their explicit linguistic outputs. In this report, we empirically investigate whether Looped Transformers (LTs)--architectures that increase computational…

Computation and Language · Computer Science · 2026-01-16 · Guanxu Chen, Dongrui Liu, Jing Shao

Looped Transformers (LT) have emerged as a powerful architecture by iterating their layers multiple times before decoding the final token. However, pairing them with full attention retains quadratic complexity, making them computationally…

Machine Learning · Computer Science · 2026-05-26 · Chunyuan Deng, Yizhe Zhang, Rui-Jie Zhu, Yuanyuan Xu, Jiarui Liu, T. S. Eugene Ng, Hanjie Chen

Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent in their…

Machine Learning · Computer Science · 2026-04-28 · Anej Svete, Ashish Sabharwal

As large language models (LLMs) have shown great success in many tasks, they are used in various applications. While much work has focused on the efficiency of single-LLM applications (e.g., offloading, request scheduling, parallelism…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-03-24 · Jingzhi Fang, Yanyan Shen, Yue Wang, Lei Chen

Capturing complex user preferences from sparse behavioral sequences remains a fundamental challenge in sequential recommendation. Recent latent reasoning methods have shown promise by extending test-time computation through multi-step…

Information Retrieval · Computer Science · 2026-01-07 · Jiakai Tang, Xu Chen, Wen Chen, Jian Wu, Yuning Jiang, Bo Zheng

Real-world simultaneous machine translation (SimulMT) systems face more challenges than just the quality-latency trade-off. They also need to address issues related to robustness with noisy input, processing long contexts, and flexibility…

Computation and Language · Computer Science · 2025-11-18 · Minghan Wang, Jinming Zhao, Thuy-Trang Vu, Fatemeh Shiri, Ehsan Shareghi, Gholamreza Haffari

Test-Time Scaling (TTS) has proven effective in improving the performance of Large Language Models (LLMs) during inference. However, existing research has overlooked the efficiency of TTS from a latency-sensitive perspective. Through a…

Computation and Language · Computer Science · 2025-09-15 · Zili Wang, Tianyu Zhang, Haoli Bai, Lu Hou, Xianzhi Yu, Wulong Liu, Shiming Xiang, Lei Zhu

In recent years, Large Language Models (LLMs) have made significant strides towards Artificial General Intelligence. However, training these models from scratch requires substantial computational resources and vast amounts of text data. In…

Computation and Language · Computer Science · 2024-10-03 · Wenzhen Zheng, Wenbo Pan, Xu Xu, Libo Qin, Li Yue, Ming Zhou

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices…
