Related papers: BTR: Binary Token Representations for Efficient Re…

LFTR: Learning-Free Token Reduction for Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) have demonstrated exceptional success in various multimodal tasks, yet their deployment is frequently limited by substantial computational demands and prolonged inference times. Given that the vision…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Zihui Zhao, Yingxin Li, Yang Li

TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference

Existing pre-trained language models (PLMs) are often computationally expensive in inference, making them impractical in various resource-limited real-world applications. To address this issue, we propose a dynamic token reduction approach…

Computation and Language · Computer Science 2021-05-26 Deming Ye, Yankai Lin, Yufei Huang, Maosong Sun

Improving language models by retrieving from trillions of tokens

We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO)…

Computation and Language · Computer Science 2022-02-09 Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, Laurent Sifre

Thinking Augmented Pre-training

This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an…

Computation and Language · Computer Science 2025-10-20 Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei

TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction

Since ChatGPT released its API for public use, the number of applications built on top of commercial large language models (LLMs) has increased exponentially. One popular usage of such models is leveraging their in-context learning ability and…

Computation and Language · Computer Science 2023-10-26 Junyi Liu, Liangzhi Li, Tong Xiang, Bowen Wang, Yiming Qian

Enhancing Knowledge Retrieval with In-Context Learning and Semantic Search through Generative AI

Retrieving and extracting knowledge from extensive research documents and large databases presents significant challenges for researchers, students, and professionals in today's information-rich era. Existing retrieval systems, which rely…

Information Retrieval · Computer Science 2025-02-06 Mohammed-Khalil Ghali, Abdelrahman Farrag, Daehan Won, Yu Jin

BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine

Large Language Models (LLMs) have swiftly emerged as vital resources for different applications in the biomedical and healthcare domains; however, these models encounter issues such as generating inaccurate information or hallucinations.…

Computation and Language · Computer Science 2024-05-06 Mingchen Li, Halil Kilicoglu, Hua Xu, Rui Zhang

Near-lossless Binarization of Word Embeddings

Word embeddings are commonly used as a starting point in many NLP models to achieve state-of-the-art performances. However, with a large vocabulary and many dimensions, these floating-point representations are expensive both in terms of…

Computation and Language · Computer Science 2020-01-23 Julien Tissier, Christophe Gravier, Amaury Habrard

AdapLeR: Speeding up Inference by Adaptive Length Reduction

Pre-trained language models have shown stellar performance in various downstream tasks. But, this usually comes at the cost of high latency and computation, hindering their usage in resource-limited settings. In this work, we propose a…

Computation and Language · Computer Science 2022-03-18 Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar

MEMORY-VQ: Compression for Tractable Internet-Scale Memory

Retrieval augmentation is a powerful but expensive method to make language models more knowledgeable about the world. Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up…

Computation and Language · Computer Science 2023-08-30 Yury Zemlyanskiy, Michiel de Jong, Luke Vilnis, Santiago Ontañón, William W. Cohen, Sumit Sanghai, Joshua Ainslie

Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism

Transformers have a quadratic scaling of computational complexity with input size, which limits the input context window size of large language models (LLMs) in both training and inference. Meanwhile, retrieval-augmented generation (RAG)…

Computation and Language · Computer Science 2024-10-18 Yimin Tang, Yurong Xu, Ning Yan, Masood Mortazavi

Scaling Transformer to 1M tokens and beyond with RMT

A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer…

Computation and Language · Computer Science 2024-02-07 Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail S. Burtsev

On Retrieval Augmentation and the Limitations of Language Model Training

Augmenting a language model (LM) with $k$-nearest neighbors ($k$NN) retrieval on its training data alone can decrease its perplexity, though the underlying reasons for this remain elusive. In this work, we rule out one previously posited…

Computation and Language · Computer Science 2024-04-03 Ting-Rui Chiang, Xinyan Velocity Yu, Joshua Robinson, Ollie Liu, Isabelle Lee, Dani Yogatama

Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval

Retrieval-augmented language models (LMs) have received much attention recently. However, typically the retriever is not trained jointly as a native component of the LM, but added post-hoc to an already-pretrained LM, which limits the…

Computation and Language · Computer Science 2024-07-23 Ohad Rubin, Jonathan Berant

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

Massive parameters of LLMs have made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures…

Computation and Language · Computer Science 2025-05-27 Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu

Efficient numeracy in language models through single-token number embeddings

To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or…

Machine Learning · Computer Science 2026-05-21 Linus Kreitner, Paul Hager, Jonathan Mengedoht, Georgios Kaissis, Daniel Rueckert, Martin J. Menten

TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation

The high inference cost of Large Language Models (LLMs) poses challenges, especially for tasks requiring lengthy outputs. However, natural language often contains redundancy, which presents an opportunity for optimization. We have observed…

Computation and Language · Computer Science 2025-11-25 Alfredo Garrachón Ruiz, Tomás de la Rosa, Daniel Borrajo

TLDR: Token-Level Detective Reward Model for Large Vision Language Models

Although reward models have been successful in improving multimodal large language models, the reward models themselves remain brutal and contain minimal information. Notably, existing reward models only mimic human annotations by assigning…

Machine Learning · Computer Science 2025-02-26 Deqing Fu, Tong Xiao, Rui Wang, Wang Zhu, Pengchuan Zhang, Guan Pang, Robin Jia, Lawrence Chen

One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models

Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs) for generating more factual, accurate, and up-to-date content. Existing methods either optimize prompts to guide LLMs in leveraging retrieved…

Computation and Language · Computer Science 2024-12-12 Yutao Zhu, Zhaoheng Huang, Zhicheng Dou, Ji-Rong Wen

An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks

Despite their tremendous success and versatility, Deep Neural Networks (DNNs) such as Large Language Models (LLMs) suffer from inference inefficiency and rely on advanced computational infrastructure. To address these challenges and make…

Machine Learning · Computer Science 2025-05-05 Mohsen Dehghankar, Mahdi Erfanian, Abolfazl Asudeh