Related papers: Vectorizing the Trie: Efficient Constrained Decodi…

A Static Pruning Study on Sparse Neural Retrievers

Sparse neural retrievers, such as DeepImpact, uniCOIL and SPLADE, have been introduced recently as an efficient and effective way to perform retrieval with inverted indexes. They aim to learn term importance and, in some cases, document…

Information Retrieval · Computer Science 2023-04-26 Carlos Lassance, Simon Lupart, Hervé Dejean, Stéphane Clinchant, Nicola Tonellotto

Leveraging Recurrent Patterns in Graph Accelerators

Graph accelerators have emerged as a promising solution for processing large-scale sparse graphs, leveraging the in-situ computation of ReRAM-based crossbars to maximize computational efficiency. However, existing designs suffer from…

Hardware Architecture · Computer Science 2025-12-02 Masoud Rahimi, Sébastien Le Beux

Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding

The past few years have witnessed a growing interest in LLM-based recommender systems (RSs), although their industrial deployment remains in a preliminary stage. Most existing deployments leverage LLMs offline as feature enhancers,…

Information Retrieval · Computer Science 2025-04-30 Yunjia Xi, Hangyu Wang, Bo Chen, Jianghao Lin, Menghui Zhu, Weiwen Liu, Ruiming Tang, Zhewei Wei, Weinan Zhang, Yong Yu

Text2Tracks: Prompt-based Music Recommendation via Generative Retrieval

In recent years, Large Language Models (LLMs) have enabled users to provide highly specific music recommendation requests using natural language prompts (e.g. "Can you recommend some old classics for slow dancing?"). In this setup, the…

Information Retrieval · Computer Science 2025-04-03 Enrico Palumbo, Gustavo Penha, Andreas Damianou, José Luis Redondo García, Timothy Christopher Heath, Alice Wang, Hugues Bouchard, Mounia Lalmas

Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models

In real-world applications of large language models, outputs are often required to be confined: selecting items from predefined product or document sets, generating phrases that comply with safety standards, or conforming to specialized…

Computation and Language · Computer Science 2025-04-15 Haotian Ye, Himanshu Jain, Chong You, Ananda Theertha Suresh, Haowei Lin, James Zou, Felix Yu

Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection

Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility by incorporating external contexts. However, the input length grows linearly in the number of retrieved documents, causing a dramatic…

Computation and Language · Computer Science 2024-05-28 Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, Jindong Chen

Multilingual Generative Retrieval via Cross-lingual Semantic Compression

Generative Information Retrieval is an emerging retrieval paradigm that exhibits remarkable performance in monolingual scenarios. However, applying these methods to multilingual retrieval still encounters two primary challenges,…

Computation and Language · Computer Science 2025-10-10 Yuxin Huang, Simeng Wu, Ran Song, Yan Xiang, Yantuan Xian, Shengxiang Gao, Zhengtao Yu

Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation

Generative Recommendation (GR) has recently transitioned from atomic item-indexing to Semantic ID (SID)-based frameworks to capture intrinsic item relationships and enhance generalization. However, the adoption of high-granularity SIDs…

Information Retrieval · Computer Science 2026-04-08 Tianyu Zhan, Kairui Fu, Chengfei Lv, Zheqi Lv, Shengyu Zhang

SCRec: A Scalable Computational Storage System with Statistical Sharding and Tensor-train Decomposition for Recommendation Models

Deep Learning Recommendation Models (DLRMs) play a crucial role in delivering personalized content across web applications such as social networking and video streaming. However, with improvements in performance, the parameter size of DLRMs…

Hardware Architecture · Computer Science 2025-04-02 Jinho Yang, Ji-Hoon Kim, Joo-Young Kim

FAST: Flexible and Adaptive Semantic Transmission for Resource-constrained Multi-user Generative Semantic Communication

The rapid advancement of generative artificial intelligence has spurred innovative approaches to semantic communication, giving rise to a new paradigm known as generative semantic communication (GSC). The integration of flexible cross-modal…

Signal Processing · Electrical Eng. & Systems 2025-11-03 Yiru Wang, Wanting Yang, Fangli Mou, Zehui Xiong, Zide Fan, Shiwen Mao, Tony Q. S. Quek

A Hardware-Oriented and Memory-Efficient Method for CTC Decoding

The Connectionist Temporal Classification (CTC) has achieved great success in sequence to sequence analysis tasks such as automatic speech recognition (ASR) and scene text recognition (STR). These applications can use the CTC objective…

Signal Processing · Electrical Eng. & Systems 2019-09-09 Siyuan Lu, Jinming Lu, Jun Lin, Zhongfeng Wang

DReSD: Dense Retrieval for Speculative Decoding

Speculative decoding (SD) accelerates Large Language Model (LLM) generation by using an efficient draft model to propose the next few tokens, which are verified by the LLM in a single forward call, reducing latency while preserving its…

Computation and Language · Computer Science 2025-05-30 Milan Gritta, Huiyin Xue, Gerasimos Lampouras

Lost in Decoding? Reproducing and Stress-Testing the Look-Ahead Prior in Generative Retrieval

Generative retrieval (GR) ranks documents by autoregressively generating document identifiers. Because many GR methods rely on trie-constrained beam search, they are vulnerable to early pruning of relevant prefixes under finite-beam…

Information Retrieval · Computer Science 2026-05-26 Kidist Amde Mekonnen, Yongkang Li, Yubao Tang, Simon Lupart, Maarten de Rijke

Accelerating Streaming Video Large Language Models via Hierarchical Token Compression

Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of…

Computer Vision and Pattern Recognition · Computer Science 2026-02-12 Yiyu Wang, Xuyang Liu, Xiyan Gui, Xinying Lin, Boxue Yang, Chenfei Liao, Tailai Chen, Linfeng Zhang

Adaptive Retrieval helps Reasoning in LLMs -- but mostly if it's not used

Large Language Models (LLMs) often falter in complex reasoning tasks due to their static, parametric knowledge, leading to hallucinations and poor performance in specialized domains like mathematics. This work explores a fundamental…

Machine Learning · Computer Science 2026-02-10 Srijan Shakya, Anamaria-Roberta Hartl, Sepp Hochreiter, Korbinian Pöppel

REST: Retrieval-Based Speculative Decoding

We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often…

Computation and Language · Computer Science 2024-04-05 Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He

T-Retriever: Tree-based Hierarchical Retrieval Augmented Generation for Textual Graphs

Retrieval-Augmented Generation (RAG) has significantly enhanced Large Language Models' ability to access external knowledge, yet current graph-based RAG approaches face two critical limitations in managing hierarchical information: they…

Artificial Intelligence · Computer Science 2026-01-09 Chunyu Wei, Huaiyu Qin, Siyuan He, Yunhai Wang, Yueguo Chen

STAR: Semantic-Tuned and Tail-Adaptive Retriever for Graph-Augmented Generation

To augment Large Language Models (LLMs) for multi-hop question answering, a mainstream solution within Graph Retrieval Augmented Generation (GraphRAG) leverages lightweight retrievers to efficiently extract information from a given…

Information Retrieval · Computer Science 2026-05-20 Shuai Li, Chen Huang, Duanyu Feng, Wenqiang Lei, See-Kiong Ng

Efficient Inference for Large Language Model-based Generative Recommendation

Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly, particularly due to excessive inference latency caused by autoregressive decoding. For lossless LLM decoding…

Information Retrieval · Computer Science 2025-02-27 Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua

Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control

Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors that can be indexed and retrieved efficiently with an inverted index. We explore the application of LSR to the…

Information Retrieval · Computer Science 2024-02-28 Thong Nguyen, Mariya Hendriksen, Andrew Yates, Maarten de Rijke