Related papers: Task Cascades for Efficient Unstructured Data Proc…

Online Cascade Learning for Efficient Inference over Streams

Large Language Models (LLMs) have a natural role in answering complex queries about data streams, but the high computational cost of LLM inference makes them infeasible in many such tasks. We propose online cascade learning, the first…

Machine Learning · Computer Science 2024-06-19 Lunyiu Nie, Zhimin Ding, Erdong Hu, Christopher Jermaine, Swarat Chaudhuri

QUEST: Query Optimization in Unstructured Document Analysis

Most recently, researchers have started building large language model (LLM) powered data systems that allow users to analyze unstructured text documents as if working with a database, because LLMs are very effective in extracting attributes…

Databases · Computer Science 2025-07-14 Zhaoze Sun, Qiyan Deng, Chengliang Chai, Kaisen Jin, Xinyu Guo, Han Han, Ye Yuan, Guoren Wang, Lei Cao

A Deep Cascade Model for Multi-Document Reading Comprehension

A fundamental trade-off between effectiveness and efficiency needs to be balanced when designing an online question answering system. Effectiveness comes from sophisticated functions such as extractive machine reading comprehension (MRC),…

Computation and Language · Computer Science 2019-08-14 Ming Yan, Jiangnan Xia, Chen Wu, Bin Bi, Zhongzhou Zhao, Ji Zhang, Luo Si, Rui Wang, Wei Wang, Haiqing Chen

When Efficiency Backfires: Cascading LLMs Trigger Cascade Failure under Adversarial Attack

Large Language Model (LLM) cascade systems are designed to balance efficiency and performance by processing queries with lightweight models while selectively escalating complex cases to more powerful ones. Such systems seek to reduce…

Cryptography and Security · Computer Science 2026-05-19 Zehan Sun, Dingfan Chen, Songze Li

Cost-Saving LLM Cascades with Early Abstention

LLM cascades deploy small LLMs to answer most queries, limiting the use of large and expensive LLMs to difficult queries. This approach can significantly reduce costs without impacting performance. However, risk-sensitive domains such as…

Artificial Intelligence · Computer Science 2025-04-01 Michael J. Zellinger, Rex Liu, Matt Thomson

From Deferral to Learning: Online In-Context Knowledge Distillation for LLM Cascades

Standard LLM cascades improve efficiency by deferring difficult queries from weak to strong models. However, these systems are typically static: when faced with repeated or semantically similar queries, they redundantly consult the…

Artificial Intelligence · Computer Science 2026-02-04 Yu Wu, Shuo Wu, Ye Tao, Yansong Li, Anand D. Sarwate

Cascadia: An Efficient Cascade Serving System for Large Language Models

Recent advances in large language models (LLMs) have intensified the need to deliver both rapid responses and high-quality outputs. More powerful models yield better results but incur higher inference latency, whereas smaller models are…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-01 Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Jintao Zhang, Nicholas D. Lane, Binhang Yuan

Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning

Large language models (LLMs) such as GPT-4 have exhibited remarkable performance in a variety of tasks, but this strong performance often comes with the high expense of using paid API services. In this paper, we are motivated to study…

Computation and Language · Computer Science 2024-02-12 Murong Yue, Jie Zhao, Min Zhang, Liang Du, Ziyu Yao

Cascade-Aware Training of Language Models

Reducing serving cost and latency is a fundamental concern for the deployment of language models (LMs) in business applications. To address this, cascades of LMs offer an effective solution that conditionally employs smaller models for…

Computation and Language · Computer Science 2024-06-04 Congchao Wang, Sean Augenstein, Keith Rush, Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Aditya Krishna Menon, Alec Go

Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing

The rapid advancement of large language models (LLMs) has significantly improved code completion tasks, yet the trade-off between accuracy and computational cost remains a critical challenge. While using larger models and incorporating…

Software Engineering · Computer Science 2025-02-17 Boyuan Chen, Mingzhi Zhu, Brendan Dolan-Gavitt, Muhammad Shafique, Siddharth Garg

Language Model Cascades: Token-level uncertainty and beyond

Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks, but at the expense of increased inference costs. Cascading offers a simple strategy to achieve more favorable cost-quality…

Computation and Language · Computer Science 2024-04-17 Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar

CascadeDebate: Multi-Agent Deliberation for Cost-Aware LLM Cascades

Cascaded LLM systems coordinate models of varying sizes with human experts to balance accuracy, cost, and abstention under uncertainty. However, single-model tiers at each stage often struggle with ambiguous queries, triggering premature…

Computation and Language · Computer Science 2026-04-15 Raeyoung Chang, Dongwook Kwon, Jisoo Lee, Nikhil Verma

Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees

Large Language Models (LLMs) are being increasingly used as a building block in data systems to process large text datasets. To do so, LLM model providers offer multiple LLMs with different sizes, spanning various cost-quality trade-offs…

Databases · Computer Science 2025-09-15 Sepanta Zeighami, Shreya Shankar, Aditya Parameswaran

Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning

Recent successes in natural language processing have led to the proliferation of large language models (LLMs) by multiple providers. Each LLM offering has different inference accuracy, monetary cost, and latency, and their accuracy further…

Computation and Language · Computer Science 2024-11-21 Xuechen Zhang, Zijian Huang, Ege Onur Taga, Carlee Joe-Wong, Samet Oymak, Jiasi Chen

Towards Optimizing the Costs of LLM Usage

Generative AI and LLMs in particular are heavily used nowadays for various document processing tasks such as question answering and summarization. However, different LLMs come with different capabilities for different tasks as well as with…

Computation and Language · Computer Science 2024-02-06 Shivanshu Shekhar, Tanishq Dubey, Koyel Mukherjee, Apoorv Saxena, Atharv Tyagi, Nishanth Kotla

A Unified Approach to Routing and Cascading for LLMs

The availability of a wide range of large language models (LLMs) embedded in various agentic systems has significantly increased the potential of model selection strategies to improve the cost-performance tradeoff. Existing strategies…

Computation and Language · Computer Science 2025-05-23 Jasper Dekoninck, Maximilian Baader, Martin Vechev

ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demands semantic understanding, beyond traditional value-based predicates. Given enormous…

Databases · Computer Science 2026-05-22 Hengrui Zhang, Yulong Hui, Yihao Liu, Huanchen Zhang

Tree-Planner: Efficient Close-loop Task Planning with Large Language Models

This paper studies close-loop task planning, which refers to the process of generating a sequence of skills (a plan) to accomplish a specific goal while adapting the plan based on real-time observations. Recently, prompting Large Language…

Computation and Language · Computer Science 2024-07-25 Mengkang Hu, Yao Mu, Xinmiao Yu, Mingyu Ding, Shiguang Wu, Wenqi Shao, Qiguang Chen, Bin Wang, Yu Qiao, Ping Luo

KiC: Keyword-inspired Cascade for Cost-Efficient Text Generation with LLMs

Large language models (LLMs) have demonstrated state-of-the-art performance across a wide range of natural language processing tasks. However, high-performing models are typically accessible only via APIs, incurring substantial inference…

Computation and Language · Computer Science 2025-07-21 Woo-Chan Kim, Ji-Hoon Park, Seong-Whan Lee

Model Cascading: Towards Jointly Improving Efficiency and Accuracy of NLP Systems

Do all instances need inference through the big models for a correct prediction? Perhaps not; some instances are easy and can be answered correctly by even small capacity models. This provides opportunities for improving the computational…

Computation and Language · Computer Science 2022-10-12 Neeraj Varshney, Chitta Baral