Related papers: Evaluating Long Range Dependency Handling in Code …

Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate…

Computation and Language · Computer Science 2025-04-24 Jonathan Roberts, Kai Han, Samuel Albanie

LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?

Large language models (LLMs) are equipped with increasingly extended context windows recently, yet their long context understanding capabilities over long dependency tasks remain fundamentally limited and underexplored. This gap is…

Computation and Language · Computer Science 2025-10-28 Ziyuan He, Yuxuan Wang, Jiaqi Li, Kexin Liang, Muhan Zhang

LLM In-Context Recall is Prompt Dependent

The proliferation of Large Language Models (LLMs) highlights the critical importance of conducting thorough evaluations to discern their comparative advantages, limitations, and optimal use cases. Particularly important is assessing their…

Computation and Language · Computer Science 2024-04-16 Daniel Machlab, Rick Battle

LongFuncEval: Measuring the effectiveness of long context models for function calling

Multiple recent studies have documented large language models' (LLMs) performance on calling external tools/functions. Others focused on LLMs' abilities to handle longer context lengths. At the intersection of these areas lies another…

Software Engineering · Computer Science 2025-05-19 Kiran Kate, Tejaswini Pedapati, Kinjal Basu, Yara Rizk, Vijil Chenthamarakshan, Subhajit Chaudhury, Mayank Agarwal, Ibrahim Abdelaziz

Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models

While recent large language models (LLMs) demonstrate remarkable abilities in responding to queries in diverse languages, their ability to handle long multilingual contexts is unexplored. As such, a systematic evaluation of the long-context…

Computation and Language · Computer Science 2024-08-20 Amey Hengle, Prasoon Bajpai, Soham Dan, Tanmoy Chakraborty

When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents

Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities.…

Machine Learning · Computer Science 2025-12-03 Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz

RepoQA: Evaluating Long Context Code Understanding

Recent advances have been improving the context windows of Large Language Models (LLMs). To quantify the real long-context capabilities of LLMs, evaluators such as the popular Needle in a Haystack have been developed to test LLMs over a…

Software Engineering · Computer Science 2024-06-11 Jiawei Liu, Jia Le Tian, Vijay Daita, Yuxiang Wei, Yifeng Ding, Yuhan Katherine Wang, Jun Yang, Lingming Zhang

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes. Various efforts have been proposed to expand the context window to accommodate even up to…

Computation and Language · Computer Science 2024-04-09 Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Piji Li

Efficient Solutions For An Intriguing Failure of LLMs: Long Context Window Does Not Mean LLMs Can Analyze Long Sequences Flawlessly

Large Language Models (LLMs) have demonstrated remarkable capabilities in comprehending and analyzing lengthy sequential inputs, owing to their extensive context windows that allow processing millions of tokens in a single forward pass.…

Computation and Language · Computer Science 2024-12-23 Peyman Hosseini, Ignacio Castro, Iacopo Ghinassi, Matthew Purver

LooGLE: Can Long-Context Language Models Understand Long Contexts?

Large language models (LLMs), despite their impressive performance in various language tasks, are typically limited to processing texts within context-window size. This limitation has spurred significant research efforts to enhance LLMs'…

Computation and Language · Computer Science 2024-09-09 Jiaqi Li, Mengmeng Wang, Zilong Zheng, Muhan Zhang

LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation

Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We…

Computation and Language · Computer Science 2025-09-30 Xi Ye, Fangcong Yin, Yinghui He, Joie Zhang, Howard Yen, Tianyu Gao, Greg Durrett, Danqi Chen

Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models

Long-context modeling capabilities are important for large language models (LLMs) in various applications. However, directly training LLMs with long context windows is insufficient to enhance this capability since some training samples do…

Computation and Language · Computer Science 2024-05-29 Longze Chen, Ziqiang Liu, Wanwei He, Yunshui Li, Run Luo, Min Yang

Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks

Existing multilingual long-context benchmarks, often based on the popular needle-in-a-haystack test, primarily evaluate a model's ability to locate specific information buried within irrelevant texts. However, such a retrieval-centric…

Computation and Language · Computer Science 2025-04-18 Amey Hengle, Prasoon Bajpai, Soham Dan, Tanmoy Chakraborty

Systematic Evaluation of Long-Context LLMs on Financial Concepts

Long-context large language models (LC LLMs) promise to increase reliability of LLMs in real-world tasks requiring processing and understanding of long input documents. However, this ability of LC LLMs to reliably utilize their growing…

Computation and Language · Computer Science 2024-12-23 Lavanya Gupta, Saket Sharma, Yiyun Zhao

Long Context RAG Performance of Large Language Models

Retrieval Augmented Generation (RAG) has emerged as a crucial technique for enhancing the accuracy of Large Language Models (LLMs) by incorporating external information. With the advent of LLMs that support increasingly longer context…

Machine Learning · Computer Science 2024-11-07 Quinn Leng, Jacob Portes, Sam Havens, Matei Zaharia, Michael Carbin

LongGenBench: Long-context Generation Benchmark

Current long-context benchmarks primarily focus on retrieval-based tests, requiring Large Language Models (LLMs) to locate specific information within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context…

Computation and Language · Computer Science 2024-10-25 Xiang Liu, Peijie Dong, Xuming Hu, Xiaowen Chu

Compositional Hardness of Code in Large Language Models -- A Probabilistic Perspective

A common practice in large language model (LLM) usage for complex analytical tasks such as code generation, is to sample a solution for the entire task within the model's context window. Previous works have shown that subtask decomposition…

Artificial Intelligence · Computer Science 2025-02-03 Yotam Wolf, Binyamin Rothberg, Dorin Shteyman, Amnon Shashua

An Effective Framework to Help Large Language Models Handle Numeric-involved Long-context Tasks

Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long texts and have almost perfect performance in traditional retrieval tasks. However, their performance significantly degrades when it comes to numerical…

Computation and Language · Computer Science 2024-12-05 Yijiong Yu

Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering

Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context…

Software Engineering · Computer Science 2026-02-20 Kishan Maharaj, Nandakishore Menon, Ashita Saxena, Srikanth Tamilselvam

Exploring LLM Reasoning Through Controlled Prompt Variations

This study investigates the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematically introduced input perturbations. Using the GSM8K dataset as a controlled testbed, we evaluate how…

Artificial Intelligence · Computer Science 2025-04-04 Giannis Chatziveroglou, Richard Yun, Maura Kelleher