Related papers: LONGCODEU: Benchmarking Long-Context Language Mode…

LongCodeBench: Evaluating Coding LLMs at 1M Context Windows

Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years. The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks -- not…

Computation and Language · Computer Science · 2025-10-23 · Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, Tatsunori Hashimoto

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs

Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspect of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a…

Software Engineering · Computer Science · 2025-04-10 · Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, Nghi D. Q. Bui

MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models

Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often result in prohibitively…

Computation and Language · Computer Science · 2025-07-31 · Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports,…

Computation and Language · Computer Science · 2024-06-21 · Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive…

Software Engineering · Computer Science · 2025-09-12 · Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang

LooGLE: Can Long-Context Language Models Understand Long Contexts?

Large language models (LLMs), despite their impressive performance in various language tasks, are typically limited to processing texts within context-window size. This limitation has spurred significant research efforts to enhance LLMs'…

Computation and Language · Computer Science · 2024-09-09 · Jiaqi Li, Mengmeng Wang, Zilong Zheng, Muhan Zhang

LongIns: A Challenging Long-context Instruction-based Exam for LLMs

The long-context capabilities of large language models (LLMs) have been a hot topic in recent years. To evaluate the performance of LLMs in different scenarios, various assessment benchmarks have emerged. However, as most of these…

Computation and Language · Computer Science · 2025-08-14 · Shawn Gavin, Tuney Zheng, Jiaheng Liu, Quehry Que, Noah Wang, Jian Yang, Chenchen Zhang, Wenhao Huang, Ge Zhang

LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?

Large language models (LLMs) have recently been equipped with increasingly extended context windows, yet their long context understanding capabilities over long dependency tasks remain fundamentally limited and underexplored. This gap is…

Computation and Language · Computer Science · 2025-10-28 · Ziyuan He, Yuxuan Wang, Jiaqi Li, Kexin Liang, Muhan Zhang

Visual Context Window Extension: A New Perspective for Long Video Understanding

Large Multimodal Models (LMMs) have demonstrated impressive performance in short video understanding tasks but face great challenges when applied to long video understanding. In contrast, Large Language Models (LLMs) exhibit outstanding…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-03 · Hongchen Wei, Zhenzhong Chen

Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models

Long-context language models (LCLMs) have exhibited impressive capabilities in long-context understanding tasks. Among these, long-context referencing -- a crucial task that requires LCLMs to attribute items of interest to specific parts of…

Computation and Language · Computer Science · 2025-08-05 · Junjie Wu, Gefei Gu, Yanan Zheng, Dit-Yan Yeung, Arman Cohan

LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding

Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to…

Computation and Language · Computer Science · 2025-10-21 · Sheikh Jubair, Arwa Omayrah, Amal Alshammari, Alhanoof Althnian, Abdulhamed Alothaimen, Norah A. Alzahrani, Shahad D. Alzaidi, Nora Al-Twairesh, Abdulmohsen Al-Thubaity

Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice

Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in its automation. However, existing benchmarks for LLM-based code review face three major limitations. Lack…

Software Engineering · Computer Science · 2026-01-01 · Ruida Hu, Xinchen Wang, Xin-Cheng Wen, Zhao Zhang, Bo Jiang, Pengfei Gao, Chao Peng, Cuiyun Gao

Can Large Language Models Understand Context?

Understanding context is key to understanding human language, an ability which Large Language Models (LLMs) have been increasingly seen to demonstrate to an impressive extent. However, though the evaluation of LLMs encompasses various…

Computation and Language · Computer Science · 2024-02-02 · Yilun Zhu, Joel Ruben Antony Moniz, Shruti Bhargava, Jiarui Lu, Dhivya Piraviperumal, Site Li, Yuan Zhang, Hong Yu, Bo-Hsiang Tseng

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find…

Computation and Language · Computer Science · 2025-05-27 · Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

LongFuncEval: Measuring the effectiveness of long context models for function calling

Multiple recent studies have documented large language models' (LLMs) performance on calling external tools/functions. Others focused on LLMs' abilities to handle longer context lengths. At the intersection of these areas lies another…

Software Engineering · Computer Science · 2025-05-19 · Kiran Kate, Tejaswini Pedapati, Kinjal Basu, Yara Rizk, Vijil Chenthamarakshan, Subhajit Chaudhury, Mayank Agarwal, Ibrahim Abdelaziz

MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models

The rapid advancement of large vision language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee the effective utilization of the context, posing a critical…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-16 · Keyan Zhou, Zecheng Tang, Lingfeng Ming, Guanghao Zhou, Qiguang Chen, Dan Qiao, Zheming Yang, Libo Qin, Minghui Qiu, Juntao Li, Min Zhang

LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion

Large language models (LLMs) have demonstrated remarkable progress in understanding long-context inputs. However, benchmarks for evaluating the long-context reasoning abilities of LLMs fall behind the pace. Existing benchmarks often focus…

Computation and Language · Computer Science · 2025-11-19 · Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen

Long-context LLMs Struggle with Long In-context Learning

Large Language Models (LLMs) have made significant strides in handling long sequences. Some models like Gemini could even be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to…

Computation and Language · Computer Science · 2024-06-13 · Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-13 · Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun

SELU: A Software Engineering Language Understanding Benchmark

Large Language Models (LLMs) have demonstrated remarkable capabilities in code understanding and generation. However, their effectiveness on non-code Software Engineering (SE) tasks remains underexplored. We present 'Software Engineering…

Software Engineering · Computer Science · 2026-02-12 · Fabian C. Peña, Steffen Herbold