Related papers: CL-bench: A Benchmark for Context Learning

CL-bench Life: Can Language Models Learn from Real-Life Context?

Today's AI assistants such as OpenClaw are designed to handle context effectively, making context learning an increasingly important capability for models. As these systems move beyond professional settings into everyday life, the nature of…

Computation and Language · Computer Science 2026-05-01 Shihan Dou, Yujiong Shen, Chenhao Huang, Junjie Ye, Jiayi Chen, Junzhe Wang, Qianyu He, Shichun Liu, Changze Lv, Jiahang Lin, Jiazheng Zhang, Ming Zhang, Shaofan Liu, Tao Ji, Zhangyue Yin, Cheng Zhang, Huaibing Xie, Jianglu Hu, Jingcheng Deng, Lincheng Li, Minda Hu, Shaolei Wang, Syrus Zhao, Weichao Wang, Yan Lei, Yang Liu, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Ziliang Zhao, Pluto Zhou, Tao Gui, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, Shunyu Yao

Context-CoT: Enhancing Context Learning via High-Quality Reasoning Synthesis

While LLMs excel at reasoning over prompts using static pretrained knowledge, they struggle significantly with context learning: the ability to dynamically extract, internalize, and apply new knowledge from complex, task-specific contexts.…

Artificial Intelligence · Computer Science 2026-05-26 Hongbo Jin, Mingnan Zhu, Jingqi Tian, Xu Jiang, Zhongjing Du, Haoran Tang, Siyi Xie, Qiaoman Zhang, Jiayu Ding

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports,…

Computation and Language · Computer Science 2024-06-21 Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

Can Large Language Models Understand Context?

Understanding context is key to understanding human language, an ability which Large Language Models (LLMs) have been increasingly seen to demonstrate to an impressive extent. However, though the evaluation of LLMs encompasses various…

Computation and Language · Computer Science 2024-02-02 Yilun Zhu, Joel Ruben Antony Moniz, Shruti Bhargava, Jiarui Lu, Dhivya Piraviperumal, Site Li, Yuan Zhang, Hong Yu, Bo-Hsiang Tseng

MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence

We introduce MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. Unlike text-only…

Computer Vision and Pattern Recognition · Computer Science 2026-05-14 Yifan Chen, Fei Yin, Qingyan Bai, Zicheng Lin, Yujiu Yang

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes. Various efforts have been proposed to expand the context window to accommodate even up to…

Computation and Language · Computer Science 2024-04-09 Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Piji Li

LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark

The rapid expansion of context length in large language models (LLMs) has outpaced existing evaluation benchmarks. Current long-context benchmarks often trade off scalability and realism: synthetic tasks underrepresent real-world…

Computation and Language · Computer Science 2026-01-07 Ziyang Chen, Xing Wu, Junlong Jia, Chaochen Gao, Qi Fu, Debing Zhang, Songlin Hu

ContextBench: A Benchmark for Context Retrieval in Coding Agents

LLM-based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during…

Machine Learning · Computer Science 2026-02-12 Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Federica Sarro, Zhaoyang Chu, He Ye

LawBench: Benchmarking Legal Knowledge of Large Language Models

Large language models (LLMs) have demonstrated strong capabilities in various aspects. However, when applying them to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they…

Computation and Language · Computer Science 2023-09-29 Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, Jidong Ge

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it. Existing frontier…

Artificial Intelligence · Computer Science 2026-04-20 Ankit Maloo

RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity

People often encounter role conflicts -- social dilemmas where the expectations of multiple roles clash and cannot be simultaneously fulfilled. As large language models (LLMs) increasingly navigate these social dynamics, a critical research…

Computation and Language · Computer Science 2026-04-20 Jisu Shin, Hoyun Song, Juhyun Oh, Changgeon Ko, Eunsu Kim, Chani Jung, Alice Oh

CFBench: A Comprehensive Constraints-Following Benchmark for LLMs

The adeptness of Large Language Models (LLMs) in comprehending and following natural language instructions is critical for their deployment in sophisticated real-world applications. Existing evaluations mainly focus on fragmented…

Computation and Language · Computer Science 2025-05-07 Tao Zhang, Chenglin Zhu, Yanjun Shen, Wenjing Luo, Yan Zhang, Hao Liang, Tao Zhang, Fan Yang, Mingan Lin, Yujing Qiao, Weipeng Chen, Bin Cui, Wentao Zhang, Zenan Zhou

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging…

Computation and Language · Computer Science 2025-01-06 Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

Large language models (LLMs) have demonstrated significant potential in advancing various fields of research and society. However, the current community of LLMs overly focuses on benchmarks for analyzing specific foundational skills (e.g.…

Computation and Language · Computer Science 2025-03-03 Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, Weiran Xu

Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings

The large language model (LLM)-as-judge paradigm has been used to meet the demand for a cheap, reliable, and fast evaluation of model outputs during AI system development and post-deployment monitoring. While judge models -- LLMs finetuned…

Computation and Language · Computer Science 2025-03-21 Austin Xu, Srijan Bansal, Yifei Ming, Semih Yavuz, Shafiq Joty

CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere…

Computation and Language · Computer Science 2026-03-10 Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng

Long-context LLMs Struggle with Long In-context Learning

Large Language Models (LLMs) have made significant strides in handling long sequences. Some models like Gemini could even be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to…

Computation and Language · Computer Science 2024-06-13 Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find…

Computation and Language · Computer Science 2025-05-27 Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

\$OneMillion-Bench: How Far are Language Agents from Human Experts?

As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world…

Machine Learning · Computer Science 2026-03-10 Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen, Tiliang Duan, Jiayun Dong, Xiaobo Hu, Zixia Jia, Yang Liu, Tao Peng, Yixin Ren, Ran Tian, Zaiyuan Wang, Yanglihong Xiao, Gang Yao, Lingyue Yin, Ge Zhang, Chun Zhang, Jianpeng Jiao, Zilong Zheng, Yuan Gong

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks. While emerging benchmarks have been proposed to evaluate LLMs in various domains such as mathematics and computer…

Artificial Intelligence · Computer Science 2024-10-28 Junnan Dong, Zijin Hong, Yuanchen Bei, Feiran Huang, Xinrun Wang, Xiao Huang