Related papers: CL-bench: A Benchmark for Context Learning

200 papers

Today's AI assistants such as OpenClaw are designed to handle context effectively, making context learning an increasingly important capability for models. As these systems move beyond professional settings into everyday life, the nature of…

While LLMs excel at reasoning over prompts using static pretrained knowledge, they struggle significantly with context learning: the ability to dynamically extract, internalize, and apply new knowledge from complex, task-specific contexts.…

Artificial Intelligence · Computer Science 2026-05-26 Hongbo Jin , Mingnan Zhu , Jingqi Tian , Xu Jiang , Zhongjing Du , Haoran Tang , Siyi Xie , Qiaoman Zhang , Jiayu Ding

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports,…

Computation and Language · Computer Science 2024-06-21 Yushi Bai , Xin Lv , Jiajie Zhang , Hongchang Lyu , Jiankai Tang , Zhidian Huang , Zhengxiao Du , Xiao Liu , Aohan Zeng , Lei Hou , Yuxiao Dong , Jie Tang , Juanzi Li

Understanding context is key to understanding human language, an ability which Large Language Models (LLMs) have been increasingly seen to demonstrate to an impressive extent. However, though the evaluation of LLMs encompasses various…

Computation and Language · Computer Science 2024-02-02 Yilun Zhu , Joel Ruben Antony Moniz , Shruti Bhargava , Jiarui Lu , Dhivya Piraviperumal , Site Li , Yuan Zhang , Hong Yu , Bo-Hsiang Tseng

We introduce MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. Unlike text-only…

Computer Vision and Pattern Recognition · Computer Science 2026-05-14 Yifan Chen , Fei Yin , Qingyan Bai , Zicheng Lin , Yujiu Yang

Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes. Various efforts have been proposed to expand the context window to accommodate even up to…

Computation and Language · Computer Science 2024-04-09 Xuanfan Ni , Hengyi Cai , Xiaochi Wei , Shuaiqiang Wang , Dawei Yin , Piji Li

The rapid expansion of context length in large language models (LLMs) has outpaced existing evaluation benchmarks. Current long-context benchmarks often trade off scalability and realism: synthetic tasks underrepresent real-world…

Computation and Language · Computer Science 2026-01-07 Ziyang Chen , Xing Wu , Junlong Jia , Chaochen Gao , Qi Fu , Debing Zhang , Songlin Hu

LLM-based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during…

Machine Learning · Computer Science 2026-02-12 Han Li , Letian Zhu , Bohan Zhang , Rili Feng , Jiaming Wang , Yue Pan , Earl T. Barr , Federica Sarro , Zhaoyang Chu , He Ye

Large language models (LLMs) have demonstrated strong capabilities in various aspects. However, when applying them to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they…

Computation and Language · Computer Science 2023-09-29 Zhiwei Fei , Xiaoyu Shen , Dawei Zhu , Fengzhe Zhou , Zhuo Han , Songyang Zhang , Kai Chen , Zongwen Shen , Jidong Ge

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it. Existing frontier…

Artificial Intelligence · Computer Science 2026-04-20 Ankit Maloo

People often encounter role conflicts -- social dilemmas where the expectations of multiple roles clash and cannot be simultaneously fulfilled. As large language models (LLMs) increasingly navigate these social dynamics, a critical research…

Computation and Language · Computer Science 2026-04-20 Jisu Shin , Hoyun Song , Juhyun Oh , Changgeon Ko , Eunsu Kim , Chani Jung , Alice Oh

The adeptness of Large Language Models (LLMs) in comprehending and following natural language instructions is critical for their deployment in sophisticated real-world applications. Existing evaluations mainly focus on fragmented…

Computation and Language · Computer Science 2025-05-07 Tao Zhang , Chenglin Zhu , Yanjun Shen , Wenjing Luo , Yan Zhang , Hao Liang , Tao Zhang , Fan Yang , Mingan Lin , Yujing Qiao , Weipeng Chen , Bin Cui , Wentao Zhang , Zenan Zhou

This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging…

Computation and Language · Computer Science 2025-01-06 Yushi Bai , Shangqing Tu , Jiajie Zhang , Hao Peng , Xiaozhi Wang , Xin Lv , Shulin Cao , Jiazheng Xu , Lei Hou , Yuxiao Dong , Jie Tang , Juanzi Li

Large language models (LLMs) have demonstrated significant potential in advancing various fields of research and society. However, the current community of LLMs overly focuses on benchmarks for analyzing specific foundational skills (e.g.…

The large language model (LLM)-as-judge paradigm has been used to meet the demand for a cheap, reliable, and fast evaluation of model outputs during AI system development and post-deployment monitoring. While judge models -- LLMs finetuned…

Computation and Language · Computer Science 2025-03-21 Austin Xu , Srijan Bansal , Yifei Ming , Semih Yavuz , Shafiq Joty

Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere…

Computation and Language · Computer Science 2026-03-10 Xiaona Xue , Yiqiao Huang , Jiacheng Li , Yuanhang Zheng , Huiqi Miao , Yunfei Ma , Rui Liu , Xinbao Sun , Minglu Liu , Fanyu Meng , Chao Deng , Junlan Feng

Large Language Models (LLMs) have made significant strides in handling long sequences. Some models like Gemini could even be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to…

Computation and Language · Computer Science 2024-06-13 Tianle Li , Ge Zhang , Quy Duc Do , Xiang Yue , Wenhu Chen

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find…

Computation and Language · Computer Science 2025-05-27 Wang Yang , Hongye Jin , Shaochen Zhong , Song Jiang , Qifan Wang , Vipin Chaudhary , Xiaotian Han

As language models (LMs) evolve from chat assistants to long-horizon agents capable of multi-step reasoning and tool use, existing benchmarks remain largely confined to structured or exam-style tasks that fall short of real-world…

Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks. While emerging benchmarks have been proposed to evaluate LLMs in various domains such as mathematics and computer…

Artificial Intelligence · Computer Science 2024-10-28 Junnan Dong , Zijin Hong , Yuanchen Bei , Feiran Huang , Xinrun Wang , Xiao Huang