Related papers: ComplexFuncBench: Exploring Multi-Step and Constra…

200 papers

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks range from software engineering development to general-purpose reasoning. While current benchmarks have…

The adeptness of Large Language Models (LLMs) in comprehending and following natural language instructions is critical for their deployment in sophisticated real-world applications. Existing evaluations mainly focus on fragmented…

Computation and Language · Computer Science 2025-05-07 Tao Zhang, Chenglin Zhu, Yanjun Shen, Wenjing Luo, Yan Zhang, Hao Liang, Tao Zhang, Fan Yang, Mingan Lin, Yujing Qiao, Weipeng Chen, Bin Cui, Wentao Zhang, Zenan Zhou

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports,…

Computation and Language · Computer Science 2024-06-21 Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

Multiple recent studies have documented large language models' (LLMs) performance on calling external tools/functions. Others focused on LLMs' abilities to handle longer context lengths. At the intersection of these areas lies another…

In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which…

Software Engineering · Computer Science 2024-09-17 Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia

The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive…

The rapid expansion of context length in large language models (LLMs) has outpaced existing evaluation benchmarks. Current long-context benchmarks often trade off scalability and realism: synthetic tasks underrepresent real-world…

Computation and Language · Computer Science 2026-01-07 Ziyang Chen, Xing Wu, Junlong Jia, Chaochen Gao, Qi Fu, Debing Zhang, Songlin Hu

Evaluating the performance of LLMs in multi-turn human-agent interactions presents significant challenges, particularly due to the complexity and variability of user behavior. In this paper, we introduce HammerBench, a novel benchmark…

Computation and Language · Computer Science 2025-02-18 Jun Wang, Jiamu Zhou, Muning Wen, Xiaoyun Mo, Haoyu Zhang, Qiqiang Lin, Cheng Jin, Xihuai Wang, Weinan Zhang, Qiuying Peng, Jun Wang

Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere…

Computation and Language · Computer Science 2026-03-10 Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find…

Computation and Language · Computer Science 2025-05-27 Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

As Large Language Models (LLMs) evolve in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become critical for real-world applications. However, existing benchmarks seldom focus on…

Computation and Language · Computer Science 2025-07-25 Xiaodong Wu, Minhao Wang, Yichen Liu, Xiaoming Shi, He Yan, Xiangju Lu, Junmin Zhu, Wei Zhang

This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging…

Computation and Language · Computer Science 2025-01-06 Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

Processing and reasoning over long contexts is crucial for many practical applications of Large Language Models (LLMs), such as document comprehension and agent construction. Despite recent strides in making LLMs process contexts with more…

Computation and Language · Computer Science 2024-02-27 Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, Maosong Sun

Instruction following is one of the fundamental capabilities of large language models (LLMs). As the ability of LLMs is constantly improving, they have been increasingly applied to deal with complex human instructions in real-world…

Computation and Language · Computer Science 2024-11-01 Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, Minlie Huang

The ability to follow instructions is crucial for Large Language Models (LLMs) to handle various real-world applications. Existing benchmarks primarily focus on evaluating pure response quality, rather than assessing whether the response…

Computation and Language · Computer Science 2024-06-06 Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, Wei Wang

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li

While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., write a 10,000-word novel. Additionally,…

Computation and Language · Computer Science 2025-06-12 Wei Zhang, Zhenhong Zhou, Kun Wang, Junfeng Fang, Yuanhe Zhang, Rui Wang, Ge Zhang, Xavier Li, Li Sun, Lingjuan Lyu, Yang Liu, Sen Su

Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in its automation. However, existing benchmarks for LLM-based code review face three major limitations. Lack…

Software Engineering · Computer Science 2026-01-01 Ruida Hu, Xinchen Wang, Xin-Cheng Wen, Zhao Zhang, Bo Jiang, Pengfei Gao, Chao Peng, Cuiyun Gao

Long-context understanding poses significant challenges in natural language processing, particularly for real-world dialogues characterized by speech-based elements, high redundancy, and uneven information density. Although large language…

Computation and Language · Computer Science 2025-04-25 Yongxuan Wu, Runyu Chen, Peiyu Liu, Hongjin Qian

In recent years, the remarkable progress of large language models (LLMs) has sparked interest in task automation, which involves decomposing complex tasks described by user instructions into sub-tasks and invoking external tools to execute…

Computation and Language · Computer Science 2024-11-04 Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, Yueting Zhuang