Related papers: ComplexFuncBench: Exploring Multi-Step and Constra…

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Task automation has been greatly empowered by recent advances in Large Language Models (LLMs) via Python code, with tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have…

Software Engineering · Computer Science 2025-04-02 Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra

CFBench: A Comprehensive Constraints-Following Benchmark for LLMs

The adeptness of Large Language Models (LLMs) in comprehending and following natural language instructions is critical for their deployment in sophisticated real-world applications. Existing evaluations mainly focus on fragmented…

Computation and Language · Computer Science 2025-05-07 Tao Zhang, Chenglin Zhu, Yanjun Shen, Wenjing Luo, Yan Zhang, Hao Liang, Tao Zhang, Fan Yang, Mingan Lin, Yujing Qiao, Weipeng Chen, Bin Cui, Wentao Zhang, Zenan Zhou

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports,…

Computation and Language · Computer Science 2024-06-21 Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

LongFuncEval: Measuring the effectiveness of long context models for function calling

Multiple recent studies have documented large language models' (LLMs) performance on calling external tools/functions. Others focused on LLMs' abilities to handle longer context lengths. At the intersection of these areas lies another…

Software Engineering · Computer Science 2025-05-19 Kiran Kate, Tejaswini Pedapati, Kinjal Basu, Yara Rizk, Vijil Chenthamarakshan, Subhajit Chaudhury, Mayank Agarwal, Ibrahim Abdelaziz

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which…

Software Engineering · Computer Science 2024-09-17 Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia

LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive…

Software Engineering · Computer Science 2025-09-12 Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang

LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark

The rapid expansion of context length in large language models (LLMs) has outpaced existing evaluation benchmarks. Current long-context benchmarks often trade off scalability and realism: synthetic tasks underrepresent real-world…

Computation and Language · Computer Science 2026-01-07 Ziyang Chen, Xing Wu, Junlong Jia, Chaochen Gao, Qi Fu, Debing Zhang, Songlin Hu

HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios

Evaluating the performance of LLMs in multi-turn human-agent interactions presents significant challenges, particularly due to the complexity and variability of user behavior. In this paper, we introduce HammerBench, a novel benchmark…

Computation and Language · Computer Science 2025-02-18 Jun Wang, Jiamu Zhou, Muning Wen, Xiaoyun Mo, Haoyu Zhang, Qiqiang Lin, Cheng Jin, Xihuai Wang, Weinan Zhang, Qiuying Peng, Jun Wang

CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere…

Computation and Language · Computer Science 2026-03-10 Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find…

Computation and Language · Computer Science 2025-05-27 Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios

As Large Language Models (LLMs) evolve in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become critical for real-world applications. However, existing benchmarks seldom focus on…

Computation and Language · Computer Science 2025-07-25 Xiaodong Wu, Minhao Wang, Yichen Liu, Xiaoming Shi, He Yan, Xiangju Lu, Junmin Zhu, Wei Zhang

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging…

Computation and Language · Computer Science 2025-01-06 Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens

Processing and reasoning over long contexts is crucial for many practical applications of Large Language Models (LLMs), such as document comprehension and agent construction. Despite recent strides in making LLMs process contexts with more…

Computation and Language · Computer Science 2024-02-27 Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, Maosong Sun

Benchmarking Complex Instruction-Following with Multiple Constraints Composition

Instruction following is one of the fundamental capabilities of large language models (LLMs). As the ability of LLMs is constantly improving, they have been increasingly applied to deal with complex human instructions in real-world…

Computation and Language · Computer Science 2024-11-01 Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, Minlie Huang

FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models

The ability to follow instructions is crucial for Large Language Models (LLMs) to handle various real-world applications. Existing benchmarks primarily focus on evaluating pure response quality, rather than assessing whether the response…

Computation and Language · Computer Science 2024-06-06 Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, Wei Wang

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li

LIFEBench: Evaluating Length Instruction Following in Large Language Models

While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., write a 10,000-word novel. Additionally,…

Computation and Language · Computer Science 2025-06-12 Wei Zhang, Zhenhong Zhou, Kun Wang, Junfeng Fang, Yuanhe Zhang, Rui Wang, Ge Zhang, Xavier Li, Li Sun, Lingjuan Lyu, Yang Liu, Sen Su

Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice

Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in its automation. However, existing benchmarks for LLM-based code review face three major limitations. Lack…

Software Engineering · Computer Science 2026-01-01 Ruida Hu, Xinchen Wang, Xin-Cheng Wen, Zhao Zhang, Bo Jiang, Pengfei Gao, Chao Peng, Cuiyun Gao

LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams

Long-context understanding poses significant challenges in natural language processing, particularly for real-world dialogues characterized by speech-based elements, high redundancy, and uneven information density. Although large language…

Computation and Language · Computer Science 2025-04-25 Yongxuan Wu, Runyu Chen, Peiyu Liu, Hongjin Qian

TaskBench: Benchmarking Large Language Models for Task Automation

In recent years, the remarkable progress of large language models (LLMs) has sparked interest in task automation, which involves decomposing complex tasks described by user instructions into sub-tasks and invoking external tools to execute…

Computation and Language · Computer Science 2024-11-04 Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, Yueting Zhuang