Related papers: Multi-Programming Language Sandbox for LLMs

FullStack Bench: Evaluating LLMs as Full Stack Coders

As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets only evaluate limited application domains. To…

Artificial Intelligence · Computer Science 2025-05-13 Bytedance-Seed-Foundation-Code-Team: Yao Cheng , Jianfeng Chen , Jie Chen , Li Chen , Liyu Chen , Wentao Chen , Zhengyu Chen , Shijie Geng , Aoyan Li , Bo Li , Bowen Li , Linyi Li , Boyi Liu , Jiaheng Liu , Kaibo Liu , Qi Liu , Shukai Liu , Siyao Liu , Tianyi Liu , Tingkai Liu , Yongfei Liu , Rui Long , Jing Mai , Guanghan Ning , Z. Y. Peng , Kai Shen , Jiahao Su , Jing Su , Tao Sun , Yifan Sun , Yunzhe Tao , Guoyin Wang , Siwei Wang , Xuwu Wang , Yite Wang , Zihan Wang , Jinxiang Xia , Liang Xiang , Xia Xiao , Yongsheng Xiao , Chenguang Xi , Shulin Xin , Jingjing Xu , Shikun Xu , Hongxia Yang , Jack Yang , Yingxiang Yang , Jianbo Yuan , Jun Zhang , Yufeng Zhang , Yuyu Zhang , Shen Zheng , He Zhu , Ming Zhu

LLMBox: A Comprehensive Library for Large Language Models

To facilitate the research on large language models (LLMs), this paper presents a comprehensive and unified library, LLMBox, to ease the development, use, and evaluation of LLMs. This library is featured with three main merits: (1) a…

Computation and Language · Computer Science 2024-07-09 Tianyi Tang , Yiwen Hu , Bingqian Li , Wenyang Luo , Zijing Qin , Haoxiang Sun , Jiapeng Wang , Shiyi Xu , Xiaoxue Cheng , Geyang Guo , Han Peng , Bowen Zheng , Yiru Tang , Yingqian Min , Yushuo Chen , Jie Chen , Yuanqian Zhao , Luran Ding , Yuhao Wang , Zican Dong , Chunxuan Xia , Junyi Li , Kun Zhou , Wayne Xin Zhao , Ji-Rong Wen

SandboxEval: Towards Securing Test Environment for Untrusted Code

While large language models (LLMs) are powerful assistants in programming tasks, they may also produce malicious code. Testing LLM-generated code therefore poses significant risks to assessment infrastructure tasked with executing untrusted…

Cryptography and Security · Computer Science 2025-04-02 Rafiqul Rabin , Jesse Hostetler , Sean McGregor , Brett Weir , Nick Judd

ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models

Code sandboxes have emerged as a critical infrastructure for advancing the coding capabilities of large language models, providing verifiable feedback for both RL training and evaluation. However, existing systems fail to provide accurate…

Software Engineering · Computer Science 2026-05-01 Jiasheng Zheng , Xin Zheng , Boxi Cao , Pengbo Wang , Zhengzhao Ma , Qiming Zhu , Jiazhen Jiang , Yaojie Lu , Hongyu Lin , Xianpei Han , Le Sun

LitterBox+: An Extensible Framework for LLM-enhanced Scratch Static Code Analysis

Large language models (LLMs) have become an essential tool to support developers using traditional text-based programming languages, but the graphical notation of the block-based Scratch programming environment inhibits the use of LLMs. To…

Software Engineering · Computer Science 2026-02-09 Benedikt Fein , Florian Obermüller , Gordon Fraser

ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Recent large language models (LLMs) advancements sparked a growing research interest in tool assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on…

Computation and Language · Computer Science 2025-04-18 Jiarui Lu , Thomas Holleis , Yizhe Zhang , Bernhard Aumayer , Feng Nan , Felix Bai , Shuang Ma , Shen Ma , Mengyu Li , Guoli Yin , Zirui Wang , Ruoming Pang

Large Language Models for Code Generation: The Practitioners Perspective

Large Language Models (LLMs) have emerged as coding assistants, capable of generating source code from natural language prompts. With the increasing adoption of LLMs in software development, academic research and industry based projects are…

Software Engineering · Computer Science 2025-01-29 Zeeshan Rasheed , Muhammad Waseem , Kai Kristian Kemell , Aakash Ahmad , Malik Abdul Sami , Jussi Rasku , Kari Systä , Pekka Abrahamsson

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have…

Software Engineering · Computer Science 2025-04-02 Terry Yue Zhuo , Minh Chien Vu , Jenny Chim , Han Hu , Wenhao Yu , Ratnadira Widyasari , Imam Nur Bani Yusuf , Haolan Zhan , Junda He , Indraneil Paul , Simon Brunner , Chen Gong , Thong Hoang , Armel Randy Zebaze , Xiaoheng Hong , Wen-Ding Li , Jean Kaddour , Ming Xu , Zhihan Zhang , Prateek Yadav , Naman Jain , Alex Gu , Zhoujun Cheng , Jiawei Liu , Qian Liu , Zijian Wang , Binyuan Hui , Niklas Muennighoff , David Lo , Daniel Fried , Xiaoning Du , Harm de Vries , Leandro Von Werra

PromptBench: A Unified Library for Evaluation of Large Language Models

The evaluation of large language models (LLMs) is crucial to assess their performance and mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to evaluate LLMs. It consists of several key components…

Artificial Intelligence · Computer Science 2024-08-21 Kaijie Zhu , Qinlin Zhao , Hao Chen , Jindong Wang , Xing Xie

MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds

Large Language Model (LLM) alignment aims to ensure that LLM outputs match with human values. Researchers have demonstrated the severity of alignment problems with a large spectrum of jailbreak techniques that can induce LLMs to produce…

Computation and Language · Computer Science 2024-02-06 Xiaolong Jin , Zhuo Zhang , Xiangyu Zhang

A Survey of Calibration Process for Black-Box LLMs

Large Language Models (LLMs) demonstrate remarkable performance in semantic understanding and generation, yet accurately assessing their output reliability remains a significant challenge. While numerous studies have explored calibration…

Artificial Intelligence · Computer Science 2024-12-18 Liangru Xie , Hui Liu , Jingying Zeng , Xianfeng Tang , Yan Han , Chen Luo , Jing Huang , Zhen Li , Suhang Wang , Qi He

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing , Wenbin Hu , Haochen Shi , Hanyu Yang , Sirui Zhang , Shaojin Chen , Haoran Li , Yangqiu Song

Large Language Models Synergize with Automated Machine Learning

Recently, program synthesis driven by large language models (LLMs) has become increasingly popular. However, program synthesis for machine learning (ML) tasks still poses significant challenges. This paper explores a novel form of program…

Software Engineering · Computer Science 2024-09-10 Jinglue Xu , Jialong Li , Zhen Liu , Nagar Anthel Venkatesh Suryanarayanan , Guoyuan Zhou , Jia Guo , Hitoshi Iba , Kenji Tei

LLM Agents Should Employ Security Principles

Large Language Model (LLM) agents show considerable promise for automating complex tasks using contextual reasoning; however, interactions involving multiple agents and the system's susceptibility to prompt injection and other forms of…

Cryptography and Security · Computer Science 2025-06-02 Kaiyuan Zhang , Zian Su , Pin-Yu Chen , Elisa Bertino , Xiangyu Zhang , Ninghui Li

LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment

Although large language models (LLMs) have demonstrated their strong intelligence ability, the high demand for computation and storage hinders their practical application. To this end, many model compression techniques are proposed to…

Computation and Language · Computer Science 2024-11-01 Ge Yang , Changyi He , Jinyang Guo , Jianyu Wu , Yifu Ding , Aishan Liu , Haotong Qin , Pengliang Ji , Xianglong Liu

Codellm-Devkit: A Framework for Contextualizing Code LLMs with Program Analysis Insights

Large Language Models for Code (or code LLMs) are increasingly gaining popularity and capabilities, offering a wide array of functionalities such as code completion, code generation, code summarization, test generation, code translation,…

Software Engineering · Computer Science 2024-10-18 Rahul Krishna , Rangeet Pan , Raju Pavuluri , Srikanth Tamilselvam , Maja Vukovic , Saurabh Sinha

On Integrating Large Language Models and Scenario-Based Programming for Improving Software Reliability

Large Language Models (LLMs) are fast becoming indispensable tools for software developers, assisting or even partnering with them in crafting complex programs. The advantages are evident -- LLMs can significantly reduce development time,…

Software Engineering · Computer Science 2025-09-12 Ayelet Berzack , Guy Katz

On the Opportunities of Large Language Models for Programming Process Data

Computing educators and researchers have used programming process data to understand how programs are constructed and what sorts of problems students struggle with. Although such data shows promise for using it for feedback, fully automated…

Computers and Society · Computer Science 2024-11-04 John Edwards , Arto Hellas , Juho Leinonen

ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework

Multi-agent Large Language Model (LLM) systems have been leading the way in applied LLM research across a number of fields. One notable area is software development, where researchers have advanced the automation of code implementation,…

Software Engineering · Computer Science 2025-11-25 Vali Tawosi , Keshav Ramani , Salwa Alamir , Xiaomo Liu

Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis

Recent advances in code generation have illuminated the potential of employing large language models (LLMs) for general-purpose programming languages such as Python and C++, opening new opportunities for automating software development and…

Machine Learning · Computer Science 2025-03-06 Jiahao Gai , Hao Mark Chen , Zhican Wang , Hongyu Zhou , Wanru Zhao , Nicholas Lane , Hongxiang Fan