Related papers: Multicalibration for LLM-based Code Generation

Multicalibration for Confidence Scoring in LLMs

This paper proposes the use of "multicalibration" to yield interpretable and reliable confidence scores for outputs generated by large language models (LLMs). Multicalibration asks for calibration not just marginally, but simultaneously…

Machine Learning · Statistics 2024-04-09 Gianluca Detommaso, Martin Bertran, Riccardo Fogliato, Aaron Roth

Calibrating Long-form Generations from Large Language Models

To enhance Large Language Models' (LLMs) reliability, calibration is essential -- the model's assessed confidence scores should align with the actual likelihood of its responses being correct. However, current confidence elicitation methods…

Computation and Language · Computer Science 2024-10-29 Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, Bhuwan Dhingra

A Confidence-Diversity Framework for Calibrating AI Judgement in Accessible Qualitative Coding Tasks

LLMs enable qualitative coding at large scale, but assessing reliability remains challenging where human experts seldom agree. We investigate confidence-diversity calibration as a quality assessment framework for accessible coding tasks…

Machine Learning · Computer Science 2025-08-19 Zhilong Zhao, Yindi Liu

Calibration and Correctness of Language Models for Code

Machine learning models are widely used, but can also often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so a rational decision can be made whether to use the…

Software Engineering · Computer Science 2024-08-22 Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, Toufique Ahmed

Localized Calibrated Uncertainty in Code Language Models

Large Language models (LLMs) can generate complicated source code from natural language prompts. However, LLMs can generate output that deviates from what the user wants, requiring supervision and editing. To support this process, we offer…

Software Engineering · Computer Science 2026-01-01 David Gros, Prem Devanbu

Calibrating Large Language Models with Sample Consistency

Accurately gauging the confidence level of Large Language Models' (LLMs) predictions is pivotal for their reliable application. However, LLMs are often uncalibrated inherently and elude conventional calibration techniques due to their…

Computation and Language · Computer Science 2026-02-24 Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, Chris Callison-Burch

On the Calibration of Multilingual Question Answering LLMs

Multilingual pre-trained Large Language Models (LLMs) are incredibly effective at Question Answering (QA), a core task in Natural Language Understanding, achieving high accuracies on several multilingual benchmarks. However, little is known…

Computation and Language · Computer Science 2024-04-16 Yahan Yang, Soham Dan, Dan Roth, Insup Lee

When is Multicalibration Post-Processing Necessary?

Calibration is a well-studied property of predictors which guarantees meaningful uncertainty estimates. Multicalibration is a related notion -- originating in algorithmic fairness -- which requires predictors to be simultaneously calibrated…

Machine Learning · Computer Science 2024-11-06 Dutch Hansen, Siddartha Devic, Preetum Nakkiran, Vatsal Sharan

A Close Look into the Calibration of Pre-trained Language Models

Pre-trained language models (PLMs) may fail in giving reliable estimates of their predictive uncertainty. We take a close look into this problem, aiming to answer two questions: (1) Do PLMs learn to become calibrated in the training…

Computation and Language · Computer Science 2023-05-09 Yangyi Chen, Lifan Yuan, Ganqu Cui, Zhiyuan Liu, Heng Ji

Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency

The use of large language models (LLMs) for automated code generation has emerged as a significant focus within AI research. As these pretrained models continue to evolve, their ability to understand and generate complex code structures has…

Software Engineering · Computer Science 2025-05-06 Nazmus Ashrafi, Salah Bouktif, Mohammed Mediani

On the Calibration of Large Language Models and Alignment

As large language models attract increasing attention and find widespread application, concurrent challenges of reliability also arise at the same time. Confidence calibration, an effective analysis method for gauging the reliability of…

Computation and Language · Computer Science 2023-11-23 Chiwei Zhu, Benfeng Xu, Quan Wang, Yongdong Zhang, Zhendong Mao

Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes

Language models (LMs) have exhibited impressive abilities in generating codes from natural language requirements. In this work, we highlight the diversity of code generated by LMs as a critical criterion for evaluating their code generation…

Software Engineering · Computer Science 2024-08-28 Heejae Chon, Seonghyeon Lee, Jinyoung Yeo, Dongha Lee

A Survey on Evaluating Large Language Models in Code Generation Tasks

This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development,…

Software Engineering · Computer Science 2025-03-05 Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, Wei Ye, Shikun Zhang

Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision

In today's AI-assisted software engineering landscape, developers increasingly depend on LLMs that are highly capable, yet inherently imperfect. The tendency of these models to produce incorrect outputs can reduce developer productivity. To…

Software Engineering · Computer Science 2026-04-09 Hong Yi Lin, Chunhua Liu, Haoyu Gao, Patanamon Thongtanunam, Christoph Treude

Benchmarking and Revisiting Code Generation Assessment: A Mutation-Based Approach

Code Large Language Models (CLLMs) have exhibited outstanding performance in program synthesis, attracting the focus of the research community. The evaluation of CLLM's program synthesis capability has generally relied on manually curated…

Software Engineering · Computer Science 2025-05-13 Longtian Wang, Tianlin Li, Xiaofei Xie, Yuhan Zhi, Jian Wang, Chao Shen

Uncertainty Awareness of Large Language Models Under Code Distribution Shifts: A Benchmark Study

Large Language Models (LLMs) have been widely employed in programming language analysis to enhance human productivity. Yet, their reliability can be compromised by various code distribution shifts, leading to inconsistent outputs. While…

Software Engineering · Computer Science 2024-02-12 Yufei Li, Simin Chen, Yanghong Guo, Wei Yang, Yue Dong, Cong Liu

Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis

Artificial Intelligence (AI)-driven code generation tools are increasingly used throughout the software development lifecycle to accelerate coding tasks. However, the security of AI-generated code using Large Language Models (LLMs) remains…

Cryptography and Security · Computer Science 2026-03-10 Mohammed Kharma, Soohyeon Choi, Mohammed AlKhanafseh, David Mohaisen

Ocassionally Secure: A Comparative Analysis of Code Generation Assistants

Large Language Models (LLMs) are being increasingly utilized in various applications, with code generations being a notable example. While previous research has shown that LLMs have the capability to generate both secure and insecure…

Cryptography and Security · Computer Science 2025-09-30 Ran Elgedawy, Porter Dosch, John Sadik, Senjuti Dutta, Anuj Gautam, Konstantinos Georgiou, Farzin Gholamrezae, Fujiao Ji, Kyungchan Lim, Qian Liu, Scott Ruoti

Calibrating Beyond English: Language Diversity for Better Quantized Multilingual LLM

Quantization is an effective technique for reducing the storage footprint and computational costs of Large Language Models (LLMs), but it often results in performance degradation. Existing post-training quantization methods typically use…

Computation and Language · Computer Science 2026-01-27 Everlyn Asiko Chimoto, Mostafa Elhoushi, Bruce A. Bassett

Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems

Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on…

Computation and Language · Computer Science 2025-10-06 Aakriti Agrawal, Rohith Aralikatti, Anirudh Satheesh, Souradip Chakraborty, Amrit Singh Bedi, Furong Huang