Related papers: Decoding Secret Memorization in Code LLMs Through …

Compressed code: the hidden effects of quantization and distillation on programming tokens

Large Language Models (LLMs) have demonstrated exceptional code generation capabilities, yet their token-level mechanisms remain underexplored, particularly in compressed models. Through systematic analysis of programming language token…

Software Engineering · Computer Science 2026-02-10 Viacheslav Siniaev, Iaroslav Chelombitko, Aleksey Komissarov

Learned or Memorized? Quantifying Memorization Advantage in Code LLMs

The lack of transparency about code datasets used to train large language models (LLMs) makes it difficult to detect, evaluate, and mitigate data leakage. We present a perturbation-based method to quantify memorization advantage in code…

Software Engineering · Computer Science 2026-04-16 Djiré Albérick Euraste, Kaboré Abdoul Kader, Jordan Samhi, Earl T. Barr, Jacques Klein, Tegawendé F. Bissyandé

Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding

Large Language Models (LLMs) often hallucinate, generating content inconsistent with the input. Retrieval-Augmented Generation (RAG) and Reinforcement Learning with Human Feedback (RLHF) can mitigate hallucinations but require…

Computation and Language · Computer Science 2026-02-02 Yifan Zhu, Huiqiang Rong, Haoran Luo

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

As large language models (LLMs) become increasingly integrated into real-world applications such as code generation and chatbot assistance, extensive efforts have been made to align LLM behavior with human values, including safety.…

Cryptography and Security · Computer Science 2024-07-29 Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran

Improve Decoding Factuality by Token-wise Cross Layer Entropy of Large Language Models

Despite their impressive capacities, Large language models (LLMs) often struggle with the hallucination issue of generating inaccurate or fabricated content even when they possess correct knowledge. In this paper, we extend the exploration…

Computation and Language · Computer Science 2025-02-06 Jialiang Wu, Yi Shen, Sijia Liu, Yi Tang, Sen Song, Xiaoyi Wang, Longjun Cai

Learning to Decode Collaboratively with Multiple Language Models

We propose a method to teach multiple large language models (LLM) to collaborate by interleaving their generations at the token level. We model the decision of which LLM generates the next token as a latent variable. By optimizing the…

Computation and Language · Computer Science 2024-08-28 Shannon Zejiang Shen, Hunter Lang, Bailin Wang, Yoon Kim, David Sontag

LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers

Large language models (LLMs) excel at natural language understanding and generation but remain vulnerable to factual errors, limiting their reliability in knowledge-intensive tasks. While decoding-time strategies provide a promising…

Artificial Intelligence · Computer Science 2025-10-06 Jingze Zhu, Yongliang Wu, Wenbo Zhu, Jiawang Cao, Yanqiang Zheng, Jiawei Chen, Xu Yang, Bernt Schiele, Jonas Fischer, Xinting Hu

Benchmarking Large Language Models for IoC Recovery under Adversarial Code Obfuscation and Encryption

Software obfuscation and encryption present persistent challenges for program comprehension and security analysis, particularly when adversaries conceal Indicators of Compromise (IoCs) such as IP addresses within source code. While Large…

Cryptography and Security · Computer Science 2026-05-11 Jaime Morales, Sergio Pastrana, Juan Tapiador

Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach

Concerns regarding the propensity of Large Language Models (LLMs) to produce inaccurate outputs, also known as hallucinations, have escalated. Detecting them is vital for ensuring the reliability of applications relying on LLM-generated…

Computation and Language · Computer Science 2024-05-31 Ernesto Quevedo, Jorge Yero, Rachel Koerner, Pablo Rivas, Tomas Cerny

Constrained Decoding for Secure Code Generation

Code Large Language Models (Code LLMs) have been increasingly used by developers to boost productivity, but they often generate vulnerable code. Thus, there is an urgent need to ensure that code generated by Code LLMs is correct and secure.…

Cryptography and Security · Computer Science 2024-07-23 Yanjun Fu, Ethan Baker, Yu Ding, Yizheng Chen

Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective

Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. While the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), CLLMs are shown to…

Cryptography and Security · Computer Science 2026-04-21 Meifang Chen, Zhe Yang, Huang Nianchen, Yizhan Huang, Yichen Li, Zihan Li, Michael R. Lyu

Mitigating Sensitive Information Leakage in LLMs4Code through Machine Unlearning

Large Language Models for Code (LLMs4Code) have achieved strong performance in code generation, but recent studies reveal that they may memorize and leak sensitive information contained in training data, posing serious privacy risks. To…

Cryptography and Security · Computer Science 2026-01-29 Shanzhi Gu, Zhaoyang Qu, Ruotong Geng, Mingyang Geng, Shangwen Wang, Chuanfu Xu, Haotian Wang, Zhipeng Lin, Dezun Dong

On LLMs' Internal Representation of Code Correctness

Despite the effectiveness of large language models (LLMs) for code generation, they often output incorrect code. One reason is that model output probabilities are often not well-correlated with correctness, and reflect only the final output…

Software Engineering · Computer Science 2026-01-22 Francisco Ribeiro, Claudio Spiess, Prem Devanbu, Sarah Nadi

CodeCipher: Learning to Obfuscate Source Code Against LLMs

While large code language models have made significant strides in AI-assisted coding tasks, there are growing concerns about privacy challenges. The user code is transparent to the cloud LLM service provider, inducing risks of unauthorized…

Computation and Language · Computer Science 2024-10-10 Yalan Lin, Chengcheng Wan, Yixiong Fang, Xiaodong Gu

Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features

Recent progress in large language models (LLMs) for code generation has raised serious concerns about intellectual property protection. Malicious users can exploit LLMs to produce paraphrased versions of proprietary code that closely…

Artificial Intelligence · Computer Science 2026-01-12 Shinwoo Park, Hyundong Jin, Jeong-won Cha, Yo-Sub Han

Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation

Decoding strategies for generative large language models (LLMs) are a critical but often underexplored aspect of text generation tasks. Guided by specific hyperparameters, these strategies aim to transform the raw probability distributions…

Computation and Language · Computer Science 2024-12-17 Esteban Garces Arias, Meimingwei Li, Christian Heumann, Matthias Aßenmacher

KCTS: Knowledge-Constrained Tree Search Decoding with Token-Level Hallucination Detection

Large Language Models (LLMs) have demonstrated remarkable human-level natural language generation capabilities. However, their potential to generate misinformation, often called the hallucination problem, poses a significant risk to their…

Computation and Language · Computer Science 2023-10-16 Sehyun Choi, Tianqing Fang, Zhaowei Wang, Yangqiu Song

Enhancing Multi-Image Understanding through Delimiter Token Scaling

Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major reason is the cross-image information leakage, where the model…

Computer Vision and Pattern Recognition · Computer Science 2026-02-26 Minyoung Lee, Yeji Park, Dongjun Hwang, Yejin Kim, Seong Joon Oh, Junsuk Choe

LLM Performance for Code Generation on Noisy Tasks

This paper investigates the ability of large language models (LLMs) to recognise and solve tasks which have been obfuscated beyond recognition. Focusing on competitive programming and benchmark tasks (LeetCode and MATH), we compare…

Machine Learning · Computer Science 2025-05-30 Radzim Sendyka, Christian Cabrera, Andrei Paleyes, Diana Robinson, Neil Lawrence

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science 2024-04-19 Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao