Related papers: Towards A Generalist Code Embedding Model Based On…

CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval

Despite the success of text retrieval in many NLP tasks, code retrieval remains a largely underexplored area. Most text retrieval systems are tailored for natural language queries, often neglecting the specific challenges of retrieving…

Software Engineering · Computer Science 2025-08-11 Ye Liu, Rui Meng, Shafiq Joty, Silvio Savarese, Caiming Xiong, Yingbo Zhou, Semih Yavuz

Lessons Learned on Information Retrieval in Electronic Health Records: A Comparison of Embedding Models and Pooling Strategies

Objective: Applying large language models (LLMs) to the clinical domain is challenging due to the context-heavy nature of processing medical records. Retrieval-augmented generation (RAG) offers a solution by facilitating reasoning over…

Computation and Language · Computer Science 2025-08-21 Skatje Myers, Timothy A. Miller, Yanjun Gao, Matthew M. Churpek, Anoop Mayampurath, Dmitriy Dligach, Majid Afshar

Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches

Recent advances in large language models (LLMs) have significantly improved automated code generation. While existing approaches have achieved strong performance at the function and file levels, real-world software engineering requires…

Software Engineering · Computer Science 2026-05-21 Yicheng Tao, Yuante Li, Yao Qin, Yepang Liu

Mind the Gap: A Generalized Approach for Cross-Modal Embedding Alignment

Retrieval-Augmented Generation (RAG) systems enhance text generation by incorporating external knowledge but often struggle when retrieving context across different text modalities due to semantic gaps. We introduce a generalized…

Machine Learning · Computer Science 2024-11-01 Arihan Yadav, Alan McMillan

ELITE: Embedding-Less retrieval with Iterative Text Exploration

Large Language Models (LLMs) have achieved impressive progress in natural language processing, but their limited ability to retain long-term context constrains performance on document-level or multi-turn tasks. Retrieval-Augmented…

Computation and Language · Computer Science 2025-05-20 Zhangyu Wang, Siyuan Gao, Rong Zhou, Hao Wang, Li Ning

An Effective Approach to Embedding Source Code by Combining Large Language and Sentence Embedding Models

The advent of large language models (LLMs) has significantly advanced artificial intelligence (AI) in software engineering (SE), with source code embeddings playing a crucial role in tasks such as source code clone detection and source code…

Software Engineering · Computer Science 2025-06-04 Zixiang Xian, Chenhui Cui, Rubing Huang, Chunrong Fang, Zhenyu Chen

An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities

Code generation aims to automatically generate code snippets of specific programming language according to natural language descriptions. The continuous advancements in deep learning, particularly pre-trained models, have empowered the code…

Software Engineering · Computer Science 2025-01-24 Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, Xin Xia

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and…

Software Engineering · Computer Science 2026-05-11 Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, Fan Zhou, Hang Yu

Language Agnostic Code Embeddings

Recently, code language models have achieved notable advancements in addressing a diverse array of essential code comprehension and generation tasks. Yet, the field lacks a comprehensive deep dive and understanding of the code embeddings of…

Computation and Language · Computer Science 2023-10-26 Saiteja Utpala, Alex Gu, Pin Yu Chen

Embedding API Dependency Graph for Neural Code Generation

The problem of code generation from textual program descriptions has long been viewed as a grand challenge in software engineering. In recent years, many deep learning based approaches have been proposed, which can generate a sequence of…

Software Engineering · Computer Science 2021-04-23 Chen Lyu, Ruyun Wang, Hongyu Zhang, Hanwen Zhang, Songlin Hu

Enhancing Technical Documents Retrieval for RAG

In this paper, we introduce Technical-Embeddings, a novel framework designed to optimize semantic retrieval in technical documentation, with applications in both hardware and software development. Our approach addresses the challenges of…

Information Retrieval · Computer Science 2025-09-05 Songjiang Lai, Tsun-Hin Cheung, Ka-Chun Fung, Kaiwen Xue, Kwan-Ho Lin, Yan-Ming Choi, Vincent Ng, Kin-Man Lam

Towards Effective Code-Integrated Reasoning

In this paper, we investigate code-integrated reasoning, where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external…

Computation and Language · Computer Science 2025-06-02 Fei Bai, Yingqian Min, Beichen Zhang, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen

ReCode: Improving LLM-based Code Repair with Fine-Grained Retrieval-Augmented Generation

Recent advances in large language models (LLMs) have demonstrated impressive capabilities in code-related tasks, such as code generation and automated program repair. Despite their promising performance, most existing approaches for code…

Software Engineering · Computer Science 2025-09-03 Yicong Zhao, Shisong Chen, Jiacheng Zhang, Zhixu Li

CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval

Code search, framed as information retrieval (IR), underpins modern software engineering and increasingly powers retrieval-augmented generation (RAG), improving code discovery, reuse, and the reliability of LLM-based coding. Yet existing…

Software Engineering · Computer Science 2026-04-20 Jiahui Geng, Qing Li, Fengyu Cai, Fakhri Karray

BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis

Automatic generation of executable Blender code from natural language remains challenging, with state-of-the-art LLMs producing frequent syntactic errors and geometrically inconsistent objects. We present BlenderRAG, a retrieval-augmented…

Computer Vision and Pattern Recognition · Computer Science 2026-05-04 Massimo Rondelli, Francesco Pivi, Maurizio Gabbrielli

CodeRAG-Bench: Can Retrieval Augment Code Generation?

While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate…

Software Engineering · Computer Science 2025-02-28 Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, Daniel Fried

Understanding the Design Decisions of Retrieval-Augmented Generation Systems

Retrieval-Augmented Generation (RAG) has emerged as a critical technique for enhancing large language model (LLM) capabilities. However, practitioners face significant challenges when making RAG deployment decisions. While existing research…

Software Engineering · Computer Science 2025-07-22 Shengming Zhao, Yuchen Shao, Yuheng Huang, Jiayang Song, Zhijie Wang, Chengcheng Wan, Lei Ma

CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning

Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical…

Machine Learning · Computer Science 2022-11-04 Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C. H. Hoi

CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

Utilizing large language models to generate codes has shown promising meaning in software development revolution. Despite the intelligence shown by the large language models, their specificity in code generation can still be improved due to…

Software Engineering · Computer Science 2025-05-20 Kounianhua Du, Jizheng Chen, Renting Rui, Huacan Chai, Lingyue Fu, Wei Xia, Yasheng Wang, Ruiming Tang, Yong Yu, Weinan Zhang

AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion

Repository-level code completion remains a challenging task for existing code large language models (code LLMs) due to their limited understanding of repository-specific context and domain knowledge. While retrieval-augmented generation…

Software Engineering · Computer Science 2026-01-28 Tianyue Jiang, Yanli Wang, Yanlin Wang, Daya Guo, Ensheng Shi, Yuchi Ma, Jiachi Chen, Zibin Zheng