Related papers: Code Representation Learning At Scale

CodeRetriever: Unimodal and Bimodal Contrastive Learning for Code Search

In this paper, we propose the CodeRetriever model, which learns the function-level code semantic representations through large-scale code-text contrastive pre-training. We adopt two contrastive learning schemes in CodeRetriever: unimodal…

Computation and Language · Computer Science 2022-10-27 Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu, Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang, Weizhu Chen, Nan Duan

Towards Better Code Understanding in Decoder-Only Models with Contrastive Learning

Recent advances in large-scale code generation models have led to remarkable progress in producing high-quality code. These models are trained in a self-supervised manner on extensive unlabeled code corpora using a decoder-only…

Software Engineering · Computer Science 2026-02-12 Jiayi Lin, Yanlin Wang, Yibiao Yang, Lei Zhang, Yutao Xie

Hybrid Generative-Contrastive Representation Learning

Unsupervised representation learning has recently received lots of interest due to its powerful generalizability through effectively leveraging large-scale unlabeled data. There are two prevalent approaches for this, contrastive learning…

Machine Learning · Computer Science 2021-06-14 Saehoon Kim, Sungwoong Kim, Juho Lee

Soft-Labeled Contrastive Pre-training for Function-level Code Representation

Code contrastive pre-training has recently achieved significant progress on code-related tasks. In this paper, we present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction…

Computation and Language · Computer Science 2022-10-27 Xiaonan Li, Daya Guo, Yeyun Gong, Yun Lin, Yelong Shen, Xipeng Qiu, Daxin Jiang, Weizhu Chen, Nan Duan

About contrastive unsupervised representation learning for classification and its convergence

Contrastive representation learning has been recently proved to be very efficient for self-supervised training. These methods have been successfully used to train encoders which perform comparably to supervised training on downstream…

Machine Learning · Computer Science 2020-12-03 Ibrahim Merad, Yiyang Yu, Emmanuel Bacry, Stéphane Gaïffas

ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning

Large-scale pre-trained models such as CodeBERT, GraphCodeBERT have earned widespread attention from both academia and industry. Attributed to the superior ability in code representation, they have been further applied in multiple…

Software Engineering · Computer Science 2023-01-24 Shangqing Liu, Bozhi Wu, Xiaofei Xie, Guozhu Meng, Yang Liu

e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce

Understanding vision and language representations of product content is vital for search and recommendation applications in e-commerce. As a backbone for online shopping platforms and inspired by the recent success in representation…

Machine Learning · Computer Science 2022-08-23 Wonyoung Shin, Jonghun Park, Taekang Woo, Yongwoo Cho, Kwangjin Oh, Hwanjun Song

ContraCLM: Contrastive Learning For Causal Language Model

Despite exciting progress in causal language models, the expressiveness of the representations is largely limited due to poor discrimination ability. To remedy this issue, we present ContraCLM, a novel contrastive learning framework at both…

Computation and Language · Computer Science 2023-05-04 Nihal Jain, Dejiao Zhang, Wasi Uddin Ahmad, Zijian Wang, Feng Nan, Xiaopeng Li, Ming Tan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Xiaofei Ma, Bing Xiang

CONCORD: Clone-aware Contrastive Learning for Source Code

Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE…

Software Engineering · Computer Science 2023-06-07 Yangruibo Ding, Saikat Chakraborty, Luca Buratti, Saurabh Pujar, Alessandro Morari, Gail Kaiser, Baishakhi Ray

Should We Still Pretrain Encoders with Masked Language Modeling?

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with…

Computation and Language · Computer Science 2026-05-06 Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo

Function Contrastive Learning of Transferable Meta-Representations

Meta-learning algorithms adapt quickly to new tasks that are drawn from the same task distribution as the training tasks. The mechanism leading to fast adaptation is the conditioning of a downstream predictive model on the inferred…

Machine Learning · Computer Science 2021-07-23 Muhammad Waleed Gondal, Shruti Joshi, Nasim Rahaman, Stefan Bauer, Manuel Wüthrich, Bernhard Schölkopf

Language models scale reliably with over-training and on downstream tasks

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are…

Computation and Language · Computer Science 2024-06-18 Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction

Dense retrieval has shown promising results in many information retrieval (IR) related tasks, whose foundation is high-quality text representation learning for effective search. Some recent studies have shown that autoencoder-based language…

Information Retrieval · Computer Science 2022-04-25 Xinyu Ma, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Xueqi Cheng

CoCoSoDa: Effective Contrastive Learning for Code Search

Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and greatly improved…

Software Engineering · Computer Science 2023-02-14 Ensheng Shi, Yanlin Wang, Wenchao Gu, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, Hongbin Sun

Bi-Granularity Contrastive Learning for Post-Training in Few-Shot Scene

The major paradigm of applying a pre-trained language model to downstream tasks is to fine-tune it on labeled task data, which often suffers instability and low performance when the labeled examples are scarce. One way to alleviate this…

Computation and Language · Computer Science 2021-06-07 Ruikun Luo, Guanhuan Huang, Xiaojun Quan

Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

The escalating scale and cost of Large Language Models (LLMs) training necessitate accurate pre-training prediction of downstream task performance for comprehensive understanding of scaling properties. This is challenged by: 1) the…

Computation and Language · Computer Science 2026-03-10 Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, Chenggang Li

Curriculum Learning for Small Code Language Models

Code language models have emerged as useful tools for various programming tasks, yet they often struggle when it comes to complex ones. In this paper, we explore the potential of curriculum learning in enhancing the performance of these…

Machine Learning · Computer Science 2024-07-16 Marwa Naïr, Kamel Yamani, Lynda Said Lhadj, Riyadh Baghdadi

Improving Code Search with Hard Negative Sampling Based on Fine-tuning

Pre-trained code models have emerged as the state-of-the-art paradigm for code search tasks. The paradigm involves pre-training the model on search-irrelevant tasks such as masked language modeling, followed by the fine-tuning stage, which…

Software Engineering · Computer Science 2024-11-25 Hande Dong, Jiayi Lin, Yanlin Wang, Yichong Leng, Jiawei Chen, Yutao Xie

Contrastive Difference Predictive Coding

Predicting and reasoning about the future lie at the heart of many time-series questions. For example, goal-conditioned reinforcement learning can be viewed as learning representations to predict which states are likely to be visited in the…

Machine Learning · Computer Science 2025-10-10 Chongyi Zheng, Ruslan Salakhutdinov, Benjamin Eysenbach

Contrastive Code Representation Learning

Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program…

Machine Learning · Computer Science 2022-01-10 Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez, Ion Stoica