Related papers: Language Agnostic Code Embeddings

The Struggles of LLMs in Cross-lingual Code Clone Detection

With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing…

Software Engineering · Computer Science 2025-05-07 Micheline Bénédicte Moumoula , Abdoul Kader Kabore , Jacques Klein , Tegawendé Bissyande

Discovering Low-rank Subspaces for Language-agnostic Multilingual Representations

Large pretrained multilingual language models (ML-LMs) have shown remarkable capabilities of zero-shot cross-lingual transfer, without direct cross-lingual supervision. While these results are promising, follow-up works found that, within…

Computation and Language · Computer Science 2024-01-12 Zhihui Xie , Handong Zhao , Tong Yu , Shuai Li

Language-Agnostic Visual Embeddings for Cross-Script Handwriting Retrieval

Handwritten word retrieval is vital for digital archives but remains challenging due to large handwriting variability and cross-lingual semantic gaps. While large vision-language models offer potential solutions, their prohibitive…

Computer Vision and Pattern Recognition · Computer Science 2026-01-19 Fangke Chen , Tianhao Dong , Sirry Chen , Guobin Zhang , Yishu Zhang , Yining Chen

Are Multilingual Models Effective in Code-Switching?

Multilingual language models have shown decent performance in multilingual and cross-lingual natural language understanding tasks. However, the power of these multilingual models in code-switching tasks has not been fully explored. In this…

Computation and Language · Computer Science 2021-03-25 Genta Indra Winata , Samuel Cahyawijaya , Zihan Liu , Zhaojiang Lin , Andrea Madotto , Pascale Fung

An Effective Approach to Embedding Source Code by Combining Large Language and Sentence Embedding Models

The advent of large language models (LLMs) has significantly advanced artificial intelligence (AI) in software engineering (SE), with source code embeddings playing a crucial role in tasks such as source code clone detection and source code…

Software Engineering · Computer Science 2025-06-04 Zixiang Xian , Chenhui Cui , Rubing Huang , Chunrong Fang , Zhenyu Chen

A Literature Study of Embeddings on Source Code

Natural language processing has improved tremendously after the success of word embedding techniques such as word2vec. Recently, the same idea has been applied on source code with encouraging results. In this survey, we aim to collect and…

Machine Learning · Computer Science 2019-04-08 Zimin Chen , Martin Monperrus

On the Robustness of Unsupervised and Semi-supervised Cross-lingual Word Embedding Learning

Cross-lingual word embeddings are vector representations of words in different languages where words with similar meaning are represented by similar vectors, regardless of the language. Recent developments which construct these embeddings…

Computation and Language · Computer Science 2020-03-04 Yerai Doval , Jose Camacho-Collados , Luis Espinosa-Anke , Steven Schockaert

Exploring Alignment in Shared Cross-lingual Spaces

Despite their remarkable ability to capture linguistic nuances across diverse languages, questions persist regarding the degree of alignment between languages in multilingual embeddings. Drawing inspiration from research on high-dimensional…

Computation and Language · Computer Science 2024-05-24 Basel Mousi , Nadir Durrani , Fahim Dalvi , Majd Hawasly , Ahmed Abdelali

Beyond Language Boundaries: Uncovering Programming Language Families for Code Language Models

The rapid proliferation of diverse programming languages presents both opportunities and challenges for developing multilingual code LLMs. While existing techniques often train code LLMs by simply aggregating multilingual code data, few…

Software Engineering · Computer Science 2025-12-23 Shangbo Yun , Xiaodong Gu , Jianghong Huang , Beijun Shen

Unsupervised Cross-lingual Word Embedding by Multilingual Neural Language Models

We propose an unsupervised method to obtain cross-lingual embeddings without any parallel data or pre-trained word embeddings. The proposed model, which we call multilingual neural language models, takes sentences of multiple languages as…

Computation and Language · Computer Science 2018-09-10 Takashi Wada , Tomoharu Iwata

Inducing Language-Agnostic Multilingual Representations

Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world. However, they currently require large pretraining corpora or access to typologically similar languages. In…

Computation and Language · Computer Science 2021-06-22 Wei Zhao , Steffen Eger , Johannes Bjerva , Isabelle Augenstein

Hierarchical Meta-Embeddings for Code-Switching Named Entity Recognition

In countries that speak multiple main languages, mixing up different languages within a conversation is commonly called code-switching. Previous works addressing this challenge mainly focused on word-level aspects such as word embeddings.…

Computation and Language · Computer Science 2019-09-19 Genta Indra Winata , Zhaojiang Lin , Jamin Shin , Zihan Liu , Pascale Fung

A Simple Approach to Learning Unsupervised Multilingual Embeddings

Recent progress on unsupervised learning of cross-lingual embeddings in bilingual setting has given impetus to learning a shared embedding space for several languages without any supervision. A popular framework to solve the latter problem…

Computation and Language · Computer Science 2020-04-21 Pratik Jawanpuria , Mayank Meghwanshi , Bamdev Mishra

Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment

Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of…

Machine Learning · Computer Science 2026-03-24 Aleksander Boruch-Gruszecki , Yangtian Zi , Zixuan Wu , Tejas Oberoi , Carolyn Jane Anderson , Joydeep Biswas , Arjun Guha

Coding Triangle: How Does Large Language Model Understand Code?

Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three…

Computation and Language · Computer Science 2025-07-09 Taolin Zhang , Zihan Ma , Maosong Cao , Junnan Liu , Songyang Zhang , Kai Chen

Functional Consistency of LLM Code Embeddings: A Self-Evolving Data Synthesis Framework for Benchmarking

Embedding models have demonstrated strong performance in tasks like clustering, retrieval, and feature extraction while offering computational advantages over generative models and cross-encoders. Benchmarks such as MTEB have shown that…

Software Engineering · Computer Science 2025-08-28 Zhuohao Li , Wenqing Chen , Jianxing Yu , Zhichao Lu

Cross-lingual Models of Word Embeddings: An Empirical Comparison

Despite interest in using cross-lingual knowledge to learn word embeddings for various tasks, a systematic comparison of the possible approaches is lacking in the literature. We perform an extensive evaluation of four popular approaches of…

Computation and Language · Computer Science 2016-06-09 Shyam Upadhyay , Manaal Faruqui , Chris Dyer , Dan Roth

BERT2Code: Can Pretrained Language Models be Leveraged for Code Search?

Millions of repetitive code snippets are submitted to code repositories every day. To search from these large codebases using simple natural language queries would allow programmers to ideate, prototype, and develop easier and faster.…

Software Engineering · Computer Science 2021-04-19 Abdullah Al Ishtiaq , Masum Hasan , Md. Mahim Anjum Haque , Kazi Sajeed Mehrab , Tanveer Muttaqueen , Tahmid Hasan , Anindya Iqbal , Rifat Shahriyar

Meemi: A Simple Method for Post-processing and Integrating Cross-lingual Word Embeddings

Word embeddings have become a standard resource in the toolset of any Natural Language Processing practitioner. While monolingual word embeddings encode information about words in the context of a particular language, cross-lingual…

Computation and Language · Computer Science 2020-11-12 Yerai Doval , Jose Camacho-Collados , Luis Espinosa-Anke , Steven Schockaert

Multilingual acoustic word embedding models for processing zero-resource languages

Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. In settings where unlabelled speech is the only available resource, such embeddings can be used in "zero-resource" speech search, indexing…

Computation and Language · Computer Science 2020-02-24 Herman Kamper , Yevgen Matusevych , Sharon Goldwater