Related papers: Language Agnostic Code Embeddings

200 papers

With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing…

Software Engineering · Computer Science 2025-05-07 Micheline Bénédicte Moumoula, Abdoul Kader Kabore, Jacques Klein, Tegawendé Bissyande

Large pretrained multilingual language models (ML-LMs) have shown remarkable capabilities of zero-shot cross-lingual transfer, without direct cross-lingual supervision. While these results are promising, follow-up works found that, within…

Computation and Language · Computer Science 2024-01-12 Zhihui Xie, Handong Zhao, Tong Yu, Shuai Li

Handwritten word retrieval is vital for digital archives but remains challenging due to large handwriting variability and cross-lingual semantic gaps. While large vision-language models offer potential solutions, their prohibitive…

Computer Vision and Pattern Recognition · Computer Science 2026-01-19 Fangke Chen, Tianhao Dong, Sirry Chen, Guobin Zhang, Yishu Zhang, Yining Chen

Multilingual language models have shown decent performance in multilingual and cross-lingual natural language understanding tasks. However, the power of these multilingual models in code-switching tasks has not been fully explored. In this…

Computation and Language · Computer Science 2021-03-25 Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, Pascale Fung

The advent of large language models (LLMs) has significantly advanced artificial intelligence (AI) in software engineering (SE), with source code embeddings playing a crucial role in tasks such as source code clone detection and source code…

Software Engineering · Computer Science 2025-06-04 Zixiang Xian, Chenhui Cui, Rubing Huang, Chunrong Fang, Zhenyu Chen

Natural language processing has improved tremendously after the success of word embedding techniques such as word2vec. Recently, the same idea has been applied on source code with encouraging results. In this survey, we aim to collect and…

Machine Learning · Computer Science 2019-04-08 Zimin Chen, Martin Monperrus

Cross-lingual word embeddings are vector representations of words in different languages where words with similar meaning are represented by similar vectors, regardless of the language. Recent developments which construct these embeddings…

Computation and Language · Computer Science 2020-03-04 Yerai Doval, Jose Camacho-Collados, Luis Espinosa-Anke, Steven Schockaert

Despite their remarkable ability to capture linguistic nuances across diverse languages, questions persist regarding the degree of alignment between languages in multilingual embeddings. Drawing inspiration from research on high-dimensional…

Computation and Language · Computer Science 2024-05-24 Basel Mousi, Nadir Durrani, Fahim Dalvi, Majd Hawasly, Ahmed Abdelali

The rapid proliferation of diverse programming languages presents both opportunities and challenges for developing multilingual code LLMs. While existing techniques often train code LLMs by simply aggregating multilingual code data, few…

Software Engineering · Computer Science 2025-12-23 Shangbo Yun, Xiaodong Gu, Jianghong Huang, Beijun Shen

We propose an unsupervised method to obtain cross-lingual embeddings without any parallel data or pre-trained word embeddings. The proposed model, which we call multilingual neural language models, takes sentences of multiple languages as…

Computation and Language · Computer Science 2018-09-10 Takashi Wada, Tomoharu Iwata

Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world. However, they currently require large pretraining corpora or access to typologically similar languages. In…

Computation and Language · Computer Science 2021-06-22 Wei Zhao, Steffen Eger, Johannes Bjerva, Isabelle Augenstein

In countries that speak multiple main languages, mixing up different languages within a conversation is commonly called code-switching. Previous works addressing this challenge mainly focused on word-level aspects such as word embeddings.…

Computation and Language · Computer Science 2019-09-19 Genta Indra Winata, Zhaojiang Lin, Jamin Shin, Zihan Liu, Pascale Fung

Recent progress on unsupervised learning of cross-lingual embeddings in bilingual setting has given impetus to learning a shared embedding space for several languages without any supervision. A popular framework to solve the latter problem…

Computation and Language · Computer Science 2020-04-21 Pratik Jawanpuria, Mayank Meghwanshi, Bamdev Mishra

Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of…

Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three…

Computation and Language · Computer Science 2025-07-09 Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen

Embedding models have demonstrated strong performance in tasks like clustering, retrieval, and feature extraction while offering computational advantages over generative models and cross-encoders. Benchmarks such as MTEB have shown that…

Software Engineering · Computer Science 2025-08-28 Zhuohao Li, Wenqing Chen, Jianxing Yu, Zhichao Lu

Despite interest in using cross-lingual knowledge to learn word embeddings for various tasks, a systematic comparison of the possible approaches is lacking in the literature. We perform an extensive evaluation of four popular approaches of…

Computation and Language · Computer Science 2016-06-09 Shyam Upadhyay, Manaal Faruqui, Chris Dyer, Dan Roth

Millions of repetitive code snippets are submitted to code repositories every day. To search from these large codebases using simple natural language queries would allow programmers to ideate, prototype, and develop easier and faster.…

Word embeddings have become a standard resource in the toolset of any Natural Language Processing practitioner. While monolingual word embeddings encode information about words in the context of a particular language, cross-lingual…

Computation and Language · Computer Science 2020-11-12 Yerai Doval, Jose Camacho-Collados, Luis Espinosa-Anke, Steven Schockaert

Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. In settings where unlabelled speech is the only available resource, such embeddings can be used in "zero-resource" speech search, indexing…

Computation and Language · Computer Science 2020-02-24 Herman Kamper, Yevgen Matusevych, Sharon Goldwater