Related papers: Multilingual training for Software Engineering

Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune?

Multilingual language models often perform unevenly across different languages due to limited generalization capabilities for some languages. This issue is significant because of the growing interest in making universal language models that…

Computation and Language · Computer Science 2024-10-11 Gürkan Soykan, Gözde Gül Şahin

Mix Data or Merge Models? Optimizing for Diverse Multi-Task Learning

Large Language Models (LLMs) have been adopted and deployed worldwide for a broad variety of applications. However, ensuring their safe use remains a significant challenge. Preference training and safety measures often overfit to harms…

Computation and Language · Computer Science 2024-10-15 Aakanksha, Arash Ahmadian, Seraphina Goldfarb-Tarrant, Beyza Ermis, Marzieh Fadaee, Sara Hooker

Making Large Language Models Better Data Creators

Although large language models (LLMs) have advanced the state-of-the-art in NLP significantly, deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security. As…

Computation and Language · Computer Science 2023-11-01 Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen W. White, Sujay Kumar Jauhar

Learning code summarization from a small and local dataset

Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of…

Software Engineering · Computer Science 2022-06-03 Toufique Ahmed, Premkumar Devanbu

On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages

A recent study by Ahmed and Devanbu reported that using a corpus of code written in multilingual datasets to fine-tune multilingual Pre-trained Language Models (PLMs) achieves higher performance as opposed to using a corpus of code written…

Programming Languages · Computer Science 2022-04-21 Fuxiang Chen, Fatemeh Fard, David Lo, Timofey Bryksin

Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects

Code review is essential for maintaining software quality but often time-consuming and cognitively demanding, especially in industrial environments. Recent advancements in language models (LMs) have opened new avenues for automating core…

Software Engineering · Computer Science 2025-10-24 Igli Begolli, Meltem Aksoy, Daniel Neider

A Data Management Approach for Dataset Selection Using Human Computation

As the number of applications that use machine learning algorithms increases, the need for labeled data useful for training such algorithms intensifies. Getting labels typically involves employing humans to do the annotation, which directly…

Machine Learning · Computer Science 2013-07-16 Alexandros Ntoulas, Omar Alonso, Vasilis Kandylas

Assessing the Role of Data Quality in Training Bilingual Language Models

Bilingual and multilingual language models offer a promising path toward scaling NLP systems across diverse languages and users. However, their performance often varies wildly between languages as prior works show that adding more languages…

Computation and Language · Computer Science 2025-06-17 Skyler Seto, Maartje ter Hoeve, Maureen de Seyssel, David Grangier

Revisiting Multilingual Data Mixtures in Language Model Pretraining

The impact of different multilingual data mixtures in pretraining large language models (LLMs) has been a topic of ongoing debate, often raising concerns about potential trade-offs between language coverage and model performance (i.e., the…

Computation and Language · Computer Science 2025-10-31 Negar Foroutan, Paul Teiletche, Ayush Kumar Tarun, Antoine Bosselut

How Many Languages Make Good Multilingual Instruction Tuning? A Case Study on BLOOM

Instruction tuning a large language model with multiple languages can prepare it for multilingual downstream tasks. Nonetheless, it is yet to be determined whether having a handful of languages is sufficient, or whether the benefits…

Computation and Language · Computer Science 2024-12-10 Shaoxiong Ji, Pinzhen Chen

Less Data Less Tokens: Multilingual Unification Learning for Efficient Test-Time Reasoning in LLMs

This paper explores the challenges of test-time scaling of large language models (LLMs), regarding both the data and inference efficiency. We highlight the diversity of multi-lingual reasoning based on our pilot studies, and then introduce…

Computation and Language · Computer Science 2025-06-24 Kang Chen, Mengdi Zhang, Yixin Cao

Making the most of small Software Engineering datasets with modern machine learning

This paper provides a starting point for Software Engineering (SE) researchers and practitioners faced with the problem of training machine learning models on small datasets. Due to the high costs associated with labeling data, in Software…

Software Engineering · Computer Science 2021-06-30 Julian Aron Prenner, Romain Robbes

A Survey on Domain-Specific Languages for Machine Learning in Big Data

The amount of data generated in the modern society is increasing rapidly. New problems and novel approaches of data capture, storage, analysis and visualization are responsible for the emergence of the Big Data research field. Machine…

Software Engineering · Computer Science 2016-03-09 Ivens Portugal, Paulo Alencar, Donald Cowan

More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning

The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in…

Computation and Language · Computer Science 2025-10-09 Yike Zhao, Simin Guo, Ziqing Yang, Shifan Han, Dahua Lin, Fei Tan

CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment

Multilingual proficiency presents a significant challenge for large language models (LLMs). English-centric models are usually suboptimal in other languages, particularly those that are linguistically distant from English. This performance…

Computation and Language · Computer Science 2025-01-07 Geyu Lin, Bin Wang, Zhengyuan Liu, Nancy F. Chen

Data Cleaning and Machine Learning: A Systematic Literature Review

Context: Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing…

Machine Learning · Computer Science 2024-06-03 Pierre-Olivier Côté, Amin Nikanjam, Nafisa Ahmed, Dmytro Humeniuk, Foutse Khomh

Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging

Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding…

Computation and Language · Computer Science 2026-01-26 Alphaeus Dmonte, Vidhi Gupta, Daniel J Perry, Mark Arehart

Automatic Discrimination of Human and Neural Machine Translation in Multilingual Scenarios

We tackle the task of automatically discriminating between human and machine translations. As opposed to most previous work, we perform experiments in a multilingual setting, considering multiple languages and multilingual pretrained…

Computation and Language · Computer Science 2023-06-01 Malina Chichirau, Rik van Noord, Antonio Toral

An Effective Approach to Embedding Source Code by Combining Large Language and Sentence Embedding Models

The advent of large language models (LLMs) has significantly advanced artificial intelligence (AI) in software engineering (SE), with source code embeddings playing a crucial role in tasks such as source code clone detection and source code…

Software Engineering · Computer Science 2025-06-04 Zixiang Xian, Chenhui Cui, Rubing Huang, Chunrong Fang, Zhenyu Chen

Multilingual Hierarchical Attention Networks for Document Classification

Hierarchical attention networks have recently achieved remarkable performance for document classification in a given language. However, when multilingual document collections are considered, training such models separately for each language…

Computation and Language · Computer Science 2017-09-18 Nikolaos Pappas, Andrei Popescu-Belis