Related papers: Geographically-Informed Language Identification

200 papers

Language models now constitute essential tools for improving efficiency for many professional tasks such as writing, coding, or learning. For this reason, it is imperative to identify inherent biases. In the field of Natural Language…

Computation and Language · Computer Science 2025-10-29 Rémy Decoupes , Roberto Interdonato , Mathieu Roche , Maguelonne Teisseire , Sarah Valentin

Text classifiers are applied at scale in the form of one-size-fits-all solutions. Nevertheless, many studies show that classifiers are biased regarding different languages and dialects. When measuring and discovering these biases, some gaps…

Computation and Language · Computer Science 2022-09-16 Brandon Lwowski , Paul Rad , Anthony Rios

This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable…

Computation and Language · Computer Science 2021-04-06 Jonathan Dunn

Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context.…

Computation and Language · Computer Science 2020-10-30 Isaac Caswell , Theresa Breiner , Daan van Esch , Ankur Bapna

This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used…

Computation and Language · Computer Science 2020-04-03 Jonathan Dunn

Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we…

Computation and Language · Computer Science 2026-04-22 Guy Mor-Lan , Omer Goldman , Matan Eyal , Adi Mayrav Gilady , Sivan Eiger , Idan Szpektor , Avinatan Hassidim , Yossi Matias , Reut Tsarfaty

Large language models are typically trained by treating text as a single global distribution, often resulting in geographically homogenized behavior. We study metadata conditioning as a lightweight approach for localization, pre-training 31…

Computation and Language · Computer Science 2026-01-22 Anjishnu Mukherjee , Ziwei Zhu , Antonios Anastasopoulos

Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most…

Computation and Language · Computer Science 2026-02-10 Rasul Dent , Pedro Ortiz Suarez , Thibault Clérice , Benoît Sagot

Pretrained language models (PLMs) often fail to fairly represent target users from certain world regions because of the under-representation of those regions in training datasets. With recent PLMs trained on enormous data sources,…

Computation and Language · Computer Science 2022-12-21 Fahim Faisal , Antonios Anastasopoulos

Generative AI based on foundation models provides a first glimpse into the world represented by machines trained on vast amounts of multimodal data ingested by these models during training. If we consider the resulting models as knowledge…

Computers and Society · Computer Science 2024-04-12 Zilong Liu , Krzysztof Janowicz , Kitty Currier , Meilin Shi

Providing better language tools for low-resource and endangered languages is imperative for equitable growth. Recent progress with massively multilingual pretrained models has proven surprisingly effective at performing zero-shot transfer…

Computation and Language · Computer Science 2022-11-10 Louis Clouâtre , Prasanna Parthasarathi , Amal Zouaq , Sarath Chandar

While a large body of work inspects language models for biases concerning gender, race, occupation and religion, biases of geographical nature are relatively less explored. Some recent studies benchmark the degree to which large language…

Computation and Language · Computer Science 2025-02-19 Kirti Bhagat , Kinshuk Vasisht , Danish Pruthi

With a fast developing pace of geographic applications, automatable and intelligent models are essential to be designed to handle the large volume of information. However, few researchers focus on geographic natural language processing, and…

Computation and Language · Computer Science 2023-05-12 Dongyang Li , Ruixue Ding , Qiang Zhang , Zheng Li , Boli Chen , Pengjun Xie , Yao Xu , Xin Li , Ning Guo , Fei Huang , Xiaofeng He

Language Identification (LI) is an important first step in several speech processing systems. With a growing number of voice-based assistants, speech LI has emerged as a widely researched field. To approach the problem of identifying…

Computation and Language · Computer Science 2019-10-11 Sarthak , Shikhar Shukla , Govind Mittal

While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on…

Computation and Language · Computer Science 2024-01-30 Valentin Hofmann , Goran Glavaš , Nikola Ljubešić , Janet B. Pierrehumbert , Hinrich Schütze

This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of…

Computation and Language · Computer Science 2022-06-10 Jonathan Dunn , Wikke Nijhof

Statistical language models conventionally implement representation learning based on the contextual distribution of words or other formal units, whereas any information related to the logographic features of written text are often ignored,…

Computation and Language · Computer Science 2022-11-07 Zijian Jin , Duygu Ataman

Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language…

Computation and Language · Computer Science 2026-01-27 Pedro Ortiz Suarez , Laurie Burchell , Catherine Arnett , Rafael Mosquera-Gómez , Sara Hincapie-Monsalve , Thom Vaughan , Damian Stewart , Malte Ostendorff , Idris Abdulmumin , Vukosi Marivate , Shamsuddeen Hassan Muhammad , Atnafu Lambebo Tonja , Hend Al-Khalifa , Nadia Ghezaiel Hammouda , Verrah Otiende , Tack Hwa Wong , Jakhongir Saydaliev , Melika Nobakhtian , Muhammad Ravi Shulthan Habibi , Chalamalasetti Kranti , Carol Muchemi , Khang Nguyen , Faisal Muhammad Adam , Luis Frentzen Salim , Reem Alqifari , Cynthia Amol , Joseph Marvin Imperial , Ilker Kesen , Ahmad Mustafid , Pavel Stepachev , Leshem Choshen , David Anugraha , Hamada Nayel , Seid Muhie Yimam , Vallerie Alexandra Putra , My Chiffon Nguyen , Azmine Toushik Wasi , Gouthami Vadithya , Rob van der Goot , Lanwenn ar C'horr , Karan Dua , Andrew Yates , Mithil Bangera , Yeshil Bangera , Hitesh Laxmichand Patel , Shu Okabe , Fenal Ashokbhai Ilasariya , Dmitry Gaynullin , Genta Indra Winata , Yiyuan Li , Juan Pablo Martínez , Amit Agarwal , Ikhlasul Akmal Hanif , Raia Abu Ahmad , Esther Adenuga , Filbert Aurelian Tjiaranata , Weerayut Buaphet , Michael Anugraha , Sowmya Vajjala , Benjamin Rice , Azril Hafizi Amirudin , Jesujoba O. Alabi , Srikant Panda , Yassine Toughrai , Bruhan Kyomuhendo , Daniel Ruffinelli , Akshata A , Manuel Goulão , Ej Zhou , Ingrid Gabriela Franco Ramirez , Cristina Aggazzotti , Konstantin Dobler , Jun Kevin , Quentin Pagès , Nicholas Andrews , Nuhu Ibrahim , Mattes Ruckdeschel , Amr Keleg , Mike Zhang , Casper Muziri , Saron Samuel , Sotaro Takeshita , Kun Kerdthaisong , Luca Foppiano , Rasul Dent , Tommaso Green , Ahmad Mustapha Wali , Kamohelo Makaaka , Vicky Feliren , Inshirah Idris , Hande Celikkanat , Abdulhamid Abubakar , Jean Maillard , Benoît Sagot , Thibault Clérice , Kenton Murray , Sarah Luger

Large language models (LLMs) encode vast amounts of world knowledge. However, since these models are trained on large swaths of internet data, they are at risk of inordinately capturing information about dominant groups. This imbalance can…

Computation and Language · Computer Science 2023-10-24 Pola Schwöbel , Jacek Golebiowski , Michele Donini , Cédric Archambeau , Danish Pruthi