Related papers: Geographically-Informed Language Identification

Evaluation of Geographical Distortions in Language Models

Language models now constitute essential tools for improving efficiency for many professional tasks such as writing, coding, or learning. For this reason, it is imperative to identify inherent biases. In the field of Natural Language…

Computation and Language · Computer Science 2025-10-29 Rémy Decoupes, Roberto Interdonato, Mathieu Roche, Maguelonne Teisseire, Sarah Valentin

Measuring Geographic Performance Disparities of Offensive Language Classifiers

Text classifiers are applied at scale in the form of one-size-fits-all solutions. Nevertheless, many studies show that classifiers are biased regarding different languages and dialects. When measuring and discovering these biases, some gaps…

Computation and Language · Computer Science 2022-09-16 Brandon Lwowski, Paul Rad, Anthony Rios

Representations of Language Varieties Are Reliable Given Corpus Similarity Measures

This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable…

Computation and Language · Computer Science 2021-04-06 Jonathan Dunn

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context.…

Computation and Language · Computer Science 2020-10-30 Isaac Caswell, Theresa Breiner, Daan van Esch, Ankur Bapna

Mapping Languages: The Corpus of Global Language Use

This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used…

Computation and Language · Computer Science 2020-04-03 Jonathan Dunn

Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs

Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we…

Computation and Language · Computer Science 2026-04-22 Guy Mor-Lan, Omer Goldman, Matan Eyal, Adi Mayrav Gilady, Sivan Eiger, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Reut Tsarfaty

Metadata Conditioned Large Language Models for Localization

Large language models are typically trained by treating text as a single global distribution, often resulting in geographically homogenized behavior. We study metadata conditioning as a lightweight approach for localization, pre-training 31…

Computation and Language · Computer Science 2026-01-22 Anjishnu Mukherjee, Ziwei Zhu, Antonios Anastasopoulos

How Should We Model the Probability of a Language?

Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most…

Computation and Language · Computer Science 2026-02-10 Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot

Geographic and Geopolitical Biases of Language Models

Pretrained language models (PLMs) often fail to fairly represent target users from certain world regions because of the under-representation of those regions in training datasets. With recent PLMs trained on enormous data sources,…

Computation and Language · Computer Science 2022-12-21 Fahim Faisal, Antonios Anastasopoulos

Measuring Geographic Diversity of Foundation Models with a Natural Language–based Geo-guessing Experiment on GPT-4

Generative AI based on foundation models provides a first glimpse into the world represented by machines trained on vast amounts of multimodal data ingested by these models during training. If we consider the resulting models as knowledge…

Computers and Society · Computer Science 2024-04-12 Zilong Liu, Krzysztof Janowicz, Kitty Currier, Meilin Shi

Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes

Providing better language tools for low-resource and endangered languages is imperative for equitable growth. Recent progress with massively multilingual pretrained models has proven surprisingly effective at performing zero-shot transfer…

Computation and Language · Computer Science 2022-11-10 Louis Clouâtre, Prasanna Parthasarathi, Amal Zouaq, Sarath Chandar

Richer Output for Richer Countries: Uncovering Geographical Disparities in Generated Stories and Travel Recommendations

While a large body of work inspects language models for biases concerning gender, race, occupation and religion, biases of geographical nature are relatively less explored. Some recent studies benchmark the degree to which large language…

Computation and Language · Computer Science 2025-02-19 Kirti Bhagat, Kinshuk Vasisht, Danish Pruthi

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

The performance differential of large language models (LLM) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the…

Computation and Language · Computer Science 2024-12-02 Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Islam, Perttu Isotalo, Maral Jabbarishiviari, Börje F. Karlsson, Eldar Khalilov, Christopher Klamm, Fajri Koto, Dominik Krzemiński, Gabriel Adriano de Melo, Syrielle Montariol, Yiyang Nan, Joel Niklaus, Jekaterina Novikova, Johan Samir Obando Ceron, Debjit Paul, Esther Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell, Roshan Santhosh, Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Soltani Moakhar, Ran Tamir, Ayush Kumar Tarun, Azmine Toushik Wasi, Thenuka Ovin Weerasinghe, Serhan Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, Antoine Bosselut

GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark

With a fast developing pace of geographic applications, automatable and intelligent models are essential to be designed to handle the large volume of information. However, few researchers focus on geographic natural language processing, and…

Computation and Language · Computer Science 2023-05-12 Dongyang Li, Ruixue Ding, Qiang Zhang, Zheng Li, Boli Chen, Pengjun Xie, Yao Xu, Xin Li, Ning Guo, Fei Huang, Xiaofeng He

Spoken Language Identification using ConvNets

Language Identification (LI) is an important first step in several speech processing systems. With a growing number of voice-based assistants, speech LI has emerged as a widely researched field. To approach the problem of identifying…

Computation and Language · Computer Science 2019-10-11 Sarthak, Shikhar Shukla, Govind Mittal

Geographic Adaptation of Pretrained Language Models

While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on…

Computation and Language · Computer Science 2024-01-30 Valentin Hofmann, Goran Glavaš, Nikola Ljubešić, Janet B. Pierrehumbert, Hinrich Schütze

Language Identification for Austronesian Languages

This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of…

Computation and Language · Computer Science 2022-06-10 Jonathan Dunn, Wikke Nijhof

Logographic Information Aids Learning Better Representations for Natural Language Inference

Statistical language models conventionally implement representation learning based on the contextual distribution of words or other formal units, whereas any information related to the logographic features of written text are often ignored,…

Computation and Language · Computer Science 2022-11-07 Zijian Jin, Duygu Ataman

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language…

Computation and Language · Computer Science 2026-01-27 Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, Rafael Mosquera-Gómez, Sara Hincapie-Monsalve, Thom Vaughan, Damian Stewart, Malte Ostendorff, Idris Abdulmumin, Vukosi Marivate, Shamsuddeen Hassan Muhammad, Atnafu Lambebo Tonja, Hend Al-Khalifa, Nadia Ghezaiel Hammouda, Verrah Otiende, Tack Hwa Wong, Jakhongir Saydaliev, Melika Nobakhtian, Muhammad Ravi Shulthan Habibi, Chalamalasetti Kranti, Carol Muchemi, Khang Nguyen, Faisal Muhammad Adam, Luis Frentzen Salim, Reem Alqifari, Cynthia Amol, Joseph Marvin Imperial, Ilker Kesen, Ahmad Mustafid, Pavel Stepachev, Leshem Choshen, David Anugraha, Hamada Nayel, Seid Muhie Yimam, Vallerie Alexandra Putra, My Chiffon Nguyen, Azmine Toushik Wasi, Gouthami Vadithya, Rob van der Goot, Lanwenn ar C'horr, Karan Dua, Andrew Yates, Mithil Bangera, Yeshil Bangera, Hitesh Laxmichand Patel, Shu Okabe, Fenal Ashokbhai Ilasariya, Dmitry Gaynullin, Genta Indra Winata, Yiyuan Li, Juan Pablo Martínez, Amit Agarwal, Ikhlasul Akmal Hanif, Raia Abu Ahmad, Esther Adenuga, Filbert Aurelian Tjiaranata, Weerayut Buaphet, Michael Anugraha, Sowmya Vajjala, Benjamin Rice, Azril Hafizi Amirudin, Jesujoba O. Alabi, Srikant Panda, Yassine Toughrai, Bruhan Kyomuhendo, Daniel Ruffinelli, Akshata A, Manuel Goulão, Ej Zhou, Ingrid Gabriela Franco Ramirez, Cristina Aggazzotti, Konstantin Dobler, Jun Kevin, Quentin Pagès, Nicholas Andrews, Nuhu Ibrahim, Mattes Ruckdeschel, Amr Keleg, Mike Zhang, Casper Muziri, Saron Samuel, Sotaro Takeshita, Kun Kerdthaisong, Luca Foppiano, Rasul Dent, Tommaso Green, Ahmad Mustapha Wali, Kamohelo Makaaka, Vicky Feliren, Inshirah Idris, Hande Celikkanat, Abdulhamid Abubakar, Jean Maillard, Benoît Sagot, Thibault Clérice, Kenton Murray, Sarah Luger

Geographical Erasure in Language Generation

Large language models (LLMs) encode vast amounts of world knowledge. However, since these models are trained on large swaths of internet data, they are at risk of inordinately capturing information about dominant groups. This imbalance can…

Computation and Language · Computer Science 2023-10-24 Pola Schwöbel, Jacek Golebiowski, Michele Donini, Cédric Archambeau, Danish Pruthi