Related papers: Developing Open Data Models for Linguistic Field D…

A Query Language for Multi-version Data Web Archives

The Data Web refers to the vast and rapidly increasing quantity of scientific, corporate, government and crowd-sourced data published in the form of Linked Open Data, which encourages the uniform representation of heterogeneous data items…

Databases · Computer Science 2016-05-13 Marios Meimaris, George Papastefanatos, Stratis Viglas, Yannis Stavrakas, Christos Pateritsas, Ioannis Anagnostopoulos

The Dawn of Open Access to Phylogenetic Data

The scientific enterprise depends critically on the preservation of and open access to published data. This basic tenet applies acutely to phylogenies (estimates of evolutionary relationships among species). Increasingly, phylogenies are…

Populations and Evolution · Quantitative Biology 2017-02-08 Andrew F. Magee, Michael R. May, Brian R. Moore

Qualitative Coding Analysis through Open-Source Large Language Models: A User Study and Design Recommendations

Qualitative data analysis is labor-intensive, yet the privacy risks associated with commercial Large Language Models (LLMs) often preclude their use in sensitive research. To address this, we introduce ChatQDA, an on-device framework…

Human-Computer Interaction · Computer Science 2026-02-23 Tung T. Ngo, Dai Nguyen Van, Anh-Minh Nguyen, Phuong-Anh Do, Anh Nguyen-Quoc

How Vulnerable Are Edge LLMs?

Large language models (LLMs) are increasingly deployed on edge devices under strict computation and quantization constraints, yet their security implications remain unclear. We study query-based knowledge extraction from quantized…

Cryptography and Security · Computer Science 2026-03-26 Ao Ding, Hongzong Li, Zi Liang, Zhanpeng Shi, Shuxin Zhuang, Shiqin Tang, Rong Feng, Ping Lu

Can a Neural Model Guide Fieldwork? A Case Study on Morphological Data Collection

Linguistic fieldwork is an important component in language documentation and preservation. However, it is a long, exhaustive, and time-consuming process. This paper presents a novel model that guides a linguist during the fieldwork and…

Computation and Language · Computer Science 2024-12-17 Aso Mahmudi, Borja Herce, Demian Inostroza Amestica, Andreas Scherbakov, Eduard Hovy, Ekaterina Vylomova

Socially Responsible Data for Large Multilingual Language Models

Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years, but their training data is largely English text. There is growing interest in multilingual LLMs, and various efforts are striving…

Computation and Language · Computer Science 2024-09-10 Andrew Smart, Ben Hutchinson, Lameck Mbangula Amugongo, Suzanne Dikker, Alex Zito, Amber Ebinama, Zara Wudiri, Ding Wang, Erin van Liemt, João Sedoc, Seyi Olojo, Stanley Uwakwe, Edem Wornyo, Sonja Schmer-Galunder, Jamila Smith-Loud

Uncertainty Quantification of Large Language Models through Multi-Dimensional Responses

Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks due to large training datasets and powerful transformer architecture. However, the reliability of responses from LLMs remains a question.…

Computation and Language · Computer Science 2025-02-26 Tiejin Chen, Xiaoou Liu, Longchao Da, Jia Chen, Vagelis Papalexakis, Hua Wei

Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research

Low-resource languages serve as invaluable repositories of human history, embodying cultural evolution and intellectual diversity. Despite their significance, these languages face critical challenges, including data scarcity and…

Computation and Language · Computer Science 2026-04-20 Tianyang Zhong, Zhenyuan Yang, Zhengliang Liu, Ruidong Zhang, Weihang You, Yiheng Liu, Haiyang Sun, Yi Pan, Yiwei Li, Yifan Zhou, Hanqi Jiang, Junhao Chen, Xiang Li, Tianming Liu

Conversational AI-Enhanced Exploration System to Query Large-Scale Digitised Collections of Natural History Museums

Recent digitisation efforts in natural history museums have produced large volumes of collection data, yet their scale and scientific complexity often hinder public access and understanding. Conventional data management tools, such as…

Human-Computer Interaction · Computer Science 2026-03-12 Yiyuan Wang, Andrew Johnston, Zoë Sadokierski, Rhiannon Stephens, Shane T. Ahyong

UQE: A Query Engine for Unstructured Databases

Analytics on structured data is a mature field with many successful methods. However, most real world data exists in unstructured form, such as images and conversations. We investigate the potential of Large Language Models (LLMs) to enable…

Databases · Computer Science 2024-11-19 Hanjun Dai, Bethany Yixin Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Phitchaya Mangpo Phothilimthana, Charles Sutton, Dale Schuurmans

Question Answering Survey: Directions, Challenges, Datasets, Evaluation Matrices

The usage and amount of information available on the internet increase over the past decade. This digitization leads to the need for automated answering system to extract fruitful information from redundant and transitional knowledge…

Computation and Language · Computer Science 2022-02-03 Hariom A. Pandya, Brijesh S. Bhatt

Augmented Datasheets for Speech Datasets and Ethical Decision-Making

Speech datasets are crucial for training Speech Language Technologies (SLT); however, the lack of diversity of the underlying training data can lead to serious limitations in building equitable and robust SLT products, especially along…

Computers and Society · Computer Science 2023-05-09 Orestis Papakyriakopoulos, Anna Seo Gyeong Choi, Jerone Andrews, Rebecca Bourke, William Thong, Dora Zhao, Alice Xiang, Allison Koenecke

QQ: A Toolkit for Language Identifiers and Metadata

The growing number of languages considered in multilingual NLP, including new datasets and tasks, poses challenges regarding properly and accurately reporting which languages are used and how. For example, datasets often use different…

Computation and Language · Computer Science 2026-03-03 Wessel Poelman, Yiyi Chen, Miryam de Lhoneux

Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for Underdocumented Languages

Recent progress in NLP is driven by pretrained models leveraging massive datasets and has predominantly benefited the world's political and economic superpowers. Technologically underserved languages are left behind because they lack such…

Computation and Language · Computer Science 2022-03-21 Clarissa Forbes, Farhan Samir, Bruce Harold Oliver, Changbing Yang, Edith Coates, Garrett Nicolai, Miikka Silfverberg

Universal Language Modelling agent

Large Language Models are designed to understand complex Human Language. Yet, Understanding of animal language has long intrigued researchers striving to bridge the communication gap between humans and other species. This research paper…

Computation and Language · Computer Science 2023-06-13 Anees Aslam

PRIV-QA: Privacy-Preserving Question Answering for Cloud Large Language Models

The rapid development of large language models (LLMs) is redefining the landscape of human-computer interaction, and their integration into various user-service applications is becoming increasingly prevalent. However, transmitting user…

Computation and Language · Computer Science 2025-02-20 Guangwei Li, Yuansen Zhang, Yinggui Wang, Shoumeng Yan, Lei Wang, Tao Wei

An open access NLP dataset for Arabic dialects : Data collection, labeling, and model construction

Natural Language Processing (NLP) is today a very active field of research and innovation. Many applications need however big sets of data for supervised learning, suitably labelled for the training purpose. This includes applications for…

Computation and Language · Computer Science 2021-02-23 ElMehdi Boujou, Hamza Chataoui, Abdellah El Mekki, Saad Benjelloun, Ikram Chairi, Ismail Berrada

A Formal Framework for Linguistic Annotation (revised version)

'Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions - audio, video and/or physiological recordings - or it may be textual. The added…

Computation and Language · Computer Science 2007-05-23 Steven Bird, Mark Liberman

A Formal Framework for Linguistic Annotation

'Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions -- audio, video and/or physiological recordings -- or it may be textual. The added…

Computation and Language · Computer Science 2007-05-23 Steven Bird, Mark Liberman

UQA: Corpus for Urdu Question Answering

This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset…

Computation and Language · Computer Science 2024-07-24 Samee Arif, Sualeha Farid, Awais Athar, Agha Ali Raza