Related papers: Data Curation APIs

Augmented Understanding and Automated Adaptation of Curation Rules

Over the past years, there have been many efforts to curate and increase the added value of raw data. Data curation has been defined as the activities and processes an analyst undertakes to transform raw data into contextualized data and…

Information Retrieval · Computer Science 2020-07-20 Alireza Tabebordbar

Data Curation with Deep Learning [Vision]

Data curation - the process of discovering, integrating, and cleaning data - is one of the oldest, hardest, yet inevitable data management problems. Despite decades of efforts from both researchers and practitioners, it is still one of the…

Databases · Computer Science 2019-03-26 Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, AnHai Doan

Curating Social Media Data

Social media platforms have empowered the democratization of the pulse of people in the modern era. Due to its immense popularity and high usage, data published on social media sites (e.g., Twitter, Facebook and Tumblr) is a rich ocean of…

Social and Information Networks · Computer Science 2020-02-24 Kushal Vaghani

CurateGPT: A flexible language-model assisted biocuration tool

Effective data-driven biomedical discovery requires data curation: a time-consuming process of finding, organizing, distilling, integrating, interpreting, annotating, and validating diverse information into a structured form suitable for…

Computation and Language · Computer Science 2024-11-04 Harry Caufield, Carlo Kroll, Shawn T O'Neil, Justin T Reese, Marcin P Joachimiak, Harshad Hegde, Nomi L Harris, Madan Krishnamurthy, James A McLaughlin, Damian Smedley, Melissa A Haendel, Peter N Robinson, Christopher J Mungall

CaTE Data Curation for Trustworthy AI

This report provides practical guidance to teams designing or developing AI-enabled systems for how to promote trustworthiness during the data curation phase of development. In this report, the authors first define data, the data curation…

Machine Learning · Computer Science 2025-08-21 Mary Versa Clemens-Sewall, Christopher Cervantes, Emma Rafkin, J. Neil Otte, Tom Magelinski, Libby Lewis, Michelle Liu, Dana Udwin, Monique Kirkman-Bey

The craft and coordination of data curation: complicating "workflow" views of data science

Data curation is the process of making a dataset fit-for-use and archiveable. It is critical to data-intensive science because it makes complex data pipelines possible, makes studies reproducible, and makes data (re)usable. Yet the…

Digital Libraries · Computer Science 2022-12-20 Andrea K. Thomer, Dharma Akmon, Jeremy York, Allison R. B. Tyler, Faye Polasek, Sara Lafia, Libby Hemphill, Elizabeth Yakel

Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia

AI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How…

Human-Computer Interaction · Computer Science 2024-02-23 Tzu-Sheng Kuo, Aaron Halfaker, Zirui Cheng, Jiwoo Kim, Meng-Hsin Wu, Tongshuang Wu, Kenneth Holstein, Haiyi Zhu

APIHarvest: Harvesting API Information from Various Online Sources

Using APIs to develop software applications is the norm. APIs help developers to build applications faster as they do not need to reinvent the wheel. It is therefore important for developers to understand the APIs that they plan to use.…

Software Engineering · Computer Science 2023-04-06 Ferdian Thung, Kisub Kim, Ting Zhang, Ivana Clairine Irsan, Ratnadira Widyasari, Zhou Yang, David Lo

Harnessing Large Language Models for Curated Code Reviews

In code review, generating structured and relevant comments is crucial for identifying code issues and facilitating accurate code changes that ensure an efficient code review process. Well-crafted comments not only streamline the code…

Software Engineering · Computer Science 2025-02-06 Oussama Ben Sghaier, Martin Weyssow, Houari Sahraoui

A Framework for Capturing and Analyzing Unstructured and Semi-structured Data for a Knowledge Management System

Mainstream knowledge management researchers generally agree that knowledge extracted from unstructured and semi-structured data has become imperative for organizational strategic decision making. In this research, we develop a…

Information Retrieval · Computer Science 2020-07-15 Gerald Onwujekwe, Kweku-Muata Osei-Bryson, Nnatubemugo Ngwum

Automated Extraction and Maturity Analysis of Open Source Clinical Informatics Repositories from Scientific Literature

In the evolving landscape of clinical informatics, the integration and utilization of software tools developed through governmental funding represent a pivotal advancement in research and application. However, the dispersion of these tools…

Digital Libraries · Computer Science 2024-03-28 Jeremy R. Harper

Cross-tier web programming for curated databases: A case study

Curated databases have become important sources of information across scientific disciplines, and due to the manual work of experts, often become important reference works. Features such as provenance tracking, archiving, and data citation…

Programming Languages · Computer Science 2021-07-20 Simon Fowler, Simon D. Harding, Joanna Sharman, James Cheney

Toward a view-based data cleaning architecture

Big data analysis has become an active area of study with the growth of machine learning techniques. To properly analyze data, it is important to maintain high-quality data. Thus, research on data cleaning is also important. It is difficult…

Databases · Computer Science 2019-10-25 Toshiyuki Shimizu, Hiroki Omori, Masatoshi Yoshikawa

Making Sense of Data in the Wild: Data Analysis Automation at Scale

As the volume of publicly available data continues to grow, researchers face the challenge of limited diversity in benchmarking machine learning tasks. Although thousands of datasets are available in public repositories, the sheer abundance…

Information Retrieval · Computer Science 2025-02-25 Mara Graziani, Malina Molnar, Irina Espejo Morales, Joris Cadow-Gossweiler, Teodoro Laino

Algorithms for Efficient, Compact Online Data Stream Curation

Data stream algorithms tackle operations on high-volume sequences of read-once data items. Data stream scenarios include inherently real-time systems like sensor networks and financial markets. They also arise in purely-computational…

Data Structures and Algorithms · Computer Science 2024-03-04 Matthew Andres Moreno, Santiago Rodriguez Papa, Emily Dolson

Rethinking Abstractions for Big Data: Why, Where, How, and What

Big data refers to large and complex data sets that, under existing approaches, exceed the capacity and capability of current compute platforms, systems software, analytical tools and human understanding. Numerous lessons on the scalability…

General Literature · Computer Science 2013-06-17 Mary Hall, Robert M. Kirby, Feifei Li, Miriah Meyer, Valerio Pascucci, Jeff M. Phillips, Rob Ricci, Jacobus Van der Merwe, Suresh Venkatasubramanian

Tasks and Roles in Legal AI: Data Curation, Annotation, and Verification

The application of AI tools to the legal field feels natural: large legal document collections could be used with specialized AI to improve workflow efficiency for lawyers and ameliorate the "justice gap" for underserved clients. However,…

Computation and Language · Computer Science 2025-04-03 Allison Koenecke, Jed Stiglitz, David Mimno, Matthew Wilkens

Aggregating Content and Network Information to Curate Twitter User Lists

Twitter introduced user lists in late 2009, allowing users to be grouped according to meaningful topics or themes. Lists have since been adopted by media outlets as a means of organising content around news stories. Thus the curation of…

Social and Information Networks · Computer Science 2012-07-03 Derek Greene, Gavin Sheridan, Barry Smyth, Pádraig Cunningham

DataParasite Enables Scalable and Repurposable Online Data Curation

Many questions in computational social science rely on datasets assembled from heterogeneous online sources, a process that is often labor-intensive, costly, and difficult to reproduce. Recent advances in large language models enable…

Computation and Language · Computer Science 2026-01-07 Mengyi Sun

Improving Unstructured Data Quality via Updatable Extracted Views

Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially in textual form, inherently lacks defined semantics, which poses significant challenges for effective processing and for ensuring…

Databases · Computer Science 2025-02-26 Besat Kassaie, Frank Wm. Tompa