Related papers: Dataset Generation Patterns for Evaluating Knowled…

Interactively Constructing Knowledge Graphs from Messy User-Generated Spreadsheets

When spreadsheets are filled freely by knowledge workers, they can contain rather unstructured content. For humans and especially machines it becomes difficult to interpret such data properly. Therefore, spreadsheets are often converted to…

Databases · Computer Science · 2021-03-08 · Markus Schröder, Christian Jilek, Michael Schulze, Andreas Dengel

Using Large Language Models to Generate Authentic Multi-agent Knowledge Work Datasets

Current publicly available knowledge work data collections lack diversity, extensive annotations, and contextual information about the users and their documents. These issues hinder objective and comparable data-driven evaluations and…

Artificial Intelligence · Computer Science · 2024-10-25 · Desiree Heim, Christian Jilek, Adrian Ulges, Andreas Dengel

Synthetic Dataset Generation with Itemset-Based Generative Models

This paper proposes three different data generators, tailored to transactional datasets, based on existing itemset-based generative models. All these generators are intuitive and easy to implement and show satisfactory performance. The…

Databases · Computer Science · 2020-07-15 · Christian Lezcano, Marta Arias

A Framework for Large Scale Synthetic Graph Dataset Generation

Recently there has been increasing interest in developing and deploying deep graph learning algorithms for many tasks, such as fraud detection and recommender systems. Albeit, there is a limited number of publicly available graph-structured…

Machine Learning · Computer Science · 2023-10-06 · Sajad Darabi, Piotr Bigaj, Dawid Majchrowski, Artur Kasymov, Pawel Morkisz, Alex Fit-Florea

Towards More Usable Dataset Search: From Query Characterization to Snippet Generation

Reusing published datasets on the Web is of great interest to researchers and developers. Their data needs may be met by submitting queries to a dataset search engine to retrieve relevant datasets. In this ongoing work towards developing a…

Information Retrieval · Computer Science · 2019-08-30 · Jinchi Chen, Xiaxia Wang, Gong Cheng, Evgeny Kharlamov, Yuzhong Qu

A Framework for Evaluating Snippet Generation for Dataset Search

Reusing existing datasets is of considerable significance to researchers and developers. Dataset search engines help a user find relevant datasets for reuse. They can present a snippet for each retrieved dataset to explain its relevance to…

Information Retrieval · Computer Science · 2019-07-03 · Xiaxia Wang, Jinchi Chen, Shuxin Li, Gong Cheng, Jeff Z. Pan, Evgeny Kharlamov, Yuzhong Qu

Unveiling Synthetic Faces: How Synthetic Datasets Can Expose Real Identities

Synthetic data generation is gaining increasing popularity in different computer vision applications. Existing state-of-the-art face recognition models are trained using large-scale face datasets, which are crawled from the Internet and…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-01 · Hatef Otroshi Shahreza, Sébastien Marcel

Generating Privacy-Preserving Process Data with Deep Generative Models

Process data with confidential information cannot be shared directly in public, which hinders the research in process data mining and analytics. Data encryption methods have been studied to protect the data, but they still may be decrypted,…

Machine Learning · Computer Science · 2022-03-16 · Keyi Li, Sen Yang, Travis M. Sullivan, Randall S. Burd, Ivan Marsic

ENT-DESC: Entity Description Generation by Exploring Knowledge Graph

Previous works on knowledge-to-text generation take as input a few RDF triples or key-value pairs conveying the knowledge of some entities to generate a natural language description. Existing datasets, such as WIKIBIO, WebNLG, and E2E,…

Computation and Language · Computer Science · 2020-10-27 · Liying Cheng, Dekun Wu, Lidong Bing, Yan Zhang, Zhanming Jie, Wei Lu, Luo Si

Synthetic Data Generation for Economists

As more tech companies engage in rigorous economic analyses, we are confronted with a data problem: in-house papers cannot be replicated due to use of sensitive, proprietary, or private data. Readers are left to assume that the obscured…

General Economics · Economics · 2020-11-10 · Allison Koenecke, Hal Varian

Bias Reduction via Cooperative Bargaining in Synthetic Graph Dataset Generation

In general, to draw robust conclusions from a dataset, all the analyzed population must be represented on said dataset. Having a dataset that does not fulfill this condition normally leads to selection bias. Additionally, graphs have been…

Machine Learning · Computer Science · 2022-05-30 · Axel Wassington, Sergi Abadal

Towards a property graph generator for benchmarking

The use of synthetic graph generators is a common practice among graph-oriented benchmark designers, as it allows obtaining graphs with the required scale and characteristics. However, finding a graph generator that accurately fits the…

Databases · Computer Science · 2017-04-04 · Arnau Prat-Pérez, Joan Guisado-Gámez, Xavier Fernández Salas, Petr Koupy, Siegfried Depner, Davide Basilio Bartolini

Towards Exploiting Background Knowledge for Building Conversation Systems

Existing dialog datasets contain a sequence of utterances and responses without any explicit background knowledge associated with them. This has resulted in the development of models which treat conversation as a sequence-to-sequence…

Computation and Language · Computer Science · 2018-09-24 · Nikita Moghe, Siddhartha Arora, Suman Banerjee, Mitesh M. Khapra

Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems

Synthetic datasets are important for evaluating and testing machine learning models. When evaluating real-life recommender systems, high-dimensional categorical (and sparse) datasets are often considered. Unfortunately, there are not many…

Information Retrieval · Computer Science · 2024-12-11 · Miha Malenšek, Blaž Škrlj, Blaž Mramor, Jure Demšar

Control+Shift: Generating Controllable Distribution Shifts

We propose a new method for generating realistic datasets with distribution shifts using any decoder-based generative model. Our approach systematically creates datasets with varying intensities of distribution shifts, facilitating a…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-13 · Roy Friedman, Rhea Chowers

Synthesizing Diverse Network Flow Datasets with Scalable Dynamic Multigraph Generation

Obtaining real-world network datasets is often challenging because of privacy, security, and computational constraints. In the absence of such datasets, graph generative models become essential tools for creating synthetic datasets. In this…

Machine Learning · Computer Science · 2025-05-13 · Arya Grayeli, Vipin Swarup, Steven E. Noel

Constrained Diffusion Models for Synthesizing Representative Power Flow Datasets

High-quality power flow datasets are essential for training machine learning models in power systems. However, security and privacy concerns restrict access to real-world data, making statistically accurate and physically consistent…

Machine Learning · Computer Science · 2025-08-26 · Milad Hoseinpour, Vladimir Dvorkin

Guided Graph Generation: Evaluation of Graph Generators in Terms of Network Statistics, and a New Algorithm

We consider the problem of graph generation guided by network statistics, i.e., the generation of graphs which have given values of various numerical measures that characterize networks, such as the clustering coefficient and the number of…

Social and Information Networks · Computer Science · 2023-03-02 · Jérôme Kunegis, Jun Sun, Eiko Yoneki

Generated Graph Detection

Graph generative models become increasingly effective for data distribution approximation and data augmentation. While they have aroused public concerns about their malicious misuses or misinformation broadcasts, just as what Deepfake…

Cryptography and Security · Computer Science · 2023-06-14 · Yihan Ma, Zhikun Zhang, Ning Yu, Xinlei He, Michael Backes, Yun Shen, Yang Zhang

Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators

The increasing demand for high-quality datasets in machine learning has raised concerns about the ethical and responsible creation of these datasets. Dataset creators play a crucial role in developing responsible practices, yet their…

Machine Learning · Computer Science · 2024-09-04 · Will Orr, Kate Crawford