Related papers: Generating Multidimensional Clusters With Support …

200 papers

Cluster analysis relies on effective benchmarks for evaluating and comparing different algorithms. Simulation studies on synthetic data are popular because important features of the data sets, such as the overlap between clusters, or the…

Machine Learning · Computer Science 2025-02-19 Michael J. Zellinger, Peter Bühlmann

Network representations can help reveal the behavior of complex systems. Useful information can be derived from the network properties and invariants, such as components, clusters or cliques, as well as from their changes over time. The…

Social and Information Networks · Computer Science 2019-03-18 Luis Ramada Pereira, Rui J. Lopes, Jorge Louçã

Synthetic data generation has emerged as an invaluable solution in scenarios where real-world data collection and usage are limited by cost and scarcity. Large language models (LLMs) have demonstrated remarkable capabilities in producing…

Machine Learning · Computer Science 2025-07-22 Anh Nguyen, Sam Schafft, Nicholas Hale, John Alfaro

We provide new algorithms for two tasks relating to heterogeneous tabular datasets: clustering, and synthetic data generation. Tabular datasets typically consist of heterogeneous data types (numerical, ordinal, categorical) in columns, but…

Machine Learning · Computer Science 2024-04-22 Chandrani Kumari, Rahul Siddharthan

Synthetic data offers a scalable solution for vision-language pre-training, yet current state-of-the-art methods typically rely on scaling up a single generative backbone, which introduces generator-specific spectral biases and limits…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Leonardo Brusini, Cristian Sbrolli, Eugenio Lomurno, Toshihiko Yamasaki, Matteo Matteucci

Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data. Unfortunately, many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available…

Machine Learning · Computer Science 2020-12-09 James Jordon, Alan Wilson, Mihaela van der Schaar

Synthetic datasets are important for evaluating and testing machine learning models. When evaluating real-life recommender systems, high-dimensional categorical (and sparse) datasets are often considered. Unfortunately, there are not many…

Information Retrieval · Computer Science 2024-12-11 Miha Malenšek, Blaž Škrlj, Blaž Mramor, Jure Demšar

When faced with new data, we often conduct a cluster analysis to obtain a better understanding of the data's structure and the archetypical samples present in the data. This process often includes visualization of the data, either as a way…

Applications · Statistics 2026-04-06 Justin Lin, Julia Fukuyama

Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of…

Computation and Language · Computer Science 2024-06-24 Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang

Synthetic data has gained significant momentum thanks to sophisticated machine learning tools that enable the synthesis of high-dimensional datasets. However, many generation techniques do not give the data controller control over what…

Cryptography and Security · Computer Science 2022-11-22 Florimond Houssiau, Samuel N. Cohen, Lukasz Szpruch, Owen Daniel, Michaela G. Lawrence, Robin Mitra, Henry Wilde, Callum Mole

Recent breakthroughs in synthetic data generation approaches made it possible to produce highly photorealistic images which are hardly distinguishable from real ones. Furthermore, synthetic generation pipelines have the potential to…

Computer Vision and Pattern Recognition · Computer Science 2023-07-06 Alon Shoshan, Nadav Bhonker, Igor Kviatkovsky, Matan Fintz, Gerard Medioni

Individual-level data (microdata) that characterizes a population, is essential for studying many real-world problems. However, acquiring such data is not straightforward due to cost and privacy constraints, and access is often limited to…

Machine Learning · Computer Science 2022-12-13 Angeela Acharya, Siddhartha Sikdar, Sanmay Das, Huzefa Rangwala

The generation of synthetic data is an essential tool to study complex systems, allowing for example to test models of these in precisely controlled settings, or to parametrize simulation models when data is missing. This paper focuses on…

Applications · Statistics 2019-11-25 Juste Raimbault

With the development of machine learning and data science, data sharing is very common between companies and research institutes to avoid data scarcity. However, sharing original datasets that contain private information can cause privacy…

Machine Learning · Computer Science 2022-11-30 Mingchen Li, Di Zhuang, J. Morris Chang

Synthetic data generation has been widely adopted in software testing, data privacy, imbalanced learning, and artificial intelligence explanation. In all such contexts, it is crucial to generate plausible data samples. A common assumption…

Artificial Intelligence · Computer Science 2024-10-16 Martina Cinquini, Fosca Giannotti, Riccardo Guidotti

The availability of large datasets is crucial for the development of new power system applications and tools; unfortunately, very few are publicly and freely available. We designed an end-to-end generative framework for the creation of…

Systems and Control · Electrical Eng. & Systems 2022-07-26 Andrea Pinceti, Lalitha Sankar, Oliver Kosut

This paper proposes a new method to generate synthetic data sets based on copula models. Our goal is to produce surrogate data resembling real data in terms of marginal and joint distributions. We present a complete and reliable algorithm…

Machine Learning · Computer Science 2022-04-01 Regis Houssou, Mihai-Cezar Augustin, Efstratios Rappos, Vivien Bonvin, Stephan Robert-Nicoud

The recent emerging fields in data processing and manipulation have facilitated the need for synthetic data generation. This is also valid for mobility encounter dataset generation. Synthetic data generation might be useful to run…

Networking and Internet Architecture · Computer Science 2020-02-21 Rajarshi Haldar, Salih Safa Bacanli, Moayad Aloqaily, Adel Ben Mnaouer, Damla Turgut

Deep learning has shown excellent performance in analysing medical images. However, datasets are difficult to obtain due to privacy issues, standardization problems, and lack of annotations. We address these problems by producing realistic…

Image and Video Processing · Electrical Eng. & Systems 2022-02-18 Enric Moreu, Kevin McGuinness, Noel E. O'Connor

Data is the driving force of machine learning, with the amount and quality of training data often being more important for the performance of a system than architecture and training details. But collecting, processing and annotating real…
