Related papers: Data Smashing

Deep Learning to Jointly Schema Match, Impute, and Transform Databases

An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable…

Databases · Computer Science 2022-07-11 Sandhya Tripathi, Bradley A. Fritz, Mohamed Abdelhack, Michael S. Avidan, Yixin Chen, Christopher R. King

Data Smashing 2.0: Sequence Likelihood (SL) Divergence For Fast Time Series Comparison

Recognizing subtle historical patterns is central to modeling and forecasting problems in time series analysis. Here we introduce and develop a new approach to quantify deviations in the underlying hidden generators of observed data…

Machine Learning · Statistics 2019-10-09 Yi Huang, Ishanu Chattopadhyay

Benchmark and application of unsupervised classification approaches for univariate data

Unsupervised machine learning, and in particular data clustering, is a powerful approach for the analysis of datasets and identification of characteristic features occurring throughout a dataset. It is gaining popularity across scientific…

Mesoscale and Nanoscale Physics · Physics 2021-03-23 Maria El Abbassi, Jan Overbeck, Oliver Braun, Michel Calame, Herre S. J. van der Zant, Mickael L. Perrin

Natural data structure extracted from neighborhood-similarity graphs

'Big' high-dimensional data are commonly analyzed in low-dimensions, after performing a dimensionality-reduction step that inherently distorts the data structure. For the same purpose, clustering methods are also often used. These methods…

Machine Learning · Statistics 2019-02-20 Tom Lorimer, Karlis Kanders, Ruedi Stoop

Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach

Clustering is an unsupervised machine learning methodology where unlabeled elements/objects are grouped together aiming to the construction of well-established clusters that their elements are classified according to their similarity. The…

Machine Learning · Statistics 2023-10-20 Dimitrios Saligkaras, Vasileios E. Papageorgiou

SLiMFast: Guaranteed Results for Data Fusion and Source Reliability

We focus on data fusion, i.e., the problem of unifying conflicting data from data sources into a single representation by estimating the source accuracies. We propose SLiMFast, a framework that expresses data fusion as a statistical…

Databases · Computer Science 2016-11-15 Manas Joglekar, Theodoros Rekatsinas, Hector Garcia-Molina, Aditya Parameswaran, Christopher Ré

CLARITY -- Comparing heterogeneous data using dissimiLARITY

Integrating datasets from different disciplines is hard because the data are often qualitatively different in meaning, scale, and reliability. When two datasets describe the same entities, many scientific questions can be phrased around…

Methodology · Statistics 2021-12-03 Daniel J. Lawson, Vinesh Solanki, Igor Yanovich, Johannes Dellert, Damian Ruck, Phillip Endicott

Statistical Inference for Manifold Similarity and Alignability across Noisy High-Dimensional Datasets

The rapid growth of high-dimensional datasets across various scientific domains has created a pressing need for new statistical methods to compare distributions supported on their underlying structures. Assessing similarity between datasets…

Statistics Theory · Mathematics 2025-11-27 Hongrui Chen, Rong Ma

Sifting data in the real world

In the real world, experimental data are rarely, if ever, distributed as a normal (Gaussian) distribution. As an example, a large set of data--such as the cross sections for particle scattering as a function of energy contained in the…

Data Analysis, Statistics and Probability · Physics 2009-11-11 Martin M. Block

Learning New Physics from Data -- a Symmetrized Approach

Thousands of person-years have been invested in searches for New Physics (NP), the majority of them motivated by theoretical considerations. Yet, no evidence of beyond the Standard Model (BSM) physics has been found. This suggests that…

High Energy Physics - Experiment · Physics 2024-10-22 Shikma Bressler, Inbar Savoray, Yuval Zurgil

A method to challenge symmetries in data with self-supervised learning

Symmetries are key properties of physical models and of experimental designs, but any proposed symmetry may or may not be realized in nature. In this paper, we introduce a practical and general method to test such suspected symmetries in…

High Energy Physics - Phenomenology · Physics 2022-08-25 Rupert Tombs, Christopher G. Lester

Demystifying Statistical Matching Algorithms for Big Data

Statistical matching is an effective method for estimating causal effects in which treated units are paired with control units with "similar" values of confounding covariates prior to performing estimation. In this way, matching helps…

Methodology · Statistics 2023-09-13 Sanjeewani Weerasingha, Michael J. Higgins

Adversarial Learning for Feature Shift Detection and Correction

Data shift is a phenomenon present in many real-world applications, and while there are multiple methods attempting to detect shifts, the task of localizing and correcting the features originating such shifts has not been studied in depth.…

Machine Learning · Computer Science 2023-12-08 Miriam Barrabes, Daniel Mas Montserrat, Margarita Geleta, Xavier Giro-i-Nieto, Alexander G. Ioannidis

Data Stream Clustering: A Review

Number of connected devices is steadily increasing and these devices continuously generate data streams. Real-time processing of data streams is arousing interest despite many challenges. Clustering is one of the most suitable methods for…

Machine Learning · Computer Science 2020-07-22 Alaettin Zubaroğlu, Volkan Atalay

Unsupervised clustering analysis: a multiscale complex networks approach

Unsupervised clustering, also known as natural clustering, stands for the classification of data according to their similarities. Here we study this problem from the perspective of complex networks. Mapping the description of data…

Data Analysis, Statistics and Probability · Physics 2012-08-22 Clara Granell, Sergio Gomez, Alex Arenas

Supervised Quantization for Similarity Search

In this paper, we address the problem of searching for semantically similar images from a large database. We present a compact coding approach, supervised quantization. Our approach simultaneously learns feature selection that linearly…

Computer Vision and Pattern Recognition · Computer Science 2019-02-05 Xiaojuan Wang, Ting Zhang, Guo-Jun Q, Jinhui Tang, Jingdong Wang

Spectral Clustering of Categorical and Mixed-type Data via Extra Graph Nodes

Clustering data objects into homogeneous groups is one of the most important tasks in data mining. Spectral clustering is arguably one of the most important algorithms for clustering, as it is appealing for its theoretical soundness and is…

Machine Learning · Statistics 2024-03-12 Dylan Soemitro, Jeova Farias Sales Rocha Neto

Distance Functions and Normalization Under Stream Scenarios

Data normalization is an essential task when modeling a classification system. When dealing with data streams, data normalization becomes especially challenging since we may not know in advance the properties of the features, such as their…

Machine Learning · Computer Science 2026-03-30 Eduardo V. L. Barboza, Paulo R. Lisboa de Almeida, Alceu de Souza Britto, Rafael M. O. Cruz

A method for classification of data with uncertainty using hypothesis testing

Binary classification is a task that involves the classification of data into one of two distinct classes. It is widely utilized in various fields. However, conventional classifiers tend to make overconfident predictions for data that…

Machine Learning · Computer Science 2025-03-13 Shoma Yokura, Akihisa Ichiki

Overview of streaming-data algorithms

Due to recent advances in data collection techniques, massive amounts of data are being collected at an extremely fast pace. Also, these data are potentially unbounded. Boundless streams of data collected from sensors, equipments, and other…

Databases · Computer Science 2012-03-12 T Soni Madhulatha