Related papers: Spatial Random Sampling: A Structure-Preserving Da…
The immense amount of daily generated and communicated data presents unique challenges in their processing. Clustering, the grouping of data without the presence of ground-truth labels, is an important tool for drawing inferences from data.…
Sampling from very large spatial populations is challenging. The solutions suggested in recent literature on this subject often require that the randomly selected units are well distributed across the study region by using complex…
The size of large, geo-located datasets has reached scales where visualization of all data points is inefficient. Random sampling is a method to reduce the size of a dataset, yet it can introduce unwanted errors. We describe a method for…
Self-supervised learning has been widely used to obtain transferable representations from unlabeled images. In particular, recent contrastive learning methods have shown impressive performance on downstream image classification tasks. While…
Computer system simulation studies routinely rely on executing a limited number of short application regions, since full end-to-end simulation is prohibitively time-consuming. To preserve representativeness, existing methods employ either…
Random sampling has become a critical tool in solving massive matrix problems. For linear regression, a small, manageable set of data rows can be randomly selected to approximate a tall, skinny data matrix, improving processing time…
Spectral clustering refers to a family of unsupervised learning algorithms that compute a spectral embedding of the original data based on the eigenvectors of a similarity graph. This non-linear transformation of the data is both the key of…
This article explores and analyzes the unsupervised clustering of large partially observed graphs. We propose a scalable and provable randomized framework for clustering graphs generated from the stochastic block model. The clustering is…
Well-spread samples are desirable in many disciplines because they improve estimation when target variables exhibit spatial structure. This paper introduces an integrated methodological framework for spreading samples over the population's…
Traditionally, the problem was that researchers did not have access to enough spatial data to answer pressing research questions or build compelling visualizations. Today, however, the problem is often that we have too much data.…
Ranked set sampling (RSS) is a stratified sampling method that improves efficiency over simple random sampling (SRS) by utilizing auxiliary information for ranking and stratification. While balanced RSS (BRSS) assumes equal allocation…
The nearest prototype classification is a less computationally intensive replacement for the $k$-NN method, especially when large datasets are considered. In metric spaces, centroids are often used as prototypes to represent whole clusters.…
Accurate land cover segmentation of spectral images is challenging and has drawn widespread attention in remote sensing due to its inherent complexity. Although significant efforts have been made for developing a variety of methods, most of…
Soft random sampling (SRS) is a simple yet effective approach for efficient training of large-scale deep neural networks when dealing with massive data. SRS selects a subset uniformly at random with replacement from the full data set in…
Feature engineering plays an important role in the success of a machine learning model. Most of the effort in training a model goes into data preparation and choosing the right representation. In this paper, we propose a robust feature…
Spectral clustering is one of the most effective clustering approaches that capture hidden cluster structures in the data. However, it does not scale well to large-scale problems due to its quadratic complexity in constructing similarity…
In several environmental applications, data are functions of time, essentially continuous, observed and recorded discretely, and spatially correlated. Most of the methods for analyzing such data are extensions of spatial statistical tools…
When solving real-world problems, practitioners often hesitate to implement solutions obtained from mathematical models, especially for important decisions. This hesitation stems from practitioners' lack of trust in optimization models and…
Bayesian model-based spatial clustering methods are widely used for their flexibility in estimating latent clusters with an unknown number of clusters while accounting for spatial proximity. Many existing methods are designed for clustering…
We present a structural clustering algorithm for large-scale datasets of small labeled graphs, utilizing a frequent subgraph sampling strategy. A set of representatives provides an intuitive description of each cluster, supports the…