$\textit{sentropy}$: A Python Package for Revealing Hidden Differences in Complex Datasets

Quantitative Methods 2026-01-08 v2

Abstract

Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed S-entropy (similarity-sensitive entropy), that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed $\textit{sentropy}$, a Python package that calculates S-entropy and is tailored to large datasets. $\textit{sentropy}$ can calculate any of the frequency-sensitive measures of Hill's D-number framework and their similarity-sensitive counterparts. $\textit{sentropy}$ also outputs measures that compare datasets. We first briefly review S-entropy, illustrating how it incorporates elements' frequencies and elements' pairwise similarities. We then describe $\textit{sentropy}$'s key features and usage. We end with several examples (immunomics, metagenomics, computational pathology, and medical imaging) illustrating $\textit{sentropy}$'s applicability across a range of dataset types and fields.
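
As a rough illustration of how frequencies and pairwise similarities can enter a single measure, the sketch below computes a Hill-type D-number and a similarity-sensitive counterpart with plain NumPy. It is not the package's API: the function name `similarity_sensitive_diversity`, its arguments, and the choice of the Leinster-Cobbold formulation are assumptions made for illustration, and the abstract does not confirm that the package computes its measures this way.

```python
import numpy as np


def similarity_sensitive_diversity(p, Z, q):
    """Hypothetical helper (not part of any package): diversity of order q.

    p : relative frequencies of the elements (non-negative, summing to 1)
    Z : pairwise similarity matrix with entries in [0, 1] and ones on the diagonal
    q : viewpoint parameter (0 weights rare elements most, larger q less)

    Assumes the Leinster-Cobbold form; with Z equal to the identity matrix
    it reduces to the frequency-only Hill number of order q.
    """
    p = np.asarray(p, dtype=float)
    Z = np.asarray(Z, dtype=float)

    # Elements with zero frequency contribute nothing, so drop them.
    keep = p > 0
    p = p[keep]
    Z = Z[np.ix_(keep, keep)]

    # "Ordinariness" of each element: its own frequency plus the
    # similarity-weighted frequencies of all other elements.
    zp = Z @ p

    if np.isclose(q, 1.0):
        # Limit q -> 1 (Shannon-type case).
        return float(np.exp(-np.sum(p * np.log(zp))))
    if np.isinf(q):
        # Limit q -> infinity: dominated by the most ordinary element.
        return float(1.0 / np.max(zp))
    return float(np.sum(p * zp ** (q - 1.0)) ** (1.0 / (1.0 - q)))


# Two equally frequent classes.
p = [0.5, 0.5]
identity = np.eye(2)                       # classes treated as fully distinct
similar = np.array([[1.0, 0.5],
                    [0.5, 1.0]])           # classes treated as half-similar

d_freq = similarity_sensitive_diversity(p, identity, q=1)
d_sim = similarity_sensitive_diversity(p, similar, q=1)
```

In this toy case the frequency-only value is 2 effective classes, while the similarity-sensitive value drops to 4/3 because the two classes overlap, which is the kind of difference that size and class balance alone would not reveal.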

Cite

@article{arxiv.2401.00102,
  title  = {$\textit{sentropy}$: A Python Package for Revealing Hidden Differences in Complex Datasets},
  author = {Phuc Nguyen and Rohit Arora and Elliot D. Hill and Jasper Braun and Alexandra Morgan and Liza M. Quintana and Gabrielle Mazzoni and Ghee Rye Lee and Rima Arnaout and Ramy Arnaout},
  journal= {arXiv preprint arXiv:2401.00102},
  year   = {2026}
}

Comments

43 pages, 8 figures