$\textit{sentropy}$: A Python Package for Revealing Hidden Differences in Complex Datasets

Phuc Nguyen; Rohit Arora; Elliot D. Hill; Jasper Braun; Alexandra Morgan; Liza M. Quintana; Gabrielle Mazzoni; Ghee Rye Lee; Rima Arnaout; Ramy Arnaout

Quantitative Methods 2026-01-08 v2

Abstract

Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed S-entropy (similarity-sensitive entropy), that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed $\textit{sentropy}$, a Python package that calculates S-entropy and is tailored to large datasets. $\textit{sentropy}$ can calculate any of the frequency-sensitive measures of Hill's D-number framework and their similarity-sensitive counterparts. $\textit{sentropy}$ also outputs measures that compare datasets. We first briefly review S-entropy, illustrating how it incorporates elements' frequencies and elements' pairwise similarities. We then describe $\textit{sentropy}$'s key features and usage. We end with several examples (immunomics, metagenomics, computational pathology, and medical imaging) illustrating $\textit{sentropy}$'s applicability across a range of dataset types and fields.
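As a rough illustration of what "frequency-sensitive" and "similarity-sensitive" mean here, the sketch below computes a Hill number of order $q$ and a similarity-sensitive counterpart in plain Python. It assumes the Leinster-Cobbold formulation, $D_q^Z(p) = \left(\sum_i p_i (Zp)_i^{\,q-1}\right)^{1/(1-q)}$, which the abstract does not spell out. The function name `similarity_sensitive_diversity` and its parameters are hypothetical and are not the $\textit{sentropy}$ API.

```python
import math

def similarity_sensitive_diversity(counts, Z=None, q=1.0):
    """Effective number of elements of order q (hypothetical helper).

    counts: nonnegative abundances, one per element or class.
    Z: pairwise similarity matrix with entries in [0, 1] and ones on the
       diagonal. If omitted, the identity is used, which is assumed to
       recover the ordinary frequency-only Hill number.
    """
    n = len(counts)
    total = sum(counts)
    p = [c / total for c in counts]  # relative frequencies
    if Z is None:
        Z = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # (Zp)_i: how "ordinary" element i is, given everything similar to it
    zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    support = [(pi, zi) for pi, zi in zip(p, zp) if pi > 0]
    if q == 1:
        # limit q -> 1: exponential of the similarity-sensitive Shannon entropy
        return math.exp(-sum(pi * math.log(zi) for pi, zi in support))
    if math.isinf(q):
        # limit q -> infinity: governed by the most "ordinary" element
        return 1.0 / max(zi for _, zi in support)
    return sum(pi * zi ** (q - 1) for pi, zi in support) ** (1.0 / (1.0 - q))

# Two equally frequent classes, treated as completely distinct
balanced = similarity_sensitive_diversity([50, 50], q=1)
# The same two classes, assumed to be 90% similar to each other
near_duplicates = similarity_sensitive_diversity(
    [50, 50], Z=[[1.0, 0.9], [0.9, 1.0]], q=1
)
```

Under these assumptions, the two datasets have identical size and class balance, yet the second has an effective number of classes close to 1 rather than 2, which is the kind of difference that size and class balance alone do not show.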

Keywords

gene expression analysis; spatial ecology and biodiversity; genome sequencing analysis

Cite

@article{arxiv.2401.00102,
  title  = {$\textit{sentropy}$: A Python Package for Revealing Hidden Differences in Complex Datasets},
  author = {Phuc Nguyen and Rohit Arora and Elliot D. Hill and Jasper Braun and Alexandra Morgan and Liza M. Quintana and Gabrielle Mazzoni and Ghee Rye Lee and Rima Arnaout and Ramy Arnaout},
  journal= {arXiv preprint arXiv:2401.00102},
  year   = {2026}
}

Comments

43 pages, 8 figures
