Nonlinear multi-study factor analysis

Machine Learning 2026-01-27 v1

Abstract

High-dimensional data often exhibit variation that can be captured by lower-dimensional factors. For high-dimensional data from multiple studies or environments, one goal is to understand which underlying factors are common to all studies and which are specific to a study or environment. As a particular example, we consider platelet gene expression data from patients in different disease groups. In these data, factors correspond to clusters of co-expressed genes; we may expect some clusters (or biological pathways) to be active for all diseases, while others are active only for a specific disease. To learn these factors, we consider a nonlinear multi-study factor model, which allows for both shared and study-specific factors. To fit this model, we propose a multi-study sparse variational autoencoder. The underlying model is sparse in that each observed feature (i.e., each dimension of the data) depends on a small subset of the latent factors. In the genomics example, this means each gene is active in only a few biological processes. Further, the model implicitly induces a penalty on the number of latent factors, which helps separate the shared factors from the group-specific factors. We prove that the latent factors are identified, and demonstrate that our method recovers meaningful factors in the platelet gene expression data.
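
To make the structure described in the abstract concrete, the following is a minimal simulation sketch, not the authors' model or fitting procedure. It assumes a specific form that the abstract does not state: a binary feature-to-factor mask, a tanh nonlinearity, Gaussian noise, and exactly two active factors per feature. The function name `simulate_multi_study` and all of its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_multi_study(n_per_study=50, n_features=12, n_shared=2,
                         n_specific=1, n_studies=3, seed=0):
    """Simulate toy data with shared and study-specific latent factors.

    Assumptions (illustrative only): tanh nonlinearity, Gaussian noise,
    and a binary mask giving each feature exactly two active factors.
    """
    rng = np.random.default_rng(seed)
    n_factors = n_shared + n_studies * n_specific

    # Sparse mask: each feature (e.g. gene) depends on only two factors.
    mask = np.zeros((n_features, n_factors))
    for j in range(n_features):
        active = rng.choice(n_factors, size=2, replace=False)
        mask[j, active] = 1.0
    weights = rng.normal(size=(n_features, n_factors))

    data, factors, labels = [], [], []
    for s in range(n_studies):
        z = np.zeros((n_per_study, n_factors))
        # Shared factors are active in every study.
        z[:, :n_shared] = rng.normal(size=(n_per_study, n_shared))
        # Study-specific factors are active only in study s.
        lo = n_shared + s * n_specific
        z[:, lo:lo + n_specific] = rng.normal(size=(n_per_study, n_specific))
        # Feature j is a nonlinear function of its masked factors only.
        x = np.tanh(z @ (mask * weights).T)
        x += 0.1 * rng.normal(size=x.shape)
        data.append(x)
        factors.append(z)
        labels.append(np.full(n_per_study, s))

    return np.vstack(data), np.vstack(factors), np.concatenate(labels), mask
```

Calling `simulate_multi_study()` returns the stacked observations, the latent factors, the study labels, and the mask. In this toy setup the columns of the factor matrix that belong to other studies are exactly zero for each study, which is the shared-versus-specific pattern a multi-study factor model aims to recover.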

Cite

@article{arxiv.2601.18128,
  title  = {Nonlinear multi-study factor analysis},
  author = {Gemma E. Moran and Anandi Krishnan},
  journal= {arXiv preprint arXiv:2601.18128},
  year   = {2026}
}