Related papers: Data Integration with High Dimensionality
This article deals with the analysis of high dimensional data that come from multiple sources (experiments) and thus have different possibly correlated responses, but share the same set of predictors. The measurements of the predictors may…
Existing work on understanding deep learning often employs measures that compress all data-dependent information into a few numbers. In this work, we adopt a perspective based on the role of individual examples. We introduce a measure of…
The integrative analysis of multiple datasets is an important strategy in data analysis. It is increasingly popular in genomics, which enjoys a wealth of publicly available datasets that can be compared, contrasted, and combined in order to…
New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include a myriad of properties describing genome, epigenome, transcriptome, microbiome,…
Motivation: Gene selection has become a common task in most gene expression studies. The objective of such research is often to identify the smallest possible set of genes that can still achieve good predictive performance. The problem of…
In clinical settings, we often face the challenge of building prediction models based on small observational data sets. For example, such a data set might be from a medical center in a multi-center study. Differences between centers might…
Identifying disease genes from human genome is an important and fundamental problem in biomedical research. Despite many publications of machine learning methods applied to discover new disease genes, it still remains a challenge because of…
We consider the problem of inferring the values of an arbitrary set of variables (e.g., risk of diseases) given other observed variables (e.g., symptoms and diagnosed diseases) and high-dimensional signals (e.g., MRI images or EEG). This is…
As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g. rising costs, declining survey response rates), researchers increasingly use predictions from…
We consider statistical inference in high-dimensional regression problems under affine constraints on the parameter space. The theoretical study of this is motivated by the study of genetic determinants of diseases, such as diabetes, using…
Over the years, there has been growing interest in using Machine Learning techniques for biomedical data processing. When tackling these tasks, one needs to bear in mind that biomedical data depends on a variety of characteristics, such as…
We consider high-dimensional regression over subgroups of observations. Our work is motivated by biomedical problems, where disease subtypes, for example, may differ with respect to underlying regression models, but sample sizes at the…
Many Artificial Intelligence systems depend on the agent's updating its beliefs about the world on the basis of experience. Experiments constitute one type of experience, so scientific methodology offers a natural environment for examining…
In scientific inference problems, the underlying statistical modeling assumptions have a crucial impact on the end results. There exist, however, only a few automatic means for validating these fundamental modelling assumptions. The…
Machine learning models are often trained to predict the outcome resulting from a human decision. For example, if a doctor decides to test a patient for disease, will the patient test positive? A challenge is that historical decision-making…
Probabilistic modeling is cyclical: we specify a model, infer its posterior, and evaluate its performance. Evaluation drives the cycle, as we revise our model based on how it performs. This requires a metric. Traditionally, predictive…
Selectivity estimation aims at estimating the number of database objects that satisfy a selection criterion. Answering this problem accurately and efficiently is essential to many applications, such as density estimation, outlier detection,…
A data science task can be deemed as making sense of the data or testing a hypothesis about it. The conclusions inferred from data can greatly guide us to make informative decisions. Big data has enabled us to carry out countless prediction…
Instrumental variable methods provide useful tools for inferring causal effects in the presence of unmeasured confounding. To apply these methods with large-scale data sets, a major challenge is to find valid instruments from a possibly…
[PhD thesis of FCP.] Nowadays, genetics studies large amounts of very diverse variables. Mathematical statistics has evolved in parallel to its applications, with much recent interest high-dimensional settings. In the genetics of human…