Related papers: OpenForge: Probabilistic Metadata Integration
Developing effective multimodal data fusion strategies has become increasingly essential for improving the predictive power of statistical machine learning methods across a wide range of applications, from autonomous driving to medical…
Inference metaprogramming enables effective probabilistic programming by supporting the decomposition of executions of probabilistic programs into subproblems and the deployment of hybrid probabilistic inference algorithms that apply…
The Big Data landscape poses challenges in managing diverse data formats, requiring efficient storage and processing for high-quality analysis. Effective metadata management is crucial for organizing, accessing, and reusing data within…
Integration of multimodal information from various sources has been shown to boost the performance of machine learning models and thus has received increased attention in recent years. Often such models use deep modality-specific networks…
Retrieval-Augmented Generation systems depend on retrieving semantically relevant document chunks to support accurate, grounded outputs from large language models. In structured and repetitive corpora such as regulatory filings, chunk…
Probabilistic Face Embeddings (PFE) can improve face recognition performance in unconstrained scenarios by integrating data uncertainty into the feature representation. However, existing PFE methods tend to be over-confident in estimating…
This paper introduces \textit{Federated Retrieval-Augmented Generation (FRAG)}, a novel database management paradigm tailored for the growing needs of retrieval-augmented generation (RAG) systems, which are increasingly powered by…
Data fusion is the combination of the results of independent searches on a document collection into one single output result set. It has been shown in the past that this can greatly improve retrieval effectiveness over that of the…
Pre-trained multimodal models have achieved significant success in retrieval-based question answering. However, current multimodal retrieval question-answering models face two main challenges. Firstly, utilizing compressed evidence features…
Foundational models (FMs) have made significant strides in the healthcare domain. Yet the data silo challenge and privacy concern remain in healthcare systems, hindering safe medical data sharing and collaborative model development among…
The complexity of financial data, characterized by its variability and low signal-to-noise ratio, necessitates advanced methods in quantitative investment that prioritize both performance and interpretability.Transitioning from early manual…
When machine learning systems meet real world applications, accuracy is only one of several requirements. In this paper, we assay a complementary perspective originating from the increasing availability of pre-trained and regularly…
Metamorphic testing (MT) is widely used for testing programs that face the oracle problem. It uses a set of metamorphic relations (MRs), which are relations among multiple inputs and their corresponding outputs to determine whether the…
Scientific applications produce vast amounts of data, posing grand challenges in the underlying data management and analytic tasks. Progressive compression is a promising way to address this problem, as it allows for on-demand data…
Developing the capacity to effectively search for requisite datasets is an urgent requirement to assist data users in identifying relevant datasets considering the very limited available metadata. For this challenge, the utilization of…
Metamorphic Testing (MT) addresses the test oracle problem by examining the relations between inputs and outputs of test executions. Such relations are known as Metamorphic Relations (MRs). In current practice, identifying and selecting…
This paper presents an approach for metadata reconciliation, curation and linking for Open Governamental Data Portals (ODPs). ODPs have been lately the standard solution for governments willing to put their public data available for the…
In massively collaborative projects such as scientific or community databases, users often need to agree or disagree on the content of individual data items. On the other hand, trust relationships often exist between users, allowing them to…
Data preprocessing is an important component of machine learning pipelines, which requires ample time and resources. An integral part of preprocessing is data transformation into the format required by a given learning algorithm. This paper…
Most enterprise data is distributed in multiple relational databases with expert-designed schema. Using traditional single-table machine learning techniques over such data not only incur a computational penalty for converting to a 'flat'…