Related papers: Improving Schema Matching with Linked Data
Scholars studying organizations often work with multiple datasets lacking shared identifiers or covariates. In such situations, researchers usually use approximate string ("fuzzy") matching methods to combine datasets. String matching,…
Current network training paradigms primarily focus on either centralized or decentralized data regimes. However, in practice, data availability often exhibits a hybrid nature, where both regimes coexist. This hybrid setting presents new…
A search engine's ability to retrieve desirable datasets is important for data sharing and reuse. Existing dataset search engines typically rely on matching queries to dataset descriptions. However, a user may not have enough prior…
Linked Data have emerged as a successful publication format and one of its main strengths is its fitness for integration of data from multiple sources. This gives them a great potential both for semantic applications and the enterprise…
Enterprises often own large collections of structured data in the form of large databases or an enterprise data lake. Such data collections come with limited metadata and strict access policies that could limit access to the data contents…
Automatically extracted metadata from scholarly documents in PDF formats is usually noisy and heterogeneous, often containing incomplete fields and erroneous values. One common way of cleaning metadata is to use a bibliographic reference…
As organizations continue to access diverse datasets, the demand for effective data integration has increased. Key tasks in this process, such as schema matching and entity resolution, are essential but often require significant effort.…
Many data we collect today are in tabular form, with rows as records and columns as attributes associated with each record. Understanding the structural relationship in tabular data can greatly facilitate the data science process.…
In any knowledge discovery process the value of extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by massive growth in the scale of data observed in recent years, also follow the same…
Enterprises often maintain multiple databases for storing critical business data in siloed systems, resulting in inefficiencies and challenges with data interoperability. A key to overcoming these challenges lies in integrating disparate…
Data analysis often involves comparing subsets of data across many dimensions for finding unusual trends and patterns. While the comparison between subsets of data can be expressed using SQL, they tend to be complex to write, and suffer…
The abundant semi-structured data on the Web, such as HTML-based tables and lists, provide commercial search engines a rich information source for question answering (QA). Different from plain text passages in Web documents, Web tables and…
In this paper, we present an algorithm for automatically building expertise evidence for finding experts within an organization by combining structured corporate information with different content. We also describe our test data collection…
Large language models (LLMs) have shown promise in table Question Answering (Table QA). However, extending these capabilities to multi-table QA remains challenging due to unreliable schema linking across complex tables. Existing methods…
Creating challenging tabular inference data is essential for learning complex reasoning. Prior work has mostly relied on two data generation strategies. The first is human annotation, which yields linguistically diverse data but is…
In the field of business data analysis, the ability to extract actionable insights from vast and varied datasets is essential for informed decision-making and maintaining a competitive edge. Traditional rule-based systems, while reliable,…
Retrieval-Augmented Generation (RAG) systems typically treat documents as flat text, ignoring the structured metadata and linked relationships that knowledge graphs provide. In this paper, we investigate whether structured linked data,…
Data-driven analysis is important in virtually every modern organization. Yet, most data is underutilized because it remains locked in silos inside of organizations; large organizations have thousands of databases, and billions of files…
Large knowledge bases typically contain data adhering to various schemas with incomplete and/or noisy type information. This seriously complicates further integration and post-processing efforts, as type information is crucial in correctly…
The need for Knowledge and Data Discovery Management Systems (KDDMS) that support ad hoc data mining queries has been long recognized. A significant amount of research has gone into building tightly coupled systems that integrate…