数据库
We present an index structure to boost the evaluation of free-connex acyclic conjunctive queries (fc-ACQs) over relational databases. The main ingredient of the index associated with a given database $D$ is an auxiliary database $D_{col}$.…
Many operational systems collect high-dimensional timeseries data about users/systems on key performance metrics. For instance, ISPs, content distribution networks, and video delivery services collect quality of experience metrics for user…
Most research on query optimization has centered on binary join algorithms like hash join and sort-merge join. However, recent years have seen growing interest in theoretically optimal algorithms, notably Yannakakis' algorithm. These…
Natural Language Interfaces for Databases empower non-technical users to interact with data using natural language (NL). Advanced approaches, utilizing either neural sequence-to-sequence or more recent sophisticated large-scale language…
The Natural Language Interface to Databases (NLIDB) empowers non-technical users with database access through intuitive natural language (NL) interactions. Advanced approaches, utilizing neural sequence-to-sequence models or large-scale…
Data discovery and preparation remain persistent bottlenecks in the data management lifecycle, especially when user intent is vague, evolving, or difficult to operationalize. The Pneuma Project introduces Pneuma-Seeker, a system that helps…
Enterprise deployments of vector databases require access control policies to protect sensitive data. These systems often implement access control through hybrid vector queries that combine nearest-neighbor search with relational predicates…
In recent years, large language models (LLMs) have demonstrated remarkable capabilities in comprehending and generating natural language content, attracting widespread attention in both industry and academia. An increasing number of…
Recent advances in quantum computing have led to progress in exploring quantum applications across diverse fields, including databases and data management. This work presents a quantum machine learning model that tackles the challenge of…
Approximate Nearest Neighbor Search (ANNS) is a fundamental operation in vector databases, enabling efficient similarity search in high-dimensional spaces. While dense ANNS has been optimized using specialized hardware accelerators, sparse…
Given a table T in a database and a question Q in natural language, the table question answering (TQA) task aims to return an accurate answer to Q based on the content of T. Recent state-of-the-art solutions leverage large language models…
This paper describes a new process and software system, the Case Count Metric System (CCMS), for systematically comparing and analyzing the outcomes of two different ER clustering processes acting on the same dataset when the true linking…
In today's age, it is becoming increasingly difficult to decipher truth from lies. Every day, politicians, media outlets, and public figures make conflicting claims -- often about topics that can, in principle, be verified against…
Tabular data constitute a dominant form of information in modern data lakes and repositories, yet discovering the relevant tables to answer user questions remains challenging. Existing data discovery systems assume that each question can be…
Vector search (VS) has become a fundamental component in multimodal data management, enabling core functionalities such as image, video, and code retrieval. As vector data scales rapidly, VS faces growing challenges in balancing search,…
Memory overload is a common form of resource exhaustion in cloud data warehouses. When database queries fail due to memory overload, it not only wastes critical resources such as CPU time but also disrupts the execution of core business…
Cloud storage has become the backbone of modern data infrastructure, yet privacy and efficient data retrieval remain significant challenges. Traditional privacy-preserving approaches primarily focus on enhancing database security but fail…
Complex Event Recognition (CER) systems are used to identify complex patterns in event streams, such as those found in stock markets, sensor networks, and other similar applications. An important task in such patterns is aggregation, which…
Hybrid search has emerged as a promising paradigm that combines lexical and semantic retrieval, enhancing accuracy for applications such as recommendations, information retrieval, and Retrieval-Augmented Generation. However, existing…
The rise of graph analytics platforms has led to the development of various benchmarks for evaluating and comparing platform performance. However, existing benchmarks often fall short of fully assessing performance due to limitations in…