数据库
Approximate Nearest Neighbor Search (ANNS) is essential for various data-intensive applications, including recommendation systems, image retrieval, and machine learning. Scaling ANNS to handle billions of high-dimensional vectors on a…
Sketches have shown high accuracy in multi-way join cardinality estimation, a critical problem in cost-based query optimization. Accurately estimating the cardinality of a join operation -- analogous to its computational cost -- allows the…
The application of Large Language Models (LLMs) to text-to-SQL tasks promises to democratize data access, particularly in critical industries like aviation Maintenance, Repair, and Operation (MRO). However, progress is hindered by two key…
Recent advances in language models opened new opportunities to address complex schema matching tasks. Schema matching approaches have been proposed that demonstrate the usefulness of language models, but they have also uncovered important…
We study the problem of answering conjunctive queries with free access patterns (CQAPs) under updates. A free access pattern is a partition of the free variables of the query into input and output. The query returns tuples over the output…
Sideways information passing is a well-known technique for mitigating the impact of large build sides in a database query plan. As currently implemented in production systems, sideways information passing enables only a uni-directional…
Data discovery and table unionability in particular became key tasks in modern Data Science. However, the human perspective for these tasks is still under-explored. Thus, this research investigates the human behavior in determining table…
Electronic medical records (EMRs) contain essential data for patient care and clinical research. With the diversity of structured and unstructured data in EHR, data visualization is an invaluable tool for managing and explaining these…
Instance-optimized components have made their way into production systems. To some extent, this adoption is due to the characteristics of customer workloads, which can be individually leveraged during the model training phase. However,…
Colored Petri Nets (CPNs) are an established formalism for modeling processes where tokens carry data. Although tools like CPN Tools and CPN IDE excel at CPN-based simulation, they are often separate from modern data science ecosystems.…
Text-to-SQL systems enable users to query databases using natural language, democratizing access to data analytics. However, they face challenges in understanding ambiguous phrasing, domain-specific vocabulary, and complex schema…
Scientific experiments and modern applications are generating large amounts of data every day. Most organizations utilize In-house servers or Cloud resources to manage application data and workload. The traditional database management…
Text-to-SQL, which enables natural language interaction with databases, serves as a pivotal method across diverse industries. With new, more powerful large language models (LLMs) emerging every few months, fine-tuning has become incredibly…
Deletion Propagation (DP) refers to a family of database problems rooted in the classical view-update problem: how to propagate intended deletions in a view (query output) back to the source database while satisfying constraints and…
Database connectors are critical components enabling applications to interact with underlying database management systems (DBMS), yet their security vulnerabilities often remain overlooked. Unlike traditional software defects, connector…
Process querying is used to extract information and insights from process execution data. Similarly, process constraints can be checked against input data, yielding information on which process instances violate them. Traditionally, such…
To meet the needs of a large pharmaceutical organization, we set out to create S3Mirror - an application for transferring large genomic sequencing datasets between S3 buckets quickly, reliably, and observably. We used the DBOS Transact…
Query optimization has become a research area where classical algorithms are being challenged by machine learning algorithms. At the same time, recent trends in learned query optimizers have shown that it is prudent to take advantage of…
The task of extracting a diverse subset from a dataset, often referred to as maximum diversification, plays a pivotal role in various real-world applications that have far-reaching consequences. In this work, we delve into the realm of…
ArcNeural introduces a novel multimodal database tailored for the demands of Generative AI and Large Language Models, enabling efficient management of diverse data types such as graphs, vectors, and documents. Its storage-compute separated…