数据库
Recent advancements in large language models (LLMs) have shown promise in bridging the gap between natural language queries and database management systems, enabling users to interact with databases without the background of SQL. However,…
Relational databases are foundational to numerous domains, including business intelligence, scientific research, and enterprise systems. However, accessing and analyzing structured data often requires proficiency in SQL, which is a skill…
Generative AI is transforming business applications by enabling natural language interfaces and intelligent automation. However, the underlying large language models (LLMs) are evolving rapidly and so prompting them consistently is a…
Trajectory mining has attracted significant attention. This paper addresses the Top-k Representative Similar Subtrajectory Query (TRSSQ) problem, which aims to find the k most representative subtrajectories similar to a query. Existing…
Large language models (LLMs) can generate code from natural language descriptions. Their performance is typically evaluated using programming benchmarks that simulate real-world tasks. These benchmarks provide specifications in the form of…
How to generate a large, realistic set of tables along with joinability relationships, to stress-test dataset discovery methods? Dataset discovery methods aim to automatically identify related data assets in a data lake. The development and…
Over the recent years, Shapley value (SV), a solution concept from cooperative game theory, has found numerous applications in data analytics (DA). This paper presents the first comprehensive study of SV used throughout the DA workflow,…
Large Language Models (LLMs) can enhance analytics systems with powerful data summarization, cleaning, and semantic transformation capabilities. However, deploying LLMs at scale -- processing millions to billions of rows -- remains…
Increasingly massive volumes of multi-modal data are being accumulated in many {real world} settings, including in health care and e-commerce. This development calls for effective general-purpose data management solutions for multi-modal…
Cache systems fundamentally limit modern computing performance due to their inability to precisely capture data relationships. While achieving 85-92% hit rates, traditional systems rely on statistical heuristics that cannot guarantee…
Query optimization is essential for efficient SQL query execution in DBMS, and remains attractive over time due to the growth of data volumes and advances in hardware. Existing traditional optimizers struggle with the cumbersome hand-tuning…
In Complex Event Processing, handling out-of-order, late, and duplicate events is critical for real-time analytics, especially on resource-constrained devices that process heterogeneous data from multiple sources. We present LimeCEP, a…
Large language model (LLM) embeddings offer a promising new avenue for database query optimization. In this paper, we explore how pre-trained execution plan embeddings can guide SQL query execution without the need for additional model…
Vector set search, an underexplored similarity search paradigm, aims to find vector sets similar to a query set. This search paradigm leverages the inherent structural alignment between sets and real-world entities to model more…
Datalog is a popular logic programming language for deductive reasoning tasks in a wide array of applications, including business analytics, program analysis, and ontological reasoning. However, Datalog's restriction to flat facts over…
Datalogo is an extension of Datalog, where instead of a program being a collection of union of conjunctive queries over the standard Boolean semiring, a program may now be a collection of sum-product queries over an arbitrary commutative…
The lack of standardized tabular formats for tenancy schedules across real estate firms creates significant inefficiencies in data integration. Existing automated integration methods, such as Full Disjunction (FD)-based models like ALITE,…
PathDB is a Java-based graph database designed for in-memory data loading and querying. By utilizing Regular Path Queries (RPQ) and a closed path algebra, PathDB processes paths through its three main components: the parser, the logical…
Traditional Data+AI systems utilize data-driven techniques to optimize performance, but they rely heavily on human experts to orchestrate system pipelines, enabling them to adapt to changes in data, queries, tasks, and environments. For…
Retrieval-Augmented Generation (RAG) has proven effective on server infrastructures, but its application on mobile devices is still underexplored due to limited memory and power resources. Existing vector search and RAG solutions largely…