Databases
Transforming natural language into SQL queries (NL2SQL) is crucial for data-driven business applications. Existing frameworks, trained on open-source datasets, struggle with complex business logic and lack domain-specific data for…
Concurrent accesses to databases are typically grouped in transactions which define units of work that should be isolated from other concurrent computations and resilient to failures. Modern databases provide different levels of isolation…
Approximate nearest neighbor search (ANNS) is a fundamental problem in vector databases and AI infrastructures. Recent graph-based ANNS algorithms have achieved high search accuracy with practical efficiency. Despite the advancements, these…
Mining information from graph databases is becoming increasingly important. To approach this problem, current methods focus on identifying subgraphs with specific topologies; as of today, no work has been dedicated to jointly expressing the…
Recent research found that cloud data warehouses are text-heavy. However, their capabilities for efficiently processing string columns remain limited, relying primarily on techniques like dictionary encoding and prefix-based partition…
Log data is a vital resource for capturing system events and states. With the increasing complexity and widespread adoption of modern software systems and IoT devices, the daily volume of log generation has surged to tens of petabytes,…
Interactions between two entities often occur at specific timestamps, which can be modeled as a temporal graph. Exploring the relationships between vertices based on temporal paths is one of the fundamental tasks. In this paper, we conduct…
LSM-tree is a widely adopted data structure in modern key-value store systems that optimizes write performance in write-heavy applications by using append writes to achieve sequential writes. However, the unpredictability of LSM-tree…
Efficiently re-identifying and tracking objects across a network of cameras is crucial for applications like traffic surveillance. Spatula is the state-of-the-art video database management system (VDBMS) for processing Re-ID queries.…
This paper addresses emerging system-level challenges in heterogeneous retrieval-augmented generation (RAG) serving, where complex multi-stage workflows and diverse request patterns complicate efficient execution. We present HedraRAG, a…
The remarkable performance of Large Language Models (LLMs) has inspired many applications, which often necessitate edge-cloud collaboration due to connectivity, privacy, and cost considerations. Traditional methods primarily focus on…
Error-bounded lossy compression has been widely adopted in many scientific domains because it can address the challenges in storing, transferring, and analyzing unprecedented amounts of scientific data. Although error-bounded lossy…
Pattern set mining, which is the task of finding a good set of patterns instead of all patterns, is a fundamental problem in data mining. Many different definitions of what constitutes a good set have been proposed in recent years. In this…
We present ONION, a multi-layered framework for participatory Entity-Relationship (ER) modeling that integrates insights from design justice, participatory AI, and conceptual modeling. ONION introduces a five-stage methodology: Observe,…
Shapes Constraint Language (SHACL) is a powerful language for validating RDF data. Given the recent industry attention to Knowledge Graphs (KGs), more users need to validate linked data properly. However, traditional SHACL validation…
The rise of LLMs has enabled natural language-based table assistants, but existing systems assume users already have a well-formed table, neglecting the challenge of table discovery in large-scale table pools. To address this, we introduce…
Most recently, researchers have started building data systems powered by large language models (LLMs) that allow users to analyze unstructured text documents as if working with a database, because LLMs are very effective in extracting attributes…
Dataspaces are designed to support sovereign, trusted and decentralized data exchange between participants forming an ecosystem. They are standardized by initiatives such as the International Data Spaces Association or Gaia-X and have…
Query optimization is a fundamental task in database systems that is crucial to providing high performance. To evaluate the performance of learned and traditional optimizers, several benchmarks, such as the widely used JOB benchmark, are used.…
Learned cardinality estimators show promise in query cardinality prediction, yet they universally exhibit fragility to training data drifts, posing risks for real-world deployment. This work is the first to theoretically investigate how…