数据库
In industrial and IoT environments, massive amounts of real-time and historical process data are continuously generated and archived. With sensors and devices capturing every operational detail, the volume of time-series data has become a…
LLM serving systems process heterogeneous query workloads where different categories exhibit different characteristics. Code queries cluster densely in embedding space while conversational queries distribute sparsely. Content staleness…
Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and…
Minimizing intermediate results is critical for efficient multi-join query processing. Although the seminal Yannakakis algorithm offers strong guarantees for acyclic queries, cyclic queries remain an open challenge. In this paper, we…
Automatically configuring storage systems is hard: parameter spaces are large and conditions vary across workloads, deployments, and versions. Heuristic and ML tuners are often system specific, require manual glue, and degrade under…
The Open Data Protocol (OData) provides a standardized approach for building and consuming RESTful APIs with rich query capabilities. Despite its power and maturity, OData adoption remains confined primarily to enterprise environments,…
Running data analytics queries on serverless (FaaS) workers has been shown to be cost- and performance-efficient for a variety of real-world scenarios, including intermittent query arrival patterns, sudden load spikes and management…
Joinable Column Discovery is a critical challenge in automating enterprise data analysis. While existing approaches focus on syntactic overlap and semantic similarity, there remains limited understanding of which methods perform best for…
Approximate Nearest Neighbor Search (ANNS) has become a fundamental component in many real-world applications. Among various ANNS algorithms, graph-based methods are state-of-the-art. However, ANNS often suffers from a significant drop in…
Streaming process mining deals with the real-time analysis of event streams. A common approach for it is to adopt windowing mechanisms that select event data from a stream for subsequent analysis. However, the size of these windows denotes…
Retrieval-augmented generation (RAG) improves the reliability of large language model (LLM) answers by integrating external knowledge. However, RAG increases the end-to-end inference time since looking for relevant documents from large…
There has been increased interest in data search as a means to find relevant datasets or data points in data lakes and repositories. Although approaches have been proposed to support spatial dataset search and data point search, they…
Even though several publicly accessible pharmacovigilance databases are available, extracting data from them is a technically challenging process. Existing tools typically focus on a single database. We present SurVigilance, an open-source…
Recently, Foursquare released a global dataset with more than 100 million points of interest (POIs), each representing a real-world business on its platform. However, many entries lack complete metadata such as addresses or categories, and…
Entity resolution plays a significant role in enterprise systems where data integrity must be rigorously maintained. Traditional methods often struggle with handling noisy data or semantic understanding, while modern methods suffer from…
The billboard advertisement has emerged as an effective out-of-home advertisement technique where the objective is to choose a limited number of slots to play some advertisement content (e.g., animation, video, etc.) with the hope that the…
Microservices architectures are an integral part of modern software development. Their adoption brings significant changes to database management. Instead of relying on a single database, a microservices architecture is typically composed…
Finding optimal join orders is among the most crucial steps to be performed by query optimisers. Though extensively studied in data management research, the problem remains far from solved: While query optimisers rely on exhaustive search…
Retrieval-augmented generation (RAG) has emerged as one of the most prominent applications of vector databases. By integrating documents retrieved from a database into the prompt of a large language model (LLM), RAG enables more reliable…
Numerous multi- or high-dimensional indexes with distinct advantages have been proposed on various platforms to meet application requirements. To achieve higher-performance queries, most indexes employ enhancement methods, including…