数据库
With massive texts on social media, users and analysts often rely on topic modeling techniques to quickly extract key themes and gain insights. Traditional topic modeling techniques, such as Latent Dirichlet Allocation (LDA), provide…
The performance of database management systems (DBMS) is traditionally evaluated using benchmarks that focus on workloads with (almost) fixed record lengths. However, some real-world workloads in key/value stores, document databases, and…
RNA-KG is a recently developed knowledge graph that integrates the interactions involving coding and non-coding RNA molecules extracted from public data sources. It can be used to support the classification of new molecules, identify new…
In the era of generative AI, ensuring the privacy of music data presents unique challenges: unlike static artworks such as images, music data is inherently temporal and multimodal, and it is sampled, transformed, and remixed at an…
AI-augmented data workflows introduce complex governance challenges, as both human and model-driven processes generate, transform, and consume data artifacts. These workflows blend heterogeneous tools, dynamic execution patterns, and opaque…
The development, integration, and maintenance of geospatial databases rely heavily on efficient and accurate matching procedures of Geospatial Entity Resolution (ER). While resolution of points-of-interest (POIs) has been widely addressed,…
Panel proposal for an open forum to discuss and debate the future of database research in the context of industry, other research communities, and AI. Includes summaries of past panels, positions from panelists, as well as positions from a…
Cloud gaming has gained popularity as it provides high-quality gaming experiences on thin hardware, such as phones and tablets. Transmitting gameplay frames at high resolutions and ultra-low latency is the key to guaranteeing players'…
Floating-point data is widely used across various domains. Depending on the required precision, each floating-point value can occupy several bytes. Lossless storage of this information is crucial due to its critical accuracy, as seen in…
With ever-increasing volume and heterogeneity of data, advent of new specialized compute engines, and demand for complex use cases, large-scale data systems require a performant catalog system that can satisfy diverse needs. We argue that…
The rapid growth of online retail and e-commerce has made effective and efficient Vehicle Routing Problem (VRP) solutions essential. To meet rising demand, companies are adding more depots, which changes the VRP problem to a complex…
In this paper, we propose Data-Aware Socratic Guidance (DASG), a dialogue-based query enhancement framework that embeds \linebreak interactive clarification as a first-class operator within database systems to resolve ambiguity in natural…
Existing unstructured data analytics systems rely on experts to write code and manage complex analysis workflows, making them both expensive and time-consuming. To address these challenges, we introduce AgenticData, an innovative agentic…
As large language models (LLMs) demonstrate increasingly powerful reasoning and orchestration capabilities, LLM-based agents are rapidly proliferating for complex data-related tasks. Despite this progress, the current design of how LLMs…
We introduce Raqlet, a source-to-source compilation framework that addresses the fragmentation of recursive querying engines spanning relational (recursive SQL), graph (Cypher, GQL), and deductive (Datalog) systems. Recent standards such as…
Entity resolution (ER) remains a significant challenge in data management, especially when dealing with large datasets. This paper introduces MERAI (Massive Entity Resolution using AI), a robust and efficient pipeline designed to address…
Modern large-scale services such as search engines, messaging platforms, and serverless functions, rely on key-value (KV) stores to maintain high performance at scale. When such services are deployed in constrained memory environments, they…
This paper proposes Oze, a concurrency control protocol that handles heterogeneous workloads, including long-running update transactions. Oze explores a large scheduling space using a multi-version serialization graph to reduce false…
Indexes can significantly improve search performance in relational databases. However, if the query workload changes frequently or new data updates occur continuously, it may not be worthwhile to build a conventional index upfront for query…
Modern data analytic workloads increasingly require handling multiple data models simultaneously. Two primary approaches meet this need: polyglot persistence and multi-model database systems. Polyglot persistence employs a coordinator…