数据库
The torrential influx of floating-point data from domains like IoT and HPC necessitates high-performance lossless compression to mitigate storage costs while preserving absolute data fidelity. Leveraging GPU parallelism for this task…
In this paper, we present a vision for a new generation of multimodal streaming systems that embed MLLMs as first-class operators, enabling real-time query processing across multiple modalities. Achieving this is non-trivial: while recent…
Vector data trading is essential for cross-domain learning with vector databases, yet it remains largely unexplored. We study this problem under online learning, where sellers face uncertain retrieval costs and buyers provide stochastic…
Query optimization has been studied using machine learning, reinforcement learning, and, more recently, graph-based convolutional networks. Ontology, as a structured, information-rich knowledge representation, can provide context,…
Enterprises often maintain multiple databases for storing critical business data in siloed systems, resulting in inefficiencies and challenges with data interoperability. A key to overcoming these challenges lies in integrating disparate…
We introduce MemoriesDB, a unified data architecture designed to avoid decoherence across time, meaning, and relation in long-term computational memory. Each memory is a time-semantic-relational entity-a structure that simultaneously…
LSM-trees are featured by out-of-place updates, where key deletion is handled by inserting a tombstone to mark its staleness instead of removing it in place. This defers actual removal to compactions with greatly reduced overhead. However,…
Recent research has demonstrated the complementary nature of camera-based and inertial data for modeling human gestures, activities, and sentiment. Yet, despite its growing importance for environmental sensing as well as the advance of…
Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques -- such as deduplication and…
The development of practical query languages for graph databases runs well ahead of the underlying theory. The ISO committee in charge of database query languages is currently developing a new standard called Graph Query Language (GQL) as…
Neural embedding models are extensively employed in the table union search problem, which aims to find semantically compatible tables that can be merged with a given query table. In particular, multi-vector models, which represent a table…
Configuration tuning is critical for database performance. Although recent advancements in database tuning have shown promising results in throughput and latency improvement, challenges remain. First, the vast knob space makes direct…
The detection of sequential patterns in data is a basic functionality of modern data processing systems for complex event processing (CEP), OLAP, and retrieval-augmented generation (RAG). In practice, pattern matching is challenging, since…
Large Language Model-based (LLM-based) Text-to-SQL methods have achieved important progress in generating SQL queries for real-world applications. When confronted with table content-aware questions in real-world scenarios, ambiguous data…
Unstructured data, in the form of text, images, video, and audio, is produced at exponentially higher rates. In tandem, machine learning (ML) methods have become increasingly powerful at analyzing unstructured data. Modern ML methods can…
Data provenance has numerous applications in the context of data preparation pipelines. It can be used for debugging faulty pipelines, interpreting results, verifying fairness, and identifying data quality issues, which may affect the…
Traditional ETL and ELT design patterns struggle to meet modern requirements of scalability, governance, and real-time data processing. Hybrid approaches such as ETLT (Extract-Transform-Load-Transform) and ELTL (Extract-Load-Transform-Load)…
Despite several works that succeed in generating synthetic data with differential privacy (DP) guarantees, they are inadequate for generating high-quality synthetic data when the input data has missing values. In this work, we formalize the…
Unstructured data is pervasive, but analytical queries demand structured representations, creating a significant extraction challenge. Existing methods like RAG lack schema awareness and struggle with cross-document alignment, leading to…
Data lakes enable easy maintenance of heterogeneous data in its native form. While this flexibility can accelerate data ingestion, it shifts the complexity of data preparation and query processing to data discovery tasks. One such task is…