数据库
Approximate nearest neighbor search (ANNS) at billion scale is fundamentally an out-of-core problem: vectors and indexes live on SSD, so performance is dominated by I/O rather than compute. Under skewed semantic embeddings, existing…
Column Type Annotation (CTA) is a fundamental step towards enabling schema alignment and semantic understanding of tabular data. Existing encoder-only language models achieve high accuracy when fine-tuned on labeled columns, but their…
Cardinality estimation is a key component of database query optimization. Recent studies have demonstrated that learned cardinality estimation techniques can surpass traditional methods in accuracy. However, a significant barrier to their…
Text-to-SQL systems allow non-SQL experts to interact with relational databases using natural language. However, their tendency to generate executable SQL for ambiguous, out-of-scope, or unanswerable queries introduces a hidden risk, as…
The use of deep learning for database optimization has gained significant traction, offering improvements in indexing, cardinality estimation, and query optimization. However, acquiring high-quality training data remains a significant…
Text2SQL, the task of generating SQL queries from natural language text, is a critical challenge in data engineering. Recently, Large Language Models (LLMs) have demonstrated superior performance for this task due to their advanced…
Multivariate time series alignment is critical for ensuring coherent analysis across variables, but missing values and timestamp inconsistencies make this task highly challenging. Existing approaches often rely on prior imputation, which…
Structured generation for LLM tool use highlights the value of compact DSL intermediate representations (IRs) that can be emitted directly and parsed deterministically. This paper introduces axial grammar: linear token sequences that…
A hypergraph is a generalization of a graph, in which a hyperedge can connect multiple vertices, modeling complex relationships involving multiple vertices simultaneously. Hypergraph pattern matching, which is to find all isomorphic…
Incorporating utility into targeted pattern mining can address the practical limitations of traditional frequency-based approaches. However, utility-based methods often suffer from generating a large number of long and complicated…
In a quantitative sequential database, numerous efficient algorithms have been developed for high-utility sequential pattern mining (HUSPM). HUSPM establishes a relationship between frequency and significance in the real world and reflects…
The era of GPU-powered data analytics has arrived. In this paper, we argue that recent advances in hardware (e.g., larger GPU memory, faster interconnect and IO, and declining cost) and software (e.g., composable data systems and mature…
Ensuring the FAIRness (Findable, Accessible, Interoperable, Reusable) of data and metadata is an important goal in both research and industry. Knowledge graphs and ontologies have been central in achieving this goal, with interoperability…
Web applications underpin much of modern digital life, yet building scalable and consistent cloud applications remains difficult, requiring expertise across cloud computing, distributed systems, databases, and software engineering. These…
Analyzing flow of objects or data at different granularities of space and time can unveil interesting insights or trends. For example, transportation companies, by aggregating passenger travel data (e.g., counting passengers traveling from…
Subset sampling (also known as Poisson sampling), where the decision to include any specific element in the sample is made independently of all others, is a fundamental primitive in data analytics, enabling efficient approximation by…
We present ModelTables, a benchmark of tables in Model Lakes that captures the structured semantics of performance and configuration tables often overlooked by text only retrieval. The corpus is built from Hugging Face model cards, GitHub…
Most modern Text2SQL systems prompt large language models (LLMs) with entire schemas -- mostly column information -- alongside the user's question. While effective on small databases, this approach fails on real-world schemas that exceed…
Data sharing in large consortia, such as research collaborations or industry partnerships, requires addressing both organizational and technical challenges. A common platform is essential to promote collaboration, facilitate exchange of…
A data product is created with the intention of solving a specific problem, addressing a specific business usecase or meeting a particular need, going beyond just serving data as a raw asset. Data products enable end users to gain greater…