Computer Science
Modern table formats such as Apache Iceberg compute and store metadata, including commit timestamps, record counts, and column-level statistics such as null counts and value bounds, at write time as part of file writing. These statistics serve query…
The decidability of periodicity for infinite words generated by morphisms is a classical result in combinatorics on words, obtained in the 1980s by Harju, Linna and Pansiot. In this paper, we are interested in this question in the abelian setting. Two words are…
LLM-driven social bots can generate fluent, human-like text, reducing the discriminative advantage of content-based detection alone. However, coordinated campaigns still leave relational patterns -- interactions, behavioral similarity,…
Geo-distributed OLTP databases are widely deployed across cloud regions, yet current evaluation practices do not cover the challenges that geo-distribution introduces. Existing benchmarks assume stable network conditions; they lack explicit settings for data…
Given a social network represented as a graph where the nodes are the users and the edges represent the social relations, and a positive integer k, how to select k nodes to maximize the influence in the network remains an active area of…
Text-to-Visualization (Text-to-Vis) translates natural language queries into visualization query languages, enabling non-expert users to perform data analysis. However, most existing methods follow a one-shot paradigm that requires users to…
Given a connected graph $G$ and a terminal set $R \subseteq V(G)$, the minimum Steiner tree problem (ST) asks for a tree that spans all of $R$ with at most $r$ vertices from $V(G) \setminus R$, for some integer $r \geq 0$. A \emph{split…
Social media platforms have become a major vector for the large-scale dissemination of misinformation and conspiracy content, posing significant risks to public trust, health, and societal stability. While prior work has primarily focused…
As server CPUs scale to dozens and now hundreds of cores per socket, parallel query engines must rethink how they redistribute data between threads. Partitioned operators such as hash joins and aggregations require frequent data…
In cloud data platforms, developers often encounter performance regressions that occur in specific tenant datasets. However, due to confidentiality constraints, they cannot access the original data, which makes it difficult to reproduce…
Oracle Exadata consolidates thousands of tenant databases onto shared storage infrastructure deployed at hundreds of customer sites worldwide. Oracle Multitenant architecture enables this extreme density, with thousands of tenant databases…
Approximate k-Nearest Neighbor (AKNN) search is widely used in vector databases. When vectors carry additional attributes (e.g., labels or numerical values), filtered AKNN search retrieves the nearest vectors to a query vector under…
Data transformation correctness is a fundamental challenge in data engineering: how can we verify that pipelines produce correct results before executing on production data? Existing practice relies on iterative testing over materialized…
Workload traces from cloud data warehouse providers reveal that standard benchmarks such as TPC-H and TPC-DS fail to capture key characteristics of real-world workloads, including query repetition and string-heavy queries. In this paper, we…
The application of program transformation and algebraic methods to the development of efficient combinatorial optimization (CO) algorithms relies on an exhaustive combinatorial generator for the problem specification, followed by the fusion…
Large-scale disasters, such as pandemics and climate-related events, place extraordinary pressure on healthcare providers due to extreme demand surges. Managing these surges is essential to sustaining healthcare resilience. Although…
A fundamental step in knowledge discovery is statistically assessing data mining results. In network analysis, such evaluation compares the outcome of a given procedure with the outcomes obtained from randomized versions of the observed…
Large Language Model (LLM) serving systems must balance task performance against monetary cost. Two prominent optimization techniques have emerged independently: LLM routing, which directs each query to the most cost-effective model in a…
The ubiquity of social platforms has reshaped the way information, behaviors, and advertisements diffuse across networks, with influence propagation often initiated by a small set of ``seed'' users. While much of the literature emphasizes…
Financial and economic research often relies on structured supply-chain disclosures and commercial databases. In China, supplier--customer disclosure is typically limited to major partners of listed firms, leaving unlisted firms and…