Related papers: Query Complexity Based Optimal Processing of Raw D…
Modern cloud data warehouses store data in micro-partitions and rely on metadata (e.g., zonemaps) for efficient data pruning during query processing. Maintaining data clustering in a large-scale table is crucial for effective data pruning.…
Scientific experiments, simulations, and modern applications generate large amounts of data. Data is stored in raw format to avoid the high loading time of traditional database management systems. Researchers have proposed many techniques…
Traditional databases are not equipped with the adequate functionality to handle the volume and variety of "Big Data". Strict schema definition and data loading are prerequisites even for the most primitive query session. Raw data…
Scientific experiments and modern applications are generating large amounts of data every day. Most organizations utilize In-house servers or Cloud resources to manage application data and workload. The traditional database management…
Modern cloud-based data analytics systems must efficiently process petabytes of data residing on cloud storage. A key optimization technique in state-of-the-art systems like Snowflake is partition pruning - skipping chunks of data that do…
Large-scale datasets in the form of knowledge graphs are often used in numerous domains, today. A knowledge graphs size often exceeds the capacity of a single computer system, especially if the graph must be stored in main memory. To…
Many big-data clusters store data in large partitions that support access at a coarse, partition-level granularity. As a result, approximate query processing via row-level sampling is inefficient, often requiring reads of many partitions.…
Machine learning systems have been extensively used as auxiliary tools in domains that require critical decision-making, such as healthcare and criminal justice. The explainability of decisions is crucial for users to develop trust on these…
In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work, for example scanning and processing the same subset of data. Instead of optimizing jobs independently, which may result in…
Deep learning has made remarkable progress recently, largely due to the availability of large, well-labeled datasets. However, the training on such datasets elevates costs and computational demands. To address this, various techniques like…
Formulating efficient SQL queries requires several cycles of tuning and execution, particularly for inexperienced users. We examine methods that can accelerate and improve this interaction by providing insights about SQL queries prior to…
This paper introduces the Partition Tree Weighting technique, an efficient meta-algorithm for piecewise stationary sources. The technique works by performing Bayesian model averaging over a large class of possible partitions of the data…
Query processing over big data is ubiquitous in modern clouds, where the system takes care of picking both the physical query execution plans and the resources needed to run those plans, using a cost-based query optimizer. A good cost…
Answering complex logical queries on incomplete knowledge graphs is a challenging task, and has been widely studied. Embedding-based methods require training on complex queries, and cannot generalize well to out-of-distribution query…
The emerging citation-based QA systems are gaining more attention especially in generative AI search applications. The importance of extracted knowledge provided to these systems is vital from both accuracy (completeness of information) and…
Answering complex questions over knowledge bases (KB-QA) faces huge input data with billions of facts, involving millions of entities and thousands of predicates. For efficiency, QA systems first reduce the answer search space by…
Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within…
The modern day semantic applications store data as Resource Description Framework (RDF) data.Due to Proliferation of RDF Data, the efficient management of huge RDF data has become essential. A number of approaches pertaining to both…
Deep neural networks, while achieving remarkable success across diverse tasks, demand significant resources, including computation, GPU memory, bandwidth, storage, and energy. Network quantization, as a standard compression and acceleration…
Quantization-aware training (QAT) is a representative model compression method to reduce redundancy in weights and activations. However, most existing QAT methods require end-to-end training on the entire dataset, which suffers from long…