Related papers: REX: Recursive, Delta-Based Data-Centric Computati…

Helix: Holistic Optimization for Accelerating Iterative Machine Learning

Machine learning workflow development is a process of trial-and-error: developers iterate on workflows by testing out small modifications until the desired accuracy is achieved. Unfortunately, existing machine learning systems focus…

Databases · Computer Science 2018-12-17 Doris Xin, Stephen Macke, Litian Ma, Jialin Liu, Shuchen Song, Aditya Parameswaran

Scaling-Up In-Memory Datalog Processing: Observations and Techniques

Recursive query processing has experienced a recent resurgence, as a result of its use in many modern application domains, including data integration, graph analytics, security, program analysis, networking and decision making. Due to the…

Databases · Computer Science 2018-12-11 Zhiwei Fan, Jianqiao Zhu, Zuyu Zhang, Aws Albarghouthi, Paraschos Koutris, Jignesh Patel

Declarative Recursive Computation on an RDBMS, or, Why You Should Use a Database For Distributed Machine Learning

A number of popular systems, most notably Google's TensorFlow, have been implemented from the ground up to support machine learning tasks. We consider how to make a very small set of changes to a modern relational database management system…

Databases · Computer Science 2019-04-26 Dimitrije Jankov, Shangyu Luo, Binhang Yuan, Zhuhua Cai, Jia Zou, Chris Jermaine, Zekai J. Gao

Iterative MapReduce for Large Scale Machine Learning

Large datasets ("Big Data") are becoming ubiquitous because the potential value in deriving insights from data, across a wide range of business and scientific applications, is increasingly recognized. In particular, machine learning - one…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-03-15 Joshua Rosen, Neoklis Polyzotis, Vinayak Borkar, Yingyi Bu, Michael J. Carey, Markus Weimer, Tyson Condie, Raghu Ramakrishnan

Spinning Fast Iterative Data Flows

Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk…

Databases · Computer Science 2012-08-02 Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, Volker Markl

Hyperdimensional Hashing: A Robust and Efficient Dynamic Hash Table

Most cloud services and distributed applications rely on hashing algorithms that allow dynamic scaling of a robust and efficient hash table. Examples include AWS, Google Cloud and BitTorrent. Consistent and rendezvous hashing are algorithms…

Data Structures and Algorithms · Computer Science 2022-05-17 Mike Heddes, Igor Nunes, Tony Givargis, Alexandru Nicolau, Alex Veidenbaum

Modularis: Modular Relational Analytics over Heterogeneous Distributed Platforms

The enormous quantity of data produced every day together with advances in data analytics has led to a proliferation of data management and analysis systems. Typically, these systems are built around highly specialized monolithic operators…

Databases · Computer Science 2021-09-30 Dimitrios Koutsoukos, Ingo Müller, Renato Marroquín, Ana Klimovic, Gustavo Alonso

Medusa: An Efficient Cloud Fault-Tolerant MapReduce

Applications such as web search and social networking have been moving from centralized to decentralized cloud architectures to improve their scalability. MapReduce, a programming framework for processing large amounts of data using…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-24 Pedro A. R. S. Costa, Xiao Bai, Fernando M. V. Ramos, Miguel Correia

Towards Interactive, Adaptive and Result-aware Big Data Analytics

As data volumes grow across applications, analytics of large amounts of data is becoming increasingly important. Big data processing frameworks such as Apache Hadoop, Apache AsterixDB, and Apache Spark have been built to meet this demand. A…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-15 Avinash Kumar

A Case for A Collaborative Query Management System

Over the past 40 years, database management systems (DBMSs) have evolved to provide a sophisticated variety of data management capabilities. At the same time, tools for managing queries over the data have remained relatively primitive. One…

Databases · Computer Science 2009-09-15 Nodira Khoussainova, Magda Balazinska, Wolfgang Gatterbauer, YongChul Kwon, Dan Suciu

Reliable Data Storage in Distributed Hash Tables

Distributed Hash Tables offer a resilient lookup service for unstable distributed environments. Resilient data storage, however, requires additional data replication and maintenance algorithms. These algorithms can have an impact on both…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Matthew Leslie

Helix: Accelerating Human-in-the-loop Machine Learning

Data application developers and data scientists spend an inordinate amount of time iterating on machine learning (ML) workflows -- by modifying the data pre-processing, model training, and post-processing steps -- via trial-and-error to…

Machine Learning · Computer Science 2018-08-06 Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, Aditya Parameswaran

RecTen: A Recursive Hierarchical Low Rank Tensor Factorization Method to Discover Hierarchical Patterns in Multi-modal Data

How can we expand the tensor decomposition to reveal a hierarchical structure of the multi-modal data in a self-adaptive way? Current tensor decomposition provides only a single layer of clusters. We argue that with the abundance of…

Information Retrieval · Computer Science 2020-11-17 Risul Islam, Md Omar Faruk Rokon, Evangelos E. Papalexakis, Michalis Faloutsos

DRESS: Dynamic RESource-reservation Scheme for Congested Data-intensive Computing Platforms

In the past few years, we have seen an increasing number of businesses driven by big data analytics, such as Amazon recommendations and Google Advertisements. At the back-end side, the businesses are powered by big data…

Performance · Computer Science 2021-10-26 Ying Mao, Victoria Green, Jiayin Wang, Haoyi Xiong, Zhishan Guo

A Dynamic Data Middleware Cache for Rapidly-growing Scientific Repositories

Modern scientific repositories are growing rapidly in size. Scientists are increasingly interested in viewing the latest data as part of query results. Current scientific middleware cache systems, however, assume repositories are static.…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-09-21 Tanu Malik, Xiaodan Wang, Philip Little, Amitabh Chaudhary, Ani Thakar

D-Rex: Heterogeneity-Aware Reliability Framework and Adaptive Algorithms for Distributed Storage

The exponential growth of data necessitates distributed storage models, such as peer-to-peer systems and data federations. While distributed storage can reduce costs and increase reliability, the heterogeneity in storage capacity, I/O…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-04 Maxime Gonthier, Dante D. Sanchez-Gallegos, Haochen Pan, Bogdan Nicolae, Sicheng Zhou, Hai Duc Nguyen, Valerie Hayot-Sasson, J. Gregory Pauloski, Jesus Carretero, Kyle Chard, Ian Foster

DReX: An Explainable Deep Learning-based Multimodal Recommendation Framework

Multimodal recommender systems leverage diverse data sources, such as user interactions, content features, and contextual information, to address challenges like cold-start and data sparsity. However, existing methods often suffer from one…

Information Retrieval · Computer Science 2026-02-24 Adamya Shyam, Venkateswara Rao Kagita, Bharti Rana, Vikas Kumar

Analyzing Large-Scale, Distributed and Uncertain Data

The exponential growth of data in current times and the demand to gain information and knowledge from the data present new challenges for database researchers. Known database systems and algorithms are no longer capable of effectively…

Databases · Computer Science 2017-12-06 Yaron Gonen

DynaHash: Efficient Data Rebalancing in Apache AsterixDB (Extended Version)

Parallel shared-nothing data management systems have been widely used to exploit a cluster of machines for efficient and scalable data processing. When a cluster needs to be dynamically scaled in or out, data must be efficiently rebalanced.…

Databases · Computer Science 2021-05-25 Chen Luo, Michael J. Carey

Towards CXL Resilience to CPU Failures

Compute Express Link (CXL) 3.0 and beyond allows the compute nodes of a cluster to share data with hardware cache coherence and at the granularity of a cache line. This enables shared-memory semantics for distributed computing, but…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-10 Antonis Psistakis, Burak Ocalan, Chloe Alverti, Fabien Chaix, Ramnatthan Alagappan, Josep Torrellas