Related papers: Tupleware: Redefining Modern Analytics

CoVault: A Secure Analytics Platform

Analytics on personal data, such as individuals' mobility, financial, and health data, can be of significant benefit to society. Such data is already collected by smartphones, apps and services today, but liberal societies have so far…

Cryptography and Security · Computer Science 2024-01-23 Roberta De Viti, Isaac Sheff, Noemi Glaeser, Baltasar Dinis, Rodrigo Rodrigues, Bobby Bhattacharjee, Anwar Hithnawi, Deepak Garg, Peter Druschel

Topology-based Clusterwise Regression for User Segmentation and Demand Forecasting

Topological Data Analysis (TDA) is a recent approach to analyze data sets from the perspective of their topological structure. Its use for time series data has been limited. In this work, a system developed for a leading provider of cloud…

Machine Learning · Computer Science 2020-09-09 Rodrigo Rivera-Castro, Aleksandr Pletnev, Polina Pilyugina, Grecia Diaz, Ivan Nazarov, Wanyi Zhu, Evgeny Burnaev

Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning

The development of cluster computing frameworks has allowed practitioners to scale out various statistical estimation and machine learning algorithms with minimal programming effort. This is especially true for machine learning problems…

Machine Learning · Statistics 2019-06-24 Robin Vogel, Aurélien Bellet, Stephan Clémençon, Ons Jelassi, Guillaume Papa

PageRank Pipeline Benchmark: Proposal for a Holistic System Benchmark for Big-Data Platforms

The rise of big data systems has created a need for benchmarks to measure and compare the capabilities of these systems. Big data benchmarks present unique scalability challenges. The supercomputing community has wrestled with these…

Performance · Computer Science 2016-12-13 Patrick Dreher, Chansup Byun, Chris Hill, Vijay Gadepally, Bradley Kuszmaul, Jeremy Kepner

Terabyte-Scale Analytics in the Blink of an Eye

For the past two decades, the DB community has devoted substantial research to take advantage of cheap clusters of machines for distributed data analytics -- we believe that we are at the beginning of a paradigm shift. The scaling laws and…

Databases · Computer Science 2025-08-05 Bowen Wu, Wei Cui, Carlo Curino, Matteo Interlandi, Rathijit Sen

A Survey on Data Processing Methods and Cloud Computation

As new technologies move to the fore, our understanding of the world may seem to have shrunk in comparison, for despite new developments in research, much of it is reduced or rather, abstracted for marketability. Thus, the purpose of this…

Computers and Society · Computer Science 2017-01-24 Katherine Hughes

NScale: Neighborhood-centric Large-Scale Graph Analytics in the Cloud

There is an increasing interest in executing complex analyses over large graphs, many of which require processing a large number of multi-hop neighborhoods or subgraphs. Examples include ego network analysis, motif counting, personalized…

Databases · Computer Science 2015-10-01 Abdul Quamar, Amol Deshpande, Jimmy Lin

MultiTab: A Scalable Foundation for Multitask Learning on Tabular Data

Tabular data is the most abundant data type in the world, powering systems in finance, healthcare, e-commerce, and beyond. As tabular datasets grow and span multiple related targets, there is an increasing need to exploit shared task…

Machine Learning · Computer Science 2025-11-14 Dimitrios Sinodinos, Jack Yi Wei, Narges Armanfard

Exploration of TPUs for AI Applications

Tensor Processing Units (TPUs) are specialized hardware accelerators for deep learning developed by Google. This paper aims to explore TPUs in cloud and edge computing, focusing on their applications in AI. We provide an overview of TPUs,…

Hardware Architecture · Computer Science 2023-11-15 Diego Sanmartín Carrión, Vera Prohaska

Templating Shuffles

Cloud data centers are evolving fast. At the same time, today's large-scale data analytics applications require non-trivial performance tuning that is often specific to the applications, workloads, and data center infrastructure. We propose…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-11 Qizhen Zhang, Jiacheng Wu, Ang Chen, Vincent Liu, Boon Thau Loo

A Rapid Review of Clustering Algorithms

Clustering algorithms aim to organize data into groups or clusters based on the inherent patterns and similarities within the data. They play an important role in today's life, such as in marketing and e-commerce, healthcare, data…

Machine Learning · Computer Science 2024-01-17 Hui Yin, Amir Aryani, Stephen Petrie, Aishwarya Nambissan, Aland Astudillo, Shengyuan Cao

Create Benchmarks for Data Lakes

Data lakes have emerged as a flexible and scalable solution for storing and analyzing large volumes of heterogeneous data, including structured, semi-structured, and unstructured formats. Despite their growing adoption in both industry and…

Databases · Computer Science 2026-01-28 Yi Lyu, Pei-Chieh Lo, Natan Lidukhover

Petuum: A New Platform for Distributed Machine Learning on Big Data

What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization…

Machine Learning · Statistics 2015-05-18 Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, Yaoliang Yu

Analytics-as-a-Service in a Multi-Cloud Environment through Semantically enabled Hierarchical Data Processing

A large number of cloud middleware platforms and tools are deployed to support a variety of Internet of Things (IoT) data analytics tasks. It is a common practice that such cloud platforms are only used by their owners to achieve their…

Networking and Internet Architecture · Computer Science 2016-06-28 Prem Prakash Jayaraman, Charith Perera, Dimitrios Georgakopoulos, Schahram Dustdar, Dhavalkumar Thakker, Rajiv Ranjan

DAMEWARE: A web cyberinfrastructure for astrophysical data mining

Astronomy is undergoing a methodological revolution triggered by an unprecedented wealth of complex and accurate data. The new panchromatic, synoptic sky surveys require advanced tools for discovering patterns and trends hidden…

Instrumentation and Methods for Astrophysics · Physics 2015-06-22 Massimo Brescia, Stefano Cavuoti, Giuseppe Longo, Alfonso Nocella, Mauro Garofalo, Francesco Manna, Francesco Esposito, Giovanni Albano, Marisa Guglielmo, Giovanni D'Angelo, Alessandro Di Guido, George S. Djorgovski, Ciro Donalek, Ashish A. Mahabal, Matthew J. Graham, Michelangelo Fiore, Raffaele D'Abrusco

Towards observability of scientific applications

As software systems increase in complexity, conventional monitoring methods struggle to provide a comprehensive overview or identify performance issues, often missing unexpected problems. Observability, however, offers a holistic approach,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-29 Bartosz Balis, Konrad Czerepak, Albert Kuzma, Jan Meizner, Lukasz Wronski

Mathematical Frameworks for Pricing in the Cloud: Revenue, Fairness, and Resource Allocations

As more and more users begin to use the cloud for their computing needs, datacenter operators are increasingly pressed to effectively allocate their resources among these client users. Yet while much work has been done in this area,…

Computers and Society · Computer Science 2012-12-11 Carlee Joe-Wong, Soumya Sen

AlphaClean: Automatic Generation of Data Cleaning Pipelines

The analyst effort in data cleaning is gradually shifting away from the design of hand-written scripts to building and tuning complex pipelines of automated data cleaning libraries. Hyper-parameter tuning for data cleaning is very different…

Databases · Computer Science 2019-05-08 Sanjay Krishnan, Eugene Wu

A Glimpse of the Matrix (Extended Version): Scalability Issues of a New Message-Oriented Data Synchronization Middleware

Matrix is a new message-oriented data synchronization middleware, used as a federated platform for near real-time decentralized applications. It features a novel approach for inter-server communication based on synchronizing message history…

Networking and Internet Architecture · Computer Science 2019-12-02 Florian Jacob, Jan Grashöfer, Hannes Hartenstein

Distributed Log Analysis on the Cloud Using MapReduce

In this paper we describe our work on designing a web-based, distributed data analysis system built on the popular MapReduce framework and deployed on a small cloud, developed specifically for analyzing web server logs. The log analysis system…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-13 Galip Aydin, Ibrahim Riza Hallac