Related papers: Oseba: Optimization for Selective Bulk Analysis in…

Design and Performance Evaluation of an Optimized Disk Scheduling Algorithm (ODSA)

Management of disk scheduling is a very important aspect of operating system. Performance of the disk scheduling completely depends on how efficient is the scheduling algorithm to allocate services to the request in a better manner. Many…

Operating Systems · Computer Science 2014-03-04 Sourav Kumar Bhoi, Sanjaya Kumar Panda, Imran Hossain Faruk

Orthogonal Subsampling for Big Data Linear Regression

The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, that extracts useful information from datasets is a crucial step in big data analysis. We propose an orthogonal…

Methodology · Statistics 2021-06-01 Lin Wang, Jake Elmstedt, Weng Kee Wong, Hongquan Xu

Slow Kill for Big Data Learning

Big-data applications often involve a vast number of observations and features, creating new challenges for variable selection and parameter estimation. This paper presents a novel technique called "slow kill," which utilizes nonconvex…

Machine Learning · Statistics 2023-05-04 Yiyuan She, Jianhui Shen, Adrian Barbu

Subdata selection for big data regression: an improved approach

In the big data era researchers face a series of problems. Even standard approaches/methodologies, like linear regression, can be difficult or problematic with huge volumes of data. Traditional approaches for regression in big datasets may…

Methodology · Statistics 2024-11-13 Vasilis Chasiotis, Dimitris Karlis

Computationally and Memory-Efficient Robust Predictive Analytics Using Big Data

In the current data-intensive era, big data has become a significant asset for Artificial Intelligence (AI), serving as a foundation for developing data-driven models and providing insight into various unknown fields. This study navigates…

Machine Learning · Computer Science 2024-07-04 Daniel Menges, Adil Rasheed

Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers

Data processing frameworks such as Apache Beam and Apache Spark are used for a wide range of applications, from logs analysis to data preparation for DNN training. It is thus unsurprising that there has been a large amount of work on…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-07 Ubaid Ullah Hafeez, Martin Maas, Mustafa Uysal, Richard McDougall

Scaling LLM Inference with Optimized Sample Compute Allocation

Sampling is a basic operation in many inference-time algorithms of large language models (LLMs). To scale up inference efficiently with a limited compute, it is crucial to find an optimal allocation for sample compute budgets: Which…

Computation and Language · Computer Science 2024-10-31 Kexun Zhang, Shang Zhou, Danqing Wang, William Yang Wang, Lei Li

A Survey on Spark Ecosystem for Big Data Processing

With the explosive increase of big data in industry and academic fields, it is necessary to apply large-scale data processing systems to analysis Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-17 Shanjiang Tang, Bingsheng He, Ce Yu, Yusen Li, Kun Li

Online Rack Placement in Large-Scale Data Centers: Online Sampling Optimization and Deployment

This paper optimizes the configuration of large-scale data centers toward cost-effective, reliable and sustainable cloud supply chains. The problem involves placing incoming racks of servers within a data center to maximize demand coverage…

Optimization and Control · Mathematics 2026-01-19 Saumil Baxi, Kayla Cummings, Alexandre Jacquillat, Sean Lo, Rob McDonald, Konstantina Mellou, Ishai Menache, Marco Molinaro

Data-Oblivious External-Memory Algorithms for the Compaction, Selection, and Sorting of Outsourced Data

We present data-oblivious algorithms in the external-memory model for compaction, selection, and sorting. Motivation for such problems comes from clients who use outsourced data storage services and wish to mask their data access patterns.…

Data Structures and Algorithms · Computer Science 2011-03-29 Michael T. Goodrich

From Big Data to Fast Data: Towards High-Quality Datasets for Machine Learning Applications from Closed-Loop Data Collection

The increasing capabilities of machine learning models, such as vision-language and multimodal language models, are placing growing demands on data in automotive systems engineering, making the quality and relevance of collected data…

Systems and Control · Electrical Eng. & Systems 2026-04-01 Philipp Reis, Jacqueline Henle, Stefan Otten, Eric Sax

Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing

Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-06 Jonathan Will, Lauritz Thamsen, Jonathan Bader, Dominik Scheinert, Odej Kao

Towards Interactive, Adaptive and Result-aware Big Data Analytics

As data volumes grow across applications, analytics of large amounts of data is becoming increasingly important. Big data processing frameworks such as Apache Hadoop, Apache AsterixDB, and Apache Spark have been built to meet this demand. A…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-12-15 Avinash Kumar

Memory Enriched Big Bang Big Crunch Optimization Algorithm for Data Clustering

Cluster analysis plays an important role in decision making process for many knowledge-based systems. There exist a wide variety of different approaches for clustering applications including the heuristic techniques, probabilistic models,…

Artificial Intelligence · Computer Science 2017-03-09 Kayvan Bijari, Hadi Zare, Hadi Veisi, Hossein Bobarshad

Optimizing I/O for Big Array Analytics

Big array analytics is becoming indispensable in answering important scientific and business questions. Most analysis tasks consist of multiple steps, each making one or multiple passes over the arrays to be analyzed and generating…

Databases · Computer Science 2012-04-30 Yi Zhang, Jun Yang

Data Source Selection for Information Integration in Big Data Era

In Big data era, information integration often requires abundant data extracted from massive data sources. Due to a large number of data sources, data source selection plays a crucial role in information integration, since it is costly and…

Databases · Computer Science 2016-11-01 Yiming Lin, Hongzhi Wang, Jianzhong Li, Hong Gao

A Framework of Sparse Online Learning and Its Applications

The amount of data in our society has been exploding in the era of big data today. In this paper, we address several open challenges of big data stream classification, including high volume, high velocity, high dimensionality, high…

Machine Learning · Computer Science 2015-07-28 Dayong Wang, Pengcheng Wu, Peilin Zhao, Steven C. H. Hoi

Fast Optimization of Weighted Sparse Decision Trees for use in Optimal Treatment Regimes and Optimal Policy Design

Sparse decision trees are one of the most common forms of interpretable models. While recent advances have produced algorithms that fully optimize sparse decision trees for prediction, that work does not address policy design, because the…

Machine Learning · Computer Science 2022-10-27 Ali Behrouz, Mathias Lecuyer, Cynthia Rudin, Margo Seltzer

INGESTBASE: A Declarative Data Ingestion System

Big data applications have fast arriving data that must be quickly ingested. At the same time, they have specific needs to preprocess and transform the data before it could be put to use. The current practice is to do these preparatory…

Databases · Computer Science 2017-01-24 Alekh Jindal, Jorge-Arnulfo Quiane-Ruiz, Samuel Madden

Robust, scalable and fast bootstrap method for analyzing large scale data

In this paper we address the problem of performing statistical inference for large scale data sets i.e., Big Data. The volume and dimensionality of the data may be so high that it cannot be processed or stored in a single computing node. We…

Methodology · Statistics 2016-04-20 Shahab Basiri, Esa Ollila, Visa Koivunen